tclustIC: Performs cluster analysis by calling `tclustfsda` for different numbers of groups `k` and restriction factors `c`

Description

Computes the values of BIC (MIXMIX), ICL (MIXCLA) or CLA (CLACLA) for different values of k (number of groups) and different values of c (restriction factor), for a prespecified level of trimming. The last two letters in the function name stand for 'Information Criterion'. If the Parallel Computing Toolbox is installed, parfor is used to compute tclust for the different values of c. To minimize randomness, for a given k the same subsets are used for each value of c.

Usage

tclustIC(x, kk = 1:5, cc = c(1, 2, 4, 8, 16, 32, 64, 128), alpha = 0,
  whichIC = c("ALL", "MIXMIX", "MIXCLA", "CLACLA"), nsamp,
  refsteps = 15, reftol = 1e-14, equalweights = FALSE, msg = TRUE,
  nocheck = FALSE, plot = FALSE, startv1 = 1,
  restrtype = c("eigen", "deter"), UnitsSameGroup, numpool, cleanpool,
  trace = FALSE, ...)

Arguments

x

An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables.

Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations.

kk

An integer vector specifying the numbers of mixture components (clusters) for which the information criteria are to be calculated. By default kk=1:5.

cc

A vector specifying the values of the restriction factor which have to be considered. By default cc=c(1, 2, 4, 8, 16, 32, 64, 128).
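As a rough illustration of the size of the search, the default grids from the Usage section can be enumerated in base R. This is only a sketch: `grid` is a name introduced here for illustration and is not part of tclustIC.

```r
# Default grids taken from the Usage section.
kk <- 1:5
cc <- c(1, 2, 4, 8, 16, 32, 64, 128)

# The default cc grid consists of the powers of two from 2^0 to 2^7.
identical(cc, 2^(0:7))

# Every (k, c) pair for which a criterion value is computed.
grid <- expand.grid(k = kk, c = cc)
nrow(grid)   # length(kk) * length(cc)
```

Widening either grid multiplies the number of fits, so it may be worth starting from a coarse cc grid.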

alpha

Global trimming level. A scalar between 0 and 0.5 or an integer specifying the number of observations which have to be trimmed. If alpha=0 all observations are considered. By default alpha=0.

In more detail, if 0 < alpha < 1 clustering is based on h = fix(n * (1-alpha)) observations (fix() truncates toward zero); otherwise, if alpha is an integer greater than 1, clustering is based on h = n - floor(alpha) observations.
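The rule for h can be sketched in base R as follows. The helper `n_retained` is hypothetical (it is not part of the package), and treating alpha = 1 as a count rather than a proportion is an assumption made here, since the text only covers integers greater than 1.

```r
# Hypothetical helper: number of observations kept after trimming.
n_retained <- function(n, alpha) {
  if (alpha > 0 && alpha < 1) {
    trunc(n * (1 - alpha))   # trunc() rounds toward zero, like fix() in the text
  } else if (alpha >= 1) {
    n - floor(alpha)         # alpha read as a number of observations (assumption for alpha = 1)
  } else {
    n                        # alpha = 0: no trimming
  }
}

n_retained(272, 0.1)   # proportion of trimmed observations
n_retained(272, 20)    # number of trimmed observations
n_retained(272, 0)     # no trimming
```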

whichIC

A character value which specifies which information criteria must be computed for each k (number of groups) and each value of the restriction factor c. Possible values for whichIC are:

"MIXMIX": a mixture model is fitted and the mixture likelihood is used to compute the information criterion. This option corresponds to the use of the Bayesian Information Criterion (BIC). In the output just the matrix MIXMIX is given.
"MIXCLA": a mixture model is fitted but to compute the information criterion the classification likelihood is used. This option corresponds to the use of the Integrated Complete Likelihood (ICL). In the output just the matrix MIXCLA is given.
"CLACLA": everything is based on the classification likelihood. This information criterion will be called CLA. In the output just the matrix CLACLA is given.
"ALL": both classification and mixture likelihood are used. In this case all three information criteria CLA, ICL and BIC are computed. In the output all three matrices MIXMIX, MIXCLA and CLACLA are given.
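The difference between the two likelihoods can be illustrated with a small base R sketch. It uses a univariate two-component Gaussian mixture with assumed (not estimated) parameters, ignores trimming, and leaves out the penalty terms of the criteria; it shows the general distinction only, not the computation performed by tclustIC.

```r
set.seed(1)
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 4))

# Assumed parameters of a two-component univariate mixture.
prop <- c(0.5, 0.5)
mu   <- c(0, 4)
sdev <- c(1, 1)

# n x 2 matrix of weighted component densities prop[j] * phi_j(x[i]).
dens <- sapply(1:2, function(j) prop[j] * dnorm(x, mu[j], sdev[j]))

# Mixture log-likelihood: each unit contributes the sum over components.
loglik_mix <- sum(log(rowSums(dens)))

# Classification log-likelihood: each unit contributes only its best component.
loglik_cla <- sum(log(apply(dens, 1, max)))

loglik_cla <= loglik_mix
```

Because every term in the classification version is the largest summand of the corresponding mixture term, the classification log-likelihood can never exceed the mixture one.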

nsamp

If a scalar, it contains the number of subsamples which will be extracted. If nsamp = 0 all subsets will be extracted. Remark: if the number of all possible subsets is smaller than 300, the default is to extract all subsets, otherwise just 300. If nsamp is a matrix, its rows contain the indexes of the subsets which have to be extracted; in this case nsamp can be conveniently generated by the function subsets(). nsamp can have k columns or k * (p + 1) columns. If nsamp has k columns, the k initial centroids in iteration i are given by X[nsamp[i, ], ] and the covariance matrices are equal to the identity.

If nsamp has k * (p + 1) columns, the initial centroids and covariance matrices in iteration i are computed as follows:

X1 <- X[nsamp[i, ], ]
colMeans(X1[1:(p + 1), ]) contains the initial centroid for group 1
cov(X1[1:(p + 1), ]) contains the initial cov matrix for group 1
colMeans(X1[(p + 2):(2*p + 2), ]) contains the initial centroid for group 2
cov(X1[(p + 2):(2*p + 2), ]) contains the initial cov matrix for group 2
...
colMeans(X1[((k-1)*(p+1)+1):(k*(p+1)), ]) contains the initial centroid for group k
cov(X1[((k-1)*(p+1)+1):(k*(p+1)), ]) contains the initial cov matrix for group k.

REMARK: If nsamp is not a scalar, the option startv1 given below is ignored. More precisely, if nsamp has k columns it is treated as startv1 = 0, whereas if nsamp has k*(p+1) columns it is treated as startv1 = 1.
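The k * (p + 1) case can be sketched in base R on simulated data. Here `nsamp_i` stands for one row of a user-supplied nsamp matrix and `init` is a name introduced for illustration; the sketch assumes that consecutive blocks of p + 1 indexes belong to groups 1, ..., k.

```r
set.seed(42)
n <- 30; p <- 2; k <- 3
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# One row of a hypothetical nsamp matrix: k * (p + 1) distinct row indexes.
nsamp_i <- sample(n, k * (p + 1))
X1 <- X[nsamp_i, , drop = FALSE]

# Group j uses rows ((j - 1) * (p + 1) + 1):(j * (p + 1)) of X1.
init <- lapply(seq_len(k), function(j) {
  rows <- ((j - 1) * (p + 1) + 1):(j * (p + 1))
  list(center = colMeans(X1[rows, , drop = FALSE]),
       cov    = cov(X1[rows, , drop = FALSE]))
})

init[[1]]$center   # initial centroid for group 1
init[[k]]$cov      # initial covariance matrix for group k
```

With p + 1 observations per group, each sample covariance matrix is the smallest one that can have full rank, which is why the blocks have exactly that size.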

refsteps

Number of refining iterations in each subsample. Default is refsteps=15. refsteps = 0 means "raw-subsampling" without iterations.

reftol

Tolerance of the refining steps. The default value is 1e-14.

equalweights

A logical specifying whether cluster weights in the concentration and assignment steps shall be considered. If equalweights=TRUE we are (ideally) assuming equally sized groups, else if equalweights=FALSE (default) we allow for different group weights. Please check in the given references which functions are maximized in both cases.

msg

Controls whether messages are displayed on the screen. If msg=TRUE (default) messages are displayed on the screen. If msg=2, detailed messages are displayed, for example the information at iteration level.

nocheck

Check input arguments. If nocheck=TRUE no check is performed on matrix X. The default is nocheck=FALSE.

plot

If plot=TRUE, a plot of the BIC (MIXMIX), ICL (MIXCLA) and CLA (CLACLA) curves is shown on the screen. The plots which are shown depend on the input option whichIC.

startv1

How to initialize centroids and covariance matrices. Scalar. If startv1=1 then initial centroids and covariance matrices are based on (p+1) randomly chosen observations, else each centroid is initialized taking a random row of the input data matrix and covariance matrices are initialized with identity matrices. The default value is startv1=1.

Remark 1: in order to start from a solution which lies in the required parameter space, eigenvalue restrictions are immediately applied.

Remark 2: option startv1 is used only if nsamp is a scalar (for more details, see the help associated with nsamp).

restrtype

Type of restriction to be applied on the cluster scatter matrices. Valid values are 'eigen' (default), or 'deter'. "eigen" implies restriction on the eigenvalues while "deter" implies restriction on the determinants.
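The two restriction types can be contrasted with a short base R check. The scatter matrices in `S` and the restriction factor `cfac` are hypothetical values, and the sketch assumes the usual reading of these constraints: the ratio between the largest and smallest eigenvalue (or determinant) over all clusters must not exceed the restriction factor.

```r
# Two hypothetical cluster scatter matrices and a hypothetical restriction factor.
S <- list(diag(c(1, 4)), diag(c(2, 9)))
cfac <- 16

# "eigen": ratio between the largest and smallest eigenvalue over all clusters.
ev <- unlist(lapply(S, function(s) {
  eigen(s, symmetric = TRUE, only.values = TRUE)$values
}))
eigen_ratio <- max(ev) / min(ev)

# "deter": ratio between the largest and smallest determinant over all clusters.
dets <- sapply(S, det)
deter_ratio <- max(dets) / min(dets)

c(eigen = eigen_ratio <= cfac, deter = deter_ratio <= cfac)
```

The determinant version constrains only the relative volumes of the clusters, so a set of matrices can satisfy it while violating the eigenvalue version for the same restriction factor.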

UnitsSameGroup

List of the units which must (whenever possible) have a particular label. For example, UnitsSameGroup=c(20, 26) means that the group which contains unit 20 is always labelled with number 1. Similarly, the group which contains unit 26 is always labelled with number 2 (unless it is found that unit 26 already belongs to group 1). In general, the group which contains unit UnitsSameGroup[r], where r = 2, ..., length(kk)-1, is labelled with number r (unless it is found that unit UnitsSameGroup[r] has already been assigned to groups 1, 2, ..., r-1).
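The relabelling rule can be sketched in base R as a sequence of label swaps. The helper `relabel_by_units` is hypothetical and is not the package's implementation; treating label 0 as "trimmed" is also an assumption made for this sketch.

```r
# Hypothetical helper: swap labels so that the group containing units[r] gets label r.
relabel_by_units <- function(labels, units) {
  for (r in seq_along(units)) {
    g <- labels[units[r]]
    # Skip trimmed units (label 0, an assumption), units already assigned
    # to groups 1, ..., r-1, and units whose group is already labelled r.
    if (g == 0 || g %in% seq_len(r - 1) || g == r) next
    is_g <- labels == g
    is_r <- labels == r
    labels[is_g] <- r
    labels[is_r] <- g
  }
  labels
}

labels <- c(3, 3, 1, 1, 2, 2, 3)
relabel_by_units(labels, units = c(5, 1))
```

In this toy example the group containing unit 5 becomes group 1 and the group containing unit 1 becomes group 2; the partition itself is unchanged, only the labels are permuted.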

numpool

The number of parallel sessions to open. If numpool is not defined, then it is set equal to the number of physical cores in the computer.

cleanpool

Logical, indicating whether the open pool must be closed or not. It is useful to leave it open if there are subsequent parallel sessions to execute, so as to save the time required to open a new pool.

trace

Whether to print intermediate results. Default is trace=FALSE.

...

potential further arguments passed to lower level functions.

Value

An S3 object of class tclustic.object

References

Cerioli, A., Garcia-Escudero, L.A., Mayo-Iscar, A. and Riani M. (2017). Finding the Number of Groups in Model-Based Clustering via Constrained Likelihoods, Journal of Computational and Graphical Statistics, pp. 404-416, https://doi.org/10.1080/10618600.2017.1390469.

Examples

# NOT RUN {
 library(fsdaR)  # assumed here to be the package providing tclustIC and geyser2
 data(geyser2)
 out <- tclustIC(geyser2, whichIC="MIXMIX", plot=FALSE, alpha=0.1)
 out
 summary(out)
 
# }
