tclustfsda for different numbers of groups k and restriction factors c
Computes the values of BIC (MIXMIX), ICL (MIXCLA) or CLA (CLACLA) for different values of k (number of groups) and different values of c (restriction factor), for a prespecified level of trimming (the last two letters in the name stand for 'Information Criterion'). If the Parallel Computing toolbox is installed, parfor is used to compute tclust for different values of c. In order to minimize randomness, given k, the same subsets are used for each value of c.
tclustIC(x, kk = 1:5, cc = c(1, 2, 4, 8, 16, 32, 64, 128), alpha = 0,
whichIC = c("ALL", "MIXMIX", "MIXCLA", "CLACLA"), nsamp,
refsteps = 15, reftol = 1e-14, equalweights = FALSE, msg = TRUE,
nocheck = FALSE, plot = FALSE, startv1 = 1,
restrtype = c("eigen", "deter"), UnitsSameGroup, numpool, cleanpool,
trace = FALSE, ...)
An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables.
Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations.
An integer vector specifying the numbers of mixture components (clusters) for which the BIC is to be calculated. By default kk=1:5.
A vector specifying the values of the restriction factor which have to be considered. By default cc=c(1, 2, 4, 8, 16, 32, 64, 128).
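As a rough illustration of the grid implied by the defaults of kk and cc, the (k, c) pairs can be enumerated as below. This is only a sketch and not part of the package; the object name grid is made up here, and it assumes one information criterion value is produced per pair.

```r
# Default search grid: every combination of k (groups) and c (restriction factor)
kk <- 1:5
cc <- 2^(0:7)                        # same values as c(1, 2, 4, 8, 16, 32, 64, 128)
grid <- expand.grid(k = kk, c = cc)  # hypothetical helper object, one row per pair
n_pairs <- nrow(grid)                # number of (k, c) combinations
print(n_pairs)
```

With the defaults this gives 5 x 8 = 40 combinations, so narrowing kk or cc is the most direct way to reduce computing time.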
Global trimming level. A scalar between 0 and 0.5 or an integer specifying the number of observations which have to be trimmed. If alpha=0 all observations are considered. By default alpha=0.
More in detail, if 0 < alpha < 1 clustering is based on h = fix(n * (1-alpha)) observations (fix rounds toward zero, like trunc() in R), else if alpha is an integer greater than 1 clustering is based on h = n - floor(alpha) observations.
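The rule for the number h of untrimmed observations can be sketched as follows. The helper name h_from_alpha is hypothetical and not part of the package; the case alpha = 1 is not covered by the description above, so this sketch rejects it.

```r
# Number of observations kept after trimming, following the rule stated above.
# h_from_alpha is an illustrative helper, not a package function.
h_from_alpha <- function(n, alpha) {
  if (alpha == 0) {
    n                          # no trimming
  } else if (alpha > 0 && alpha < 1) {
    trunc(n * (1 - alpha))     # alpha is a proportion; trunc() plays the role of fix
  } else if (alpha > 1) {
    n - floor(alpha)           # alpha is a number of observations
  } else {
    stop("alpha must be 0, a proportion in (0, 1), or an integer greater than 1")
  }
}

h_prop  <- h_from_alpha(272, 0.1)  # 10% trimming on 272 observations
h_count <- h_from_alpha(272, 10)   # trim exactly 10 observations
```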
A character value which specifies which information criteria must be computed for each k (number of groups) and each value of the restriction factor c. Possible values for whichIC are:
"MIXMIX": a mixture model is fitted and for computing the information criterion the mixture likelihood is used. This option corresponds to the use of the Bayesian Information criterion (BIC). In the output just the matrix MIXMIX is given.
"MIXCLA": a mixture model is fitted but to compute the information criterion the classification likelihood is used. This option corresponds to the use of the Integrated Complete Likelihood (ICL). In the output just the matrix MIXCLA is given.
"CLACLA": everything is based on the classification likelihood. This information criterion will be called CLA. In the output just the matrix CLACLA is given.
"ALL": both classification and mixture likelihood are used. In this case all three information criteria CLA, ICL and BIC are computed. In the output all three matrices MIXMIX, MIXCLA and CLACLA are given.
If a scalar, it contains the number of subsamples which will be extracted. If nsamp = 0 all subsets will be extracted. Remark: if the number of all possible subsets is less than 300 the default is to extract all subsets, otherwise just 300.
If nsamp is a matrix it contains in the rows the indexes of the subsets which have to be extracted. nsamp in this case can be conveniently generated by function subsets(). nsamp can have k columns or k * (p + 1) columns. If nsamp has k columns, the k initial centroids in each iteration i are given by X[nsamp[i, ], ] and the covariance matrices are equal to the identity.
If nsamp has k * (p + 1) columns, the initial centroids and covariance matrices in iteration i are computed as follows:
X1 <- X[nsamp[i, ], ]
colMeans(X1[1:(p + 1), ]) contains the initial centroid for group 1
cov(X1[1:(p + 1), ]) contains the initial cov matrix for group 1
colMeans(X1[(p + 2):(2*p + 2), ]) contains the initial centroid for group 2
cov(X1[(p + 2):(2*p + 2), ]) contains the initial cov matrix for group 2
...
colMeans(X1[((k-1)*(p+1)+1):(k*(p+1)), ]) contains the initial centroid for group k
cov(X1[((k-1)*(p+1)+1):(k*(p+1)), ]) contains the initial cov matrix for group k.
REMARK: If nsamp is not a scalar, the option startv1 given below is ignored. More precisely, if nsamp has k columns startv1 = 0, else if nsamp has k*(p+1) columns startv1 = 1.
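The k * (p + 1)-column layout of nsamp described above can be sketched on toy data as follows. Everything here is illustrative: the data are random, idx stands for a single row of nsamp, and init is a made-up name; the sketch only shows how consecutive blocks of p + 1 rows map to one group each.

```r
set.seed(1)
X <- matrix(rnorm(40), ncol = 2)   # toy data matrix: n = 20 observations, p = 2 variables
p <- ncol(X)
k <- 3

# One row of nsamp: k * (p + 1) distinct row indexes of X
idx <- sample(nrow(X), k * (p + 1))
X1 <- X[idx, , drop = FALSE]

# Block j (rows (j-1)*(p+1)+1 to j*(p+1) of X1) initializes group j
init <- lapply(seq_len(k), function(j) {
  rows <- ((j - 1) * (p + 1) + 1):(j * (p + 1))
  block <- X1[rows, , drop = FALSE]
  list(center = colMeans(block),   # initial centroid of group j
       cov = cov(block))           # initial covariance matrix of group j
})
```

Each block has p + 1 rows because that is the smallest number of observations for which a p x p sample covariance matrix can be non-singular.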
Number of refining iterations in each subsample. Default is refsteps=15. refsteps = 0 means "raw-subsampling" without iterations.
Tolerance of the refining steps. The default value is 1e-14.
A logical specifying whether cluster weights in the concentration and assignment steps shall be considered. If equalweights=TRUE we are (ideally) assuming equally sized groups, else if equalweights=FALSE (default) we allow for different group weights. Please check in the given references which functions are maximized in both cases.
Controls whether or not to display messages on the screen. If msg=TRUE (default) messages are displayed on the screen. If msg=2, detailed messages are displayed, for example the information at iteration level.
Check input arguments. If nocheck=TRUE no check is performed on matrix X. The default is nocheck=FALSE.
If plot=TRUE, plots of the BIC (MIXMIX), ICL (MIXCLA) and CLA (CLACLA) curves are shown on the screen. The plots which are shown depend on the input option whichIC.
How to initialize centroids and covariance matrices. Scalar. If startv1=1 then initial centroids and covariance matrices are based on (p+1) observations randomly chosen, else each centroid is initialized taking a random row of the input data matrix and covariance matrices are initialized with identity matrices. The default value is startv1=1.
Remark 1: in order to start with a routine which is in the required parameter space, eigenvalue restrictions are immediately applied.
Remark 2: option startv1 is used only if nsamp is a scalar (for more details see the help associated with nsamp).
Type of restriction to be applied on the cluster scatter matrices. Valid values are "eigen" (default) or "deter". "eigen" implies restriction on the eigenvalues while "deter" implies restriction on the determinants.
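The help text does not spell out how the restriction factor is measured. In the tclust literature, the "eigen" restriction bounds the ratio between the largest and the smallest eigenvalue of the group scatter matrices by c, and the "deter" restriction bounds the ratio between the largest and the smallest determinant. A minimal sketch under that assumption (the helper restriction_ratio is hypothetical and not part of the package):

```r
# Ratio that the restriction factor c is assumed to bound, for a list of
# group scatter matrices. Illustrative helper, not a package function.
restriction_ratio <- function(covs, restrtype = c("eigen", "deter")) {
  restrtype <- match.arg(restrtype)
  vals <- if (restrtype == "eigen") {
    # all eigenvalues of all group scatter matrices
    unlist(lapply(covs, function(S) {
      eigen(S, symmetric = TRUE, only.values = TRUE)$values
    }))
  } else {
    # one determinant per group scatter matrix
    vapply(covs, det, numeric(1))
  }
  max(vals) / min(vals)
}

# Two toy scatter matrices with eigenvalues (1, 4) and (2, 8)
covs <- list(diag(c(1, 4)), diag(c(2, 8)))
ratio_eigen <- restriction_ratio(covs, "eigen")
ratio_deter <- restriction_ratio(covs, "deter")
```

For these toy matrices the eigenvalue ratio is 8 / 1 = 8 and the determinant ratio is 16 / 4 = 4, so the same pair of groups can satisfy a given c under one restriction type and violate it under the other.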
List of the units which must (whenever possible) have a particular label. For example UnitsSameGroup=c(20, 26) means that the group which contains unit 20 is always labelled with number 1. Similarly, the group which contains unit 26 is always labelled with number 2 (unless it is found that unit 26 already belongs to group 1). In general, the group which contains unit UnitsSameGroup[r], where r = 2, ..., length(kk)-1, is labelled with number r (unless it is found that unit UnitsSameGroup[r] has already been assigned to groups 1, 2, ..., r-1).
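The labelling rule described for UnitsSameGroup can be sketched as a relabelling of an existing classification vector. This is an illustration of the rule only, not the package's implementation; the helper relabel_by_units is hypothetical, and it assumes that trimmed units carry label 0 and must keep it.

```r
# Relabel groups so that the group containing units[1] becomes 1, the group
# containing units[2] becomes 2 (unless already labelled), and so on.
# Illustrative helper; label 0 is assumed to mark trimmed units.
relabel_by_units <- function(labels, units) {
  ordered_old <- integer(0)            # old labels, in the order they receive new labels
  for (u in units) {
    old <- labels[u]
    if (old > 0 && !(old %in% ordered_old)) {
      ordered_old <- c(ordered_old, old)
    }
  }
  # remaining groups keep their relative order after the requested ones
  rest <- setdiff(sort(unique(labels[labels > 0])), ordered_old)
  ordered_old <- c(ordered_old, rest)
  new <- labels
  new[labels > 0] <- match(labels[labels > 0], ordered_old)
  new
}

labels <- c(3, 3, 1, 1, 2, 2, 0)             # toy classification, last unit trimmed
res_a <- relabel_by_units(labels, c(5, 1))   # unit 5's group -> 1, unit 1's group -> 2
res_b <- relabel_by_units(labels, c(5, 6))   # unit 6 is already in group 1, so it is skipped
```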
The number of parallel sessions to open. If numpool is not defined, then it is set equal to the number of physical cores in the computer.
Logical, indicating whether the open pool must be closed or not. It is useful to leave it open if there are subsequent parallel sessions to execute, so as to save the time required to open a new pool.
Whether to print intermediate results. Default is trace=FALSE.
potential further arguments passed to lower level functions.
An S3 object of class tclustic.object
Cerioli, A., Garcia-Escudero, L.A., Mayo-Iscar, A. and Riani, M. (2017). Finding the Number of Groups in Model-Based Clustering via Constrained Likelihoods, Journal of Computational and Graphical Statistics, pp. 404-416, https://doi.org/10.1080/10618600.2017.1390469.
# NOT RUN {
data(geyser2)
out <- tclustIC(geyser2, whichIC="MIXMIX", plot=FALSE, alpha=0.1)
out
summary(out)
# }