Usage
hddc(data, K = 1:10, model = c("AkjBkQkDk"), threshold = 0.2, criterion = "bic", com_dim = NULL, itermax = 200, eps = 0.001, algo = "EM", d_select = "Cattell", init = "kmeans", init.vector, show = TRUE, mini.nb = c(5, 10), scaling = FALSE, min.individuals = 2, noise.ctrl = 1e-08, mc.cores = 1, nb.rep = 1, keepAllRes = TRUE, kmeans.control = list(), d_max = 100, d)
Arguments
data
A matrix or a data frame of observations, assuming the rows are the observations and the columns the variables. Note that NAs are not allowed.
K
A vector of integers specifying the numbers of clusters for which the BIC and the parameters are to be calculated; the function keeps the parameters which maximise the criterion. Default is 1:10.
model
A character string vector, or an integer vector indicating the models to be used. The available models are: "AkjBkQkDk" (default), "AkBkQkDk", "ABkQkDk", "AkjBQkDk", "AkBQkDk", "ABQkDk", "AkjBkQkD", "AkBkQkD", "ABkQkD", "AkjBQkD", "AkBQkD", "ABQkD", "AjBQD", "ABQD". It is not case sensitive and integers can be used instead of names, see details for more information. Several models can be given; in that case, only the results of the one which maximizes the BIC criterion are kept. To run all models, use model="ALL".
threshold
A float strictly between 0 and 1. It is the threshold used in Cattell's scree-test.
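To make the role of threshold concrete, here is a minimal base-R sketch of one common form of Cattell's scree-test. It is an illustration under stated assumptions, not the package's internal code: the helper name cattell_dim is hypothetical, and the rule used (keep the last dimension whose eigenvalue gap is at least threshold times the largest gap) is assumed rather than taken from this page.

```r
# Hypothetical helper: pick an intrinsic dimension from eigenvalues
# using a scree-test rule (assumed form, not the package's own code).
cattell_dim <- function(eigenvalues, threshold = 0.2) {
  ev <- sort(eigenvalues, decreasing = TRUE)
  gaps <- abs(diff(ev))                        # drop between successive eigenvalues
  large <- which(gaps >= threshold * max(gaps))
  max(large)                                   # last dimension with a "large" drop
}

ev <- c(10, 8, 6, 0.5, 0.4, 0.3)
d_hat <- cattell_dim(ev, threshold = 0.2)
print(d_hat)
```

Under this rule, a smaller threshold treats more gaps as significant and therefore tends to select more dimensions.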
criterion
Either BIC or ICL. If several models are run, the best model is selected using the criterion defined by criterion.
com_dim
Used only for common dimensions models. The user can set the common dimension; if used, it must be an integer. Default is NULL.
itermax
The maximum number of iterations allowed. The default is 200.
eps
A positive double. It is the stopping criterion: the algorithm stops when the difference between two successive Log Likelihoods is lower than eps.
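The interplay between itermax and eps can be sketched with a toy loop. This is only an illustration of the stopping rule described above; the function iterate and the toy_loglik sequence are hypothetical stand-ins, not part of the package.

```r
# Hypothetical driver: stop when two successive log-likelihoods differ
# by less than eps, or when itermax iterations have been done.
iterate <- function(loglik_at, itermax = 200, eps = 0.001) {
  previous <- -Inf
  for (iter in seq_len(itermax)) {
    current <- loglik_at(iter)                 # log-likelihood after this iteration
    if (abs(current - previous) < eps) break   # converged
    previous <- current
  }
  iter                                         # number of iterations performed
}

# Toy increasing sequence standing in for a real log-likelihood.
toy_loglik <- function(i) -100 / i

n_loose <- iterate(toy_loglik, eps = 1)   # stops early on the eps rule
n_tight <- iterate(toy_loglik)            # hits the itermax cap with the defaults
```

With a loose eps the loop stops as soon as the improvement becomes small; with the default values this slowly converging toy sequence is cut off by itermax instead.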
algo
A character string indicating the algorithm to be used. The available algorithms are the Expectation-Maximisation ("EM"), the Classification E-M ("CEM") and the Stochastic E-M ("SEM"). The default algorithm is the "EM".
d_select
Either Cattell (default) or BIC. See details for more information. This parameter selects which method to use to select the intrinsic dimensions.
init
A character string or a vector of clusters. It is the way to initialize the E-M algorithm. There are five ways of initialization: kmeans (default), param, random, mini-em or vector. See details for more information. It can also be directly initialized with a vector containing the prior classes of the observations.
init.vector
A vector of integers or factors. It is a user-given initialization. It should be of the same length as the data (one entry per observation). Only used when init="vector".
show
Use show = FALSE to turn off the information that may be printed.
mini.nb
A vector of integers of length two. This parameter is used in the mini-em initialization. The first integer sets how many times the algorithm is repeated; the second sets the maximum number of iterations the algorithm will do each time. For example, if init="mini-em" and mini.nb=c(5,10), the algorithm will be launched 5 times, doing 10 iterations each time; finally the algorithm will begin with the initialization that maximizes the log-likelihood.
scaling
Logical: whether to scale the dataset (mean=0 and standard deviation=1 for each variable) or not. By default the data is not scaled.
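The effect of scaling can be previewed with base R's scale(). This sketch assumes that scaling = TRUE standardizes each variable in the same way as scale() with its default arguments; that equivalence is an assumption, not something stated on this page.

```r
# Toy data: two variables on very different scales.
x <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2)

# Center each column to mean 0 and rescale it to standard deviation 1.
x_scaled <- scale(x)

col_means <- colMeans(x_scaled)        # approximately 0 for every column
col_sds <- apply(x_scaled, 2, sd)      # approximately 1 for every column
```

Standardizing beforehand keeps variables with large numeric ranges from dominating the covariance structure.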
min.individuals
This parameter controls the minimum population of a class. If the population of a class becomes strictly lower than 'min.individuals' then the algorithm stops and gives the message: 'pop
noise.ctrl
This parameter prevents the 'noise' parameter b from taking too low a value. It guarantees that the dimension selection process does not select too many dimensions (which could lead to too low a value of the noise parameter b). When selecting the intrinsic dimensions using Cattell's scree-test or BIC, the function does not use the eigenvalues lower than noise.ctrl, so that the intrinsic dimensions selected cannot be higher than or equal to the order of these eigenvalues.
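The bound that noise.ctrl places on the intrinsic dimension can be illustrated in a few lines of base R. The eigenvalues below are made up and the variable names are hypothetical; the sketch only mirrors the rule described above.

```r
# Made-up eigenvalues, sorted in decreasing order.
ev <- c(5, 2, 1e-3, 1e-9, 1e-12)
noise_ctrl <- 1e-8

# Order (index) of the first eigenvalue lower than noise_ctrl.
first_small <- which(ev < noise_ctrl)[1]

# The selected dimension must stay strictly below that order.
d_upper <- first_small - 1
print(d_upper)
```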
mc.cores
Positive integer, default is 1. If mc.cores>1, then parallel computing is used, using mc.cores cores. Warning for Windows users only: the parallel computing can sometimes be slower than using one single core (due to how parLapply works).
nb.rep
A positive integer (default is 1). Each estimation (i.e. combination of (model, K, threshold)) is repeated nb.rep times and only the estimation with the highest log-likelihood is kept.
keepAllRes
Logical. Should the results of all runs be kept? If so, an argument all_results is created in the results. Default is TRUE.
kmeans.control
A list. The elements of this list should match the parameters of the kmeans initialization (see kmeans help for details). The parameters are iter.max, nstart and algorithm.
d_max
A positive integer. The maximum number of dimensions to be computed. Default is 100. It means that the intrinsic dimension of any cluster cannot be larger than d_max. It considerably speeds up the algorithm for datasets with a large number of variables (e.g. thousands).
d
DEPRECATED. This parameter is kept for backward compatibility. Please use the parameter d_select instead.
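As a closing illustration, the call below combines several of the arguments described above. It is a hedged sketch: it assumes that hddc is provided by the HDclassif package and that the package is installed, and it uses the built-in iris data purely as a convenient example dataset. The call is guarded so that the snippet does nothing when the package is unavailable.

```r
# Assumption: hddc comes from the HDclassif package.
if (requireNamespace("HDclassif", quietly = TRUE)) {
  set.seed(1)  # the initialization involves randomness
  fit <- HDclassif::hddc(
    iris[, 1:4],                          # numeric variables only, no NAs
    K = 2:4,                              # candidate numbers of clusters
    model = c("AkjBkQkDk", "AkBkQkDk"),   # two of the available models
    criterion = "bic",                    # used to pick the best (model, K)
    scaling = TRUE,                       # standardize the variables first
    show = FALSE                          # suppress printed information
  )
  print(class(fit))
}
```

Only argument names that appear in the usage line are used here; the structure of the returned object is not described in this section, so the sketch does not inspect it further.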