These functions implement the default values for the number of components tried in Gaussian mixture modelling (matching the nbCluster argument of Rmixmod::mixmodCluster()). get_nbCluster_range allows the user to reproduce the internal rules used by Infusion to determine this argument. seq_nbCluster is a wrapper to the function defined by the nbClu_pow_rule_fn global option of the package. Its default result is a sequence of integers determined by the number of rows of the data (see Infusion.options). get_nbCluster_range() uses additional criteria involving the number of columns of the data to determine the maximum number of clusters. This maximum is controlled by the function defined by the maxnbCluster global option of the package.
refine_nbCluster controls the default number of clusters of refine: it gets the range from seq_nbCluster and keeps only the maximum value of this range if this maximum is higher than the onlymax argument.
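The onlymax rule described above can be sketched in base R as follows. This only illustrates the stated behaviour and is not the package's code: the helper name keep_max_if_large and the candidate ranges are hypothetical.

```r
# Hypothetical helper illustrating the 'onlymax' rule; not part of Infusion.
keep_max_if_large <- function(candidates, onlymax = 7) {
  top <- max(candidates)
  # Keep only the maximum when it exceeds 'onlymax', else keep the whole range.
  if (top > onlymax) top else candidates
}

small_range <- keep_max_if_large(2:5)   # maximum 5 is not above 7: range kept
large_range <- keep_max_if_large(2:12)  # maximum 12 is above 7: only 12 kept
print(small_range)
print(large_range)
```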
Adventurous users can change the rules used by Infusion by changing the global options nbClu_pow_rule_fn and maxnbCluster (while conforming to the interfaces of these functions). Less ambitiously, they can for example use the maximum value of the result of get_nbCluster_range() as a single reasonable value for the nbCluster argument of infer_SLik_joint.
seq_nbCluster(nr, nc=(nr/500+2)/3)
refine_nbCluster(nr, nc, onlymax=7)
get_nbCluster_range(projdata, nr = nrow(projdata), nc = ncol(projdata),
nbCluster = seq_nbCluster(nr, nc), verbose=TRUE)
An integer vector
projdata (data frame): the data to be clustered, which typically include parameters and projected summary statistics;
nr (integer): number of rows of the data to be clustered;
onlymax (integer): see Description;
nc (integer): number of columns of the data to be clustered, typically twice the number of estimated parameters (except if latent variables are included);
nbCluster (integer or vector of integers): candidate values, whose feasibility is checked by the function;
verbose (boolean): whether to print some information or not.
The default upper value of the nbCluster range is controlled by two rules:
* The first rule sets the maximum number of clusters as a function of the number of samples \(n\) in the reference table. The default rule nr^(0.31-0.08/nc) is close to the value \(n^{0.3}\) recommended in the mixmod statistical documentation (Mixmod Team, 2016).
* This first rule is corrected by a second rule that sets a maximum also dependent on the dimensions of projdata (the data used internally for clustering, whose dimensions typically differ from those of the user-level data, in particular if projections have been applied). This second rule is controlled by the maxnbCluster option.
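As a rough numerical illustration of the first rule, the base R sketch below evaluates the default expression nr^(0.31-0.08/nc) next to nr^0.3. The helper name pow_rule_upper is hypothetical, and any rounding or further truncation applied by Infusion (including the second rule) is not shown here.

```r
# Hypothetical helper evaluating the expression quoted in the text.
pow_rule_upper <- function(nr, nc) nr^(0.31 - 0.08 / nc)

nr <- 30000L  # rows of the reference table
nc <- 40L     # columns of the clustered data

upper_default <- pow_rule_upper(nr, nc)  # default power rule
upper_mixmod  <- nr^0.3                  # value recommended in the mixmod documentation

# The exponent 0.31 - 0.08/nc approaches 0.31 as nc grows,
# so the default rule gives slightly more than nr^0.3 when nc > 8.
print(c(default = upper_default, mixmod = upper_mixmod))
```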
For a large number of points, experience shows that the maximum value derived from these two rules is practically always selected by AIC. So, in practice, it is faster to perform clustering only with this maximum number of clusters than to perform AIC-based selection among a range of numbers of clusters. This rule is implemented as the default for the nbCluster argument of refine.default, through the default value specified by refine_nbCluster.
# Determination of number of clusters when attempting to estimate
# 20 parameters from a reference table with 30000 rows:
seq_nbCluster(nr=30000L)
get_nbCluster_range(nr=30000L, nc=40L) # nc = *twice* the number of parameters