get_nbCluster_range: Control of number of components in Gaussian mixture modelling

Description

These functions implement the default values for the number of components tried in Gaussian mixture modelling (matching the nbCluster argument of Rmixmod::mixmodCluster()). get_nbCluster_range allows the user to reproduce the internal rules used by Infusion to determine this argument. seq_nbCluster is a wrapper to the function defined by the nbClu_pow_rule_fn global option of the package. Its default result is a sequence of integers determined by the number of rows of the data (see Infusion.options). get_nbCluster_range() uses additional criteria involving the number of columns of the data to determine the maximum number of clusters. This maximum is controlled by the function defined by the maxnbCluster global option of the package.

refine_nbCluster controls the default number of clusters of refine: it gets the range from seq_nbCluster and keeps only the maximum value of this range if this maximum is higher than the onlymax argument.
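
The selection step described above can be sketched in base R. This is only an illustration of the stated rule: keep_max_if_large is a hypothetical helper, not an Infusion function, and the candidate ranges passed to it are made up rather than produced by seq_nbCluster.

```r
# Hypothetical helper (not part of Infusion): mimics the rule described above.
# 'candidates' stands in for the range that seq_nbCluster would return.
keep_max_if_large <- function(candidates, onlymax = 7) {
  top <- max(candidates)
  # Keep only the maximum when it is higher than 'onlymax';
  # otherwise keep the whole candidate range.
  if (top > onlymax) top else candidates
}

keep_max_if_large(2:5)   # maximum is not higher than onlymax: range kept
keep_max_if_large(2:12)  # maximum is higher than onlymax: only the maximum kept
```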

Adventurous users can change the rules used by Infusion by changing the global options nbClu_pow_rule_fn and maxnbCluster (while conforming to the interfaces of these functions). Less ambitiously, they can for example use the maximum value of the result of get_nbCluster_range() as a single reasonable value for the nbCluster argument of infer_SLik_joint.

Usage

seq_nbCluster(nr, nc=(nr/500+2)/3)
refine_nbCluster(nr, nc, onlymax=7)
get_nbCluster_range(projdata, nr = nrow(projdata), nc = ncol(projdata), 
                    nbCluster = seq_nbCluster(nr, nc), verbose=TRUE)

Value

An integer vector

Arguments

projdata: data frame: the data to be clustered, which typically include parameters and projected summary statistics;
nr: integer: number of rows of the data to be clustered;
onlymax: integer: see Description;
nc: integer: number of columns of the data to be clustered, typically twice the number of estimated parameters (except if latent variables are included);
nbCluster: integer or vector of integers: candidate values, whose feasibility is checked by the function;
verbose: boolean: whether to print some information.

Details

The default upper value of the nbCluster range is controlled by two rules:

* The first rule sets the maximum number of clusters as a function of the number of samples \(n\) in the reference table. The default rule nr^(0.31-0.08/nc) is close to the value \(n^{0.3}\) recommended in the mixmod statistical documentation (Mixmod Team, 2016).

* This first rule is corrected by a second rule, which sets a maximum that also depends on the dimensions of projdata (the data used internally for clustering, whose dimensions typically differ from those of the user-level data, in particular if projections have been applied). This second rule is controlled by the maxnbCluster option.
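
To get a feel for the first rule, the formula quoted above can be evaluated directly in base R. This is a sketch of the arithmetic only: first_rule_bound is a hypothetical name, and neither the rounding and sequence construction performed inside Infusion nor the second rule (the maxnbCluster option) is reproduced here.

```r
# Illustration of the quoted formula only (hypothetical helper, not an
# Infusion function); the package's internal rounding is not reproduced.
first_rule_bound <- function(nr, nc) nr^(0.31 - 0.08 / nc)

nr <- 30000  # number of rows of the reference table
nc <- 40     # number of columns of the data to be clustered

first_rule_bound(nr, nc)  # bound given by the default rule
nr^0.3                    # the mixmod recommendation, for comparison
```

The exponent is 0.23 for nc = 1 and approaches 0.31 as nc grows, so for data with many columns the default bound is slightly above \(n^{0.3}\).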

For a large number of points, experience shows that the maximum value derived from these two rules is practically always selected by AIC. So, in practice, it is faster to perform clustering with this maximum number of clusters only, rather than to perform AIC-based selection among a range of numbers of clusters. This shortcut is implemented as the default for the nbCluster argument of refine.default, whose default value is specified by refine_nbCluster.

Examples

# Determination of number of clusters when attempting to estimate 
#   20 parameters from a reference table with 30000 rows:
seq_nbCluster(nr=30000L)
get_nbCluster_range(nr=30000L, nc=40L) # nc = *twice* the number of parameters
