
ClustMMDD (version 1.0.3)

ClustMMDD-package: ClustMMDD: Clustering by Mixture Models for Discrete Data.

Description

ClustMMDD stands for "Clustering by Mixture Models for Discrete Data". This package deals with the two-fold problem of variable selection and model-based unsupervised classification in discrete settings. Variable selection and classification are simultaneously solved via a model selection procedure using penalized criteria: Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Integrated Completed Log-likelihood (ICL) or a general criterion with penalty function to be data-driven calibrated.

Details

Package: ClustMMDD
Type: Package
Version: 1.0.1
Date: 2015-05-18
License: GPL (>= 2)

In this package, K and S denote respectively the number of clusters and the subset of variables that are relevant for clustering. We assume that a clustering variable has different probability distributions in at least two clusters, whereas a non-clustering variable has the same distribution in all clusters. We consider a general situation in which the data are described by $P$ random variables $X^l$, $l=1,\cdots,P$, where each variable $X^l$ is an unordered set $\left\{X^{l,1},\cdots,X^{l,ploidy}\right\}$ of $ploidy$ categorical variables. For each $l$, the random variables $X^{l,1},\cdots,X^{l,ploidy}$ take their values in the same set of levels. A typical example of such data comes from population genetics, where the genotype of a diploid individual at each locus consists of $ploidy = 2$ unordered alleles. A toy sketch of this data layout is given below.
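To make the layout concrete, here is a hypothetical toy example in R. The column names and allele coding are illustrative assumptions, not the package's actual format; see data(genotype2) in the Examples for the format ClustMMDD expects. With $P = 2$ loci and $ploidy = 2$, each locus contributes an unordered pair of alleles per individual:

# Hypothetical diploid data: 2 loci, 2 alleles per locus per individual.
# Names and coding are illustrative only (see data(genotype2) for the real format).
set.seed(1)
alleles <- c("A1", "A2", "A3")
toy <- data.frame(
  locus1.a = sample(alleles, 5, replace = TRUE),
  locus1.b = sample(alleles, 5, replace = TRUE),
  locus2.a = sample(alleles, 5, replace = TRUE),
  locus2.b = sample(alleles, 5, replace = TRUE)
)
toy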

The two-fold problem of clustering and variable selection is treated as a model selection problem. A collection of competing models associated with different values of $(K, S)$ is defined, and these models are compared using penalized criteria of the form $$crit\left(K,S\right)=\gamma_n\left(K,S\right)+pen\left(K,S\right),$$ where $\gamma_n\left(K,S\right)$ is the maximum log-likelihood and $pen\left(K,S\right)$ is the penalty function.
The penalty functions used in this package are the following, where $dim\left(K,S\right)$ is the dimension (number of free parameters) of the model defined by $\left(K,S\right)$:
  • Akaike Information Criterion (AIC): $$pen\left(K,S\right) = dim\left(K,S\right)$$
  • Bayesian Information Criterion (BIC): $$pen\left(K,S\right) = 0.5*\log(n)*dim\left(K,S\right)$$
  • Integrated Completed Log-likelihood (ICL): $$pen\left(K,S\right) = 0.5*\log(n)*dim\left(K,S\right)+entropy\left(K,S\right),$$ where $$entropy\left(K,S\right) = -\sum_{i=1}^{n}\sum_{k=1}^{K}\tau_{i,k}\log\left(\tau_{i,k}\right)$$ and $\tau_{i,k}=P\left(i\in\mathcal{C}_k\right)$ is the posterior probability that individual $i$ belongs to cluster $\mathcal{C}_k$.
  • More general penalty function: $$pen\left(K,S\right) = \alpha*\lambda*dim\left(K,S\right),$$ where $\lambda$ is a multiplicative parameter to be calibrated and $\alpha$ is a coefficient in $[1.5,2]$ to be given by the user.
For $\lambda$, we propose a data-driven calibration procedure based on the dimension jump version of the so-called "slope heuristics" (see Dominique Bontemps and Wilson Toussile (2013), http://projecteuclid.org/euclid.ejs/1379596773, and references therein). A minimal sketch of these penalty functions follows the list above.
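The helper functions below are a minimal sketch of the penalties just listed, assuming the model dimension, the sample size and the posterior membership probabilities are already available. They are illustrative assumptions, not part of the ClustMMDD API.

# Illustrative penalty functions (not part of ClustMMDD)
pen.aic <- function(dimKS) dimKS
pen.bic <- function(dimKS, n) 0.5 * log(n) * dimKS
# tau: an n-by-K matrix of posterior probabilities tau[i, k] = P(i in C_k),
# assumed strictly positive so that tau * log(tau) is well defined
pen.icl <- function(dimKS, n, tau) pen.bic(dimKS, n) - sum(tau * log(tau))
pen.general <- function(dimKS, lambda, alpha = 2) alpha * lambda * dimKS
# For example, a model with 25 free parameters fitted on n = 1000 observations:
pen.bic(25, 1000)  # 0.5 * log(1000) * 25, approximately 86.35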

References

  • Dominique Bontemps and Wilson Toussile (2013). Clustering and variable selection for categorical multivariate data. Electronic Journal of Statistics, Vol. 7, 2344-2371. http://projecteuclid.org/euclid.ejs/1379596773
  • Wilson Toussile and Elisabeth Gassiat (2009). Variable selection in model-based clustering using multilocus genotype data. Advances in Data Analysis and Classification, Vol. 3, No. 2, 109-134. http://link.springer.com/article/10.1007%2Fs11634-009-0043-x

See Also

The main functions include dimJump.R and model.selection.R, both illustrated in the examples below.

Examples

library(ClustMMDD)

# Example genotype data and a pre-computed table of explored models
data(genotype2)
head(genotype2)
data(genotype2_ExploredModels)
head(genotype2_ExploredModels)

# Calibration of the penalty function by dimension jump
outDimJump <- dimJump.R(genotype2_ExploredModels, N = 1000, h = 5, header = TRUE)
cte1 <- outDimJump[[1]][1]

# Model selection using the calibrated constant
outSelection <- model.selection.R(genotype2_ExploredModels, cte = cte1, header = TRUE)
outSelection
