
ClustMMDD (version 1.0.3)

em.cluster.R: Compute parameter estimates with the Expectation-Maximization algorithm.

Description

Compute an approximation of the maximum likelihood estimates of the model parameters using the Expectation-Maximization (EM) algorithm. A maximum a posteriori classification is then derived from the estimated parameters.
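
The idea behind this description can be illustrated with a minimal sketch of EM for a mixture of independent categorical variables in the haploid case (ploidy = 1). This is not the package's implementation: the function em_sketch, its arguments, the random initialisation and the small smoothing constant are all assumptions made for illustration.

```r
# Hypothetical helper, not part of ClustMMDD: EM for a mixture of
# independent categorical variables (ploidy = 1 only).
em_sketch <- function(x, K, n_iter = 50, seed = 1) {
  set.seed(seed)
  vars <- lapply(seq_len(ncol(x)), function(j) factor(x[, j]))
  n <- nrow(x)
  # Random initial membership probabilities; each row sums to 1.
  tik <- matrix(runif(n * K), n, K)
  tik <- tik / rowSums(tik)
  for (iter in seq_len(n_iter)) {
    # M-step: update mixing proportions and per-cluster level probabilities.
    pi_k <- colMeans(tik)
    prob <- lapply(vars, function(v) {
      ind <- outer(as.integer(v), seq_len(nlevels(v)), "==") * 1
      counts <- t(tik) %*% ind + 1e-8   # tiny smoothing (assumption) avoids 0/0
      p <- counts / rowSums(counts)     # K x (number of levels)
      colnames(p) <- levels(v)
      p
    })
    # E-step: posterior membership probabilities, computed on the log scale.
    logd <- matrix(log(pi_k), n, K, byrow = TRUE)
    for (j in seq_along(vars)) {
      logd <- logd + t(log(prob[[j]][, as.integer(vars[[j]]), drop = FALSE]))
    }
    m <- apply(logd, 1, max)
    lse <- m + log(rowSums(exp(logd - m)))  # log-sum-exp per observation
    tik <- exp(logd - lse)
    log_lik <- sum(lse)
  }
  # Maximum a posteriori classification: most probable cluster per row.
  list(pi_K = pi_k, prob = prob, Tik = tik, logLik = log_lik,
       mapClassif = max.col(tik, ties.method = "first"))
}

# Toy data (made up): 40 observations of 3 categorical variables.
toy <- rbind(matrix("a", 20, 3), matrix("b", 20, 3))
fit <- em_sketch(toy, K = 2)
table(fit$mapClassif)
```

The element names of the returned list mirror those in the Value section only for readability. EM converges to a local maximum that depends on the starting point, which is presumably why the package exposes options for several short runs (see EmOptions).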

Usage

em.cluster.R(xdata, K, S, ploidy = 1, emOptions = list(epsi = NULL,
  typeSmallEM = NULL, typeEM = NULL, nberSmallEM = NULL, nberIterations = NULL,
  nberMaxIterations = NULL, putThreshold = NULL), cte = 1)

Arguments

xdata
A matrix of strings with the number of columns equal to ploidy * (number of variables).
K
The number of clusters (or populations).
S
The subset of clustering variables in the form of a vector of logicals indicating the selected variables. $S$ gathers variables that are not identically distributed in at least two clusters.
ploidy
The number of unordered observations represented by a string in xdata. For example, for genotypic data from diploid individuals, $ploidy = 2$.
emOptions
A list of EM options (see EmOptions and setEmOptions).
cte
A double used as the value of $\lambda$ in the penalty function $\mathrm{pen}(K,S)=\lambda \cdot \dim(K,S)$, where $\dim(K,S)$ is the number of free parameters in the model defined by $(K,S)$.
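
As a rough illustration of the penalty controlled by cte, the sketch below counts free parameters under a common convention for latent class models with variable selection: K - 1 mixing proportions, K sets of level probabilities for each selected variable, and a single shared set for each non-selected variable. This convention, the helper name dim_sketch and the level counts are assumptions; the package's own count is returned in the dim element of the result.

```r
# Hypothetical helper, not part of ClustMMDD.
# n_levels[l] is the number of observed levels of variable l.
dim_sketch <- function(K, S, n_levels) {
  stopifnot(length(S) == length(n_levels))
  (K - 1) +                      # mixing proportions
    K * sum(n_levels[S] - 1) +   # selected variables: one distribution per cluster
    sum(n_levels[!S] - 1)        # other variables: one shared distribution
}

S <- c(rep(TRUE, 8), rep(FALSE, 2))
n_levels <- rep(4, 10)           # made-up level counts
d <- dim_sketch(K = 5, S = S, n_levels = n_levels)
lambda <- 1                      # plays the role of cte
pen <- lambda * d
pen
```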

Value

A list with the following components:

  • N :The size (number of lines) of the dataset.
  • K :The number of clusters (populations).
  • S :A vector of logicals indicating the selected variables for clustering.
  • dim :The number of free parameters.
  • pi_K :The vector of mixing proportions.
  • prob :A list of matrices, each matrix being the probabilities of a variable in different clusters.
  • logLik :The log-likelihood.
  • entropy :The entropy.
  • criteria :Criteria values c(BIC, AIC, ICL, CteDim).
  • Tik :A stochastic matrix giving the a posteriori membership probabilities.
  • mapClassif :Maximum a posteriori classification.
  • NbersLevels :The numbers of observed levels of the considered categorical variables.
  • levels :The observed levels.
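
To show how several of these components relate to one another, here is a small sketch using made-up numbers. The formulas follow widely used conventions (BIC = -2 logLik + dim log N, AIC = -2 logLik + 2 dim, ICL = BIC + 2 entropy); the sign and scaling used by the package may differ, so treat them as assumptions.

```r
# Made-up posterior membership probabilities for 3 observations, 2 clusters.
Tik <- matrix(c(0.9, 0.1,
                0.2, 0.8,
                0.5, 0.5), ncol = 2, byrow = TRUE)

# Maximum a posteriori classification: index of the largest entry per row
# (ties go to the first cluster here).
mapClassif <- max.col(Tik, ties.method = "first")

# Classification entropy; zero entries contribute nothing.
entropy <- -sum(ifelse(Tik > 0, Tik * log(Tik), 0))

# Made-up log-likelihood and parameter count.
logLik_value <- -12.3
dim_value <- 5
N <- nrow(Tik)

bic <- -2 * logLik_value + dim_value * log(N)
aic <- -2 * logLik_value + 2 * dim_value
icl <- bic + 2 * entropy
c(BIC = bic, AIC = aic, ICL = icl)
```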

References

  • Dominique Bontemps and Wilson Toussile (2013): Clustering and variable selection for categorical multivariate data. Electronic Journal of Statistics, Volume 7, 2344-2371. http://projecteuclid.org/euclid.ejs/1379596773
  • Wilson Toussile and Elisabeth Gassiat (2009): Variable selection in model-based clustering using multilocus genotype data. Adv Data Anal Classif, Vol 3, number 2, 109-134. http://link.springer.com/article/10.1007%2Fs11634-009-0043-x

See Also

dataR2C for transformation of a classic data frame, backward.explorer, selectK.R, dimJump.R, model.selection.R for both model selection and classification.

Examples

data(genotype1)
head(genotype1)
genotype2 = cutEachCol(genotype1[, -11], ploidy = 2)
head(genotype2)

# See the EM options
EmOptions() # Options can be set with setEmOptions()
par5 = em.cluster.R(genotype2, K = 5, S = c(rep(TRUE, 8), rep(FALSE, 2)), ploidy = 2)
slotNames(par5)
head(par5["membershipProba"])
par5["mixingProportions"]
par5
