cmb.em: Conditional mixture modeling by EM algorithm

Description

Fits a conditional mixture model and performs model-based clustering with the EM (Expectation-Maximization) algorithm for a prespecified conditioning order of the variables. A variable selection procedure (forward, backward, or stepwise) can be run to obtain a parsimonious mixture model.

Usage

cmb.em(x, order = NULL, l, K, method = "stepwise", id0 = NULL, n.em = 200, em.iter = 5,
       EM.iter = 200, nk.min = NULL, max.spur = 5, tol = 1e-06, silent = FALSE,
       Parallel = FALSE, n.cores = 4)

Value

data: input dataset
model: estimated regression models for each cluster (K x p matrix)
id: vector of estimated membership (length n)
loglik: estimated log likelihood
BIC: Bayesian Information Criterion
Pi: vector of estimated mixing proportions (length K)
tau: matrix of estimated posterior probabilities (n x K)
beta: matrix of estimated regression parameters (K x (p + p(p-1)l/2))
s2: matrix of estimated variance (K x p)
order: applied conditioning order (length p)
n_pars: number of model parameters
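
As a rough illustration of how tau, id and Pi usually relate in a mixture model, hard memberships are typically obtained by assigning each observation to the component with the largest posterior probability, and mixing proportions by averaging the posterior probabilities. The sketch below uses a small made-up tau matrix; whether cmb.em derives id and Pi in exactly this way is an assumption.

```r
# Hypothetical posterior probability matrix (n = 4 observations, K = 3 components).
tau <- matrix(c(0.90, 0.05, 0.05,
                0.10, 0.70, 0.20,
                0.20, 0.30, 0.50,
                0.60, 0.30, 0.10),
              nrow = 4, byrow = TRUE)

# Maximum a posteriori assignment: index of the largest entry in each row.
id.map <- apply(tau, 1, which.max)

# Standard EM estimate of the mixing proportions: column means of tau.
Pi.hat <- colMeans(tau)
```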

Arguments

x: dataset matrix (n x p)
order: customized variables' conditioning order (length p)
l: order of polynomial regression model
K: number of clusters
method: variable selection method (options 'stepwise', 'forward', 'backward' and 'none')
id0: initial membership vector (length n)
n.em: number of short EM in an emEM procedure
em.iter: maximum number of iterations of short EM in an emEM procedure
EM.iter: maximum number of EM iterations
nk.min: spurious output control threshold (default (l x (p - 1) + 1) x 2)
max.spur: maximum number of reruns when spurious output is obtained
tol: tolerance level
silent: output control (TRUE/FALSE)
Parallel: parallel computing (TRUE/FALSE)
n.cores: number of cores in parallel computing
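
A minimal sketch of supplying a custom initialization through id0 instead of relying on emEM. It builds the membership vector with k-means from base R; the assumption that id0 expects integer labels in 1..K is not confirmed by this page, and the cmb.em call is guarded so the snippet still runs when the function is not loaded.

```r
# Build an initial membership vector with k-means (stats package, base R).
set.seed(1)
x <- as.matrix(iris[, -5])
K <- 3
id0 <- kmeans(x, centers = K, nstart = 10)$cluster  # integer labels, length n

# Assumption: id0 takes labels in 1..K. Call cmb.em only if it is available.
if (exists("cmb.em")) {
  obj0 <- cmb.em(x = x, l = 2, K = K, method = "none", id0 = id0, silent = TRUE)
}
```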

Details

In conditional mixture modeling, each component is modeled as a product of conditional distributions whose means are polynomial regression functions of the preceding variables in the conditioning order. The polynomial order l and the number of clusters K are specified by the user.

The model can be initialized either by passing a group membership vector to the argument id0, or by the emEM algorithm (the default). Two arguments control the emEM procedure: the number of short EM runs, n.em, and the maximum number of iterations of each short EM run, em.iter. By default, n.em = 200 and em.iter = 5.

The variable selection method can be specified as method = "stepwise", "forward", "backward", or "none", where method = "none" means that no variable selection is carried out. During the model fitting and variable selection phases, the EM algorithm is applied multiple times; EM.iter and tol are the stopping criteria for each EM run.

The spurious output control argument nk.min can be set by the user; its default is nk.min = (l x (p - 1) + 1) x 2. When spurious output is obtained, cmb.em is rerun, up to max.spur times.
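
One way to write the model described above is sketched below. It assumes that each conditional distribution is Gaussian and that the polynomial means contain no interaction terms; both assumptions are inferred from the dimensions of beta and s2 in the Value section rather than stated on this page. Variables are indexed in the conditioning order, and the symbols \phi, \mu_{kj}, \beta_{kjmr} and \sigma^2_{kj} are notation introduced here for illustration.

```latex
f(x) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p}
       \phi\!\left(x_j ;\, \mu_{kj}(x_1, \dots, x_{j-1}),\, \sigma^2_{kj}\right),
\qquad
\mu_{kj}(x_1, \dots, x_{j-1}) = \beta_{kj0}
       + \sum_{m=1}^{j-1} \sum_{r=1}^{l} \beta_{kjmr}\, x_m^{r}
```

Under this form, variable j contributes 1 + l(j - 1) regression coefficients per component, which sums to p + p(p-1)l/2 over j = 1, ..., p and matches the number of columns of beta.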

Notation: n - sample size, p - number of variables, l - order of polynomial regression model, K - number of mixture components.
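
The dimension and default formulas on this page can be checked with a few lines of arithmetic. The sketch below plugs in the values used in the iris example (p = 4, l = 2, K = 3); the variable names n.beta.cols and nk.min.default are made up for illustration.

```r
p <- 4  # number of variables (the four iris measurements)
l <- 2  # order of the polynomial regression model
K <- 3  # number of mixture components

# Number of columns of beta, from the Value section: p + p(p-1)l/2.
n.beta.cols <- p + p * (p - 1) * l / 2   # 16

# Default spurious output threshold, from Details: (l x (p - 1) + 1) x 2.
nk.min.default <- (l * (p - 1) + 1) * 2  # 14
```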

References

Biernacki C., Celeux G., Govaert G. (2003). Choosing Starting Values for the EM Algorithm for Getting the Highest Likelihood in Multivariate Gaussian Mixture Models. Computational Statistics and Data Analysis, 41(3-4), pp. 561-575.

Examples


set.seed(1)
K <- 3
l <- 2
x <- as.matrix(iris[,-5])
id.true <- iris[,5]
# \donttest{
# Run the EM algorithm to fit a conditional mixture model
obj <- cmb.em(x = x, order = c(1,3,2,4), l = l, K = K, method = "stepwise",
              silent = FALSE, Parallel = FALSE)
id.cmb <- obj$id
table(id.true, id.cmb)
obj$BIC
# }
