CNmixt: Fitting for the Parsimonious Mixtures of Contaminated Normal Distributions

Description

Fits, by using the expectation conditional-maximization (ECM) algorithm, parsimonious mixtures of multivariate contaminated normal distributions (with eigen-decomposed scale matrices) to the given data within a clustering paradigm (default) or classification paradigm. Can be run in parallel. Likelihood-based model selection criteria are used to select the parsimonious model and the number of groups.
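
For orientation, the density of a single contaminated normal component, in the form given by Punzo and McNicholas (2015), can be written as below. Here $\phi$ denotes the multivariate normal density, $\alpha_g$ is the proportion of good observations in group $g$, and $\eta_g$ is the contamination (inflation) parameter; $\pi_g$ is used here as a label for the mixture weights. These symbols correspond to the alphafix/alphamin and etafix/etamax arguments described below.

```latex
f(x; \mu_g, \Sigma_g, \alpha_g, \eta_g)
  = \alpha_g \, \phi(x; \mu_g, \Sigma_g)
  + (1 - \alpha_g) \, \phi(x; \mu_g, \eta_g \Sigma_g),
  \qquad \alpha_g \in (0, 1), \quad \eta_g > 1,
\qquad
p(x) = \sum_{g=1}^{G} \pi_g \, f(x; \mu_g, \Sigma_g, \alpha_g, \eta_g).
```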

Usage

CNmixt(X, G, model = NULL, initialization = "mixt",
   alphafix = NULL, alphamin = 0.5, etafix = NULL, etamax = 1000,
   seed = NULL, start.z = NULL, start.v = NULL, start = 0,
   ind.label = NULL, label = NULL, iter.max = 1000, threshold = 1.0e-03, 
   parallel = FALSE, eps = 1e-100)

Arguments

X

a matrix or data frame such that $n$ rows correspond to observations and $p$ columns correspond to variables. Note that this function currently only works with multivariate data ($p > 1$).

G

a vector containing the numbers of groups to be tried.

model

a vector indicating the model(s) to be fitted. Possible values are: "EII", "VII", "EEI", "VEI", "EVI", "VVI", "EEE", "VEE", "EVE",

initialization

initialization strategy for the ECM algorithm. It can be:

"mixt"(default): the initial ($n \times G$) soft classification matrix (of posterior probabilities of groups membership) arises from a preliminary run of mixtures of mul

alphafix

a vector of length $G$ with the proportion of good observations in each group. If length(alphafix) != G, then the first element is replicated $G$ times. Default value is NULL.

alphamin

a vector of length $G$ with the minimum proportion of good observations in each group. If length(alphamin) != G, then the first element is replicated $G$ times. Default value is 0.5.

etafix

a vector of length $G$ with the values of the contamination parameter to be considered in the estimation phase for each group. If length(etafix) != G, then the first element is replicated $G$ times. Default value is NULL.

etamax

a vector of length $G$ with the maximum value for the contamination parameter to be considered in the estimation phase for each group when etafix is NULL. If length(etamax) != G, then the first element is repli

seed

the seed for the random number generator, when random initializations are used; if NULL, current seed is not changed. Default value is NULL.

start.z

initial $n \times G$ matrix of either soft or hard classification. Default value is NULL.

start.v

initial $n \times G$ matrix of posterior probabilities of being a good observation in each group. Default value is an $n \times G$ matrix of ones.

start

when initialization = "mixt", the initialization used for the gpcm() function of the mixture package (see mixture::gpcm for details).

ind.label

vector of positions (rows) of the labeled observations.

label

vector, of the same dimension as ind.label, with the group of membership of the observations indicated by ind.label.

iter.max

maximum number of iterations in the ECM algorithm. Default value is 1000.

threshold

threshold for Aitken's acceleration procedure. Default value is 1.0e-03.

parallel

When TRUE, the package parallel is used for parallel computation. When several models are estimated, computational time is reduced. The number of cores to use

eps

an optional scalar. It sets the smallest value for the eigenvalues of the component scale matrices. Default value is 1e-100.
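
The recycling rule stated for alphafix, alphamin, etafix and etamax, and the shape expected for start.z, can be illustrated in base R. This is only a sketch of the documented behaviour, not the package's internal code; the helper names recycle_to_G and hard_z are hypothetical and do not exist in the package.

```r
# Hypothetical helper mimicking the documented rule:
# "if length(x) != G, the first element is replicated G times".
recycle_to_G <- function(x, G) {
  if (is.null(x)) return(NULL)
  if (length(x) != G) rep(x[1], G) else x
}

# Hypothetical helper: build an n x G hard classification matrix
# (exactly one 1 per row) from integer group labels in 1..G.
hard_z <- function(labels, G) {
  z <- matrix(0, nrow = length(labels), ncol = G)
  z[cbind(seq_along(labels), labels)] <- 1
  z
}

G <- 3
alphamin <- recycle_to_G(0.5, G)           # scalar default expanded to length G
etamax   <- recycle_to_G(c(1000, 50), G)   # length 2 != 3, so 1000 is replicated
z0       <- hard_z(c(1, 3, 2, 1), G)       # 4 observations, 3 groups
```

A matrix shaped like z0 is the kind of object that start.z describes; a soft classification would instead hold row-wise probabilities summing to one.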

Value

An object of class ContaminatedMixt is a list with components:
call
an object of class call.
best
a data frame with the best number of mixture components (first column) and the best model (second column) with respect to the two model selection criteria adopted (BIC and ICL).
bestBIC, bestICL
for the best BIC and ICL models, these are two lists (of the same type) with components:
- modelname: the name of the best model.
- npar: number of free parameters.
- X: matrix of data.
- G: number of mixture components.
- p: number of variables.
- prior: weights for the mixture components.
- priorgood: weights for the good observations in each of the $G$ groups.
- mu: component means.
- Sigma: component covariance matrices for the good observations.
- eta: component contamination parameters.
- iter.stop: final iteration of the ECM algorithm.
- z: matrix with posterior probabilities for the outer groups.
- v: matrix with posterior probabilities for the inner groups.
- ind.label: vector of positions (rows) of the labeled observations.
- label: vector, of the same dimension as ind.label, with the group of membership of the observations indicated by ind.label.
- group: vector of integers indicating the maximum a posteriori classifications for the best model.
- loglik: log-likelihood value of the best model.
- BIC: BIC value.
- ICL: ICL value.
- call: an object of class call for the best model.
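
How the z, v and group components relate can be sketched with toy matrices in base R. The numbers below are made up and stand in for the output of a real fit. The 0.5 cut-off for flagging a bad observation is an assumption here, following the rule commonly used with contaminated normal mixtures, and is not something this page states.

```r
# Toy posterior matrices for n = 4 observations and G = 2 groups
# (made-up values standing in for the z and v components of a fit).
z <- matrix(c(0.9, 0.1,
              0.2, 0.8,
              0.6, 0.4,
              0.3, 0.7), ncol = 2, byrow = TRUE)
v <- matrix(c(0.99, 0.50,
              0.80, 0.95,
              0.10, 0.90,
              0.70, 0.20), ncol = 2, byrow = TRUE)

# Maximum a posteriori group: the column with the largest z in each row.
group <- apply(z, 1, which.max)

# Posterior probability of being "good" within the assigned group.
v_assigned <- v[cbind(seq_len(nrow(v)), group)]

# Assumed rule: flag an observation as "bad" when that probability is below 0.5.
bad <- v_assigned < 0.5
```

With these toy values the last two observations are flagged, since their probabilities of being good in their assigned groups are 0.10 and 0.20.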

Details

The multivariate data contained in X are either clustered or classified using parsimonious mixtures of multivariate contaminated normal distributions with some or all of the 14 parsimonious models described in Punzo and McNicholas (2015). Model specification (via the model argument) follows the nomenclature popularized in other packages such as mixture and mclust. Such a nomenclature refers to the decomposition and constraints on the scale matrix (see Banfield and Raftery, 1993; Celeux and Govaert, 1995; and Punzo and McNicholas, 2015 for details): $$\Sigma_g = \lambda_g \Gamma_g \Delta_g \Gamma_g'.$$

The nomenclature describes (in order) the volume ($\lambda_g$), shape ($\Delta_g$), and orientation ($\Gamma_g$), in terms of "V"ariable, "E"qual, or the "I"dentity matrix. As an example, the string "VEI" would refer to the model where $\Sigma_g = \lambda_g \Delta$. Note that for $G=1$, several models are equivalent (for example, "EEE" and "VVV"). Thus, for $G=1$ only one model from each set of equivalent models will be run.

The algorithms detailed in Celeux and Govaert (1995) are considered in the first CM-step of the ECM algorithm to update $\Sigma_g$ for all the models apart from "EVE" and "VVE". For "EVE" and "VVE", majorization-minimization (MM) algorithms (Hunter and Lange, 2000) and accelerated line search algorithms on the Stiefel manifold (Absil, Mahony and Sepulchre, 2009 and Browne and McNicholas, 2014), which are especially preferable in higher dimensions (Browne and McNicholas, 2014), are used to update $\Sigma_g$; the same approach is also adopted in the mixture package for those models.

Starting values are very important to the successful operation of these algorithms and so care must be taken in the interpretation of results. All the initializations considered here provide initial quantities for the first CM-step of the ECM algorithm.
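
The volume, shape and orientation decomposition can be made concrete with a small base R sketch. The helper make_sigma is hypothetical (it is not part of any package), and the sketch assumes the usual convention that the shape matrix $\Delta_g$ is diagonal with determinant 1, so that $\lambda_g = |\Sigma_g|^{1/p}$.

```r
# Hypothetical helper: assemble a scale matrix from volume (lambda),
# shape (diagonal entries delta) and orientation (orthogonal matrix gamma).
make_sigma <- function(lambda, delta, gamma) {
  lambda * gamma %*% diag(delta) %*% t(gamma)
}

# Shape: diagonal entries whose product is 1, so det(Delta) = 1.
delta <- c(2, 0.5)

# Orientation: a rotation by 30 degrees, which is an orthogonal matrix.
theta <- pi / 6
gamma <- matrix(c(cos(theta), sin(theta),
                  -sin(theta), cos(theta)), nrow = 2)

# "VEI": volume varies by group, shape is shared, orientation is the identity.
sigma_vei_1 <- make_sigma(lambda = 1, delta = delta, gamma = diag(2))
sigma_vei_2 <- make_sigma(lambda = 4, delta = delta, gamma = diag(2))

# A rotated (non-axis-aligned) scale matrix with the same shape.
sigma_rot <- make_sigma(lambda = 1.5, delta = delta, gamma = gamma)

# Recover volume and shape from sigma_rot via its eigen-decomposition.
p <- nrow(sigma_rot)
lambda_hat <- det(sigma_rot)^(1 / p)
eig <- eigen(sigma_rot, symmetric = TRUE)
delta_hat <- eig$values / lambda_hat   # eigenvalues in decreasing order
```

Under "VEI" the two groups share the same axis-aligned ellipse shape and differ only in size, which is why sigma_vei_2 is exactly four times sigma_vei_1.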

References

Absil, P. A., Mahony, R. and Sepulchre, R. (2009). Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ.

Banfield, J. D. and Raftery, A. E. (1993). Model-Based Gaussian and Non-Gaussian Clustering. Biometrics, 49(3), 803--821.

Browne, R. P. and McNicholas, P. D. (2013). Estimating Common Principal Components in High Dimensions. Advances in Data Analysis and Classification, 8(2), 217--226.

Browne, R. P. and McNicholas, P. D. (2014). Orthogonal Stiefel manifold optimization for eigen-decomposed covariance parameter estimation in mixture models. Statistics and Computing, 24(2), 203--210.

Browne, R. P. and McNicholas, P. D. (2015). mixture: Mixture Models for Clustering and Classification. R package version 1.4.

Celeux, G. and Govaert, G. (1995). Gaussian Parsimonious Clustering Models. Pattern Recognition, 28(5), 781--793.

Hunter, D. R. and Lange, K. (2000). Rejoinder to Discussion of "Optimization Transfer Using Surrogate Objective Functions". Journal of Computational and Graphical Statistics, 9(1), 52--59.

Punzo, A. and McNicholas, P. D. (2015). Parsimonious mixtures of contaminated Gaussian distributions with application to allometric studies. arXiv.org e-print 1305.4669, available at: http://arxiv.org/abs/1305.4669.

Examples


## Note that the example is extremely simplified 
## in order to reduce computation time

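# Artificial data from an EEI Gaussian mixture with G = 2 components,
# plus 20 uniform noise points (labelled as group 3 below)

library("mnormt")            # provides rmnorm()
library("ContaminatedMixt")  # provides CNmixt() and agree()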
p <- 2
set.seed(12345)
X1 <- rmnorm(n = 200, mean = rep(2, p), varcov = diag(c(5, 0.5)))
X2 <- rmnorm(n = 200, mean = rep(-2, p), varcov = diag(c(5, 0.5)))
noise <- matrix(runif(n = 40, min = -20, max = 20), nrow = 20, ncol = 2)
X <- rbind(X1, X2, noise)

group <- rep(c(1, 2, 3), times = c(200, 200, 20))
plot(X, col = group, pch = c(3, 4, 16)[group], asp = 1,
     xlab = expression(X[1]), ylab = expression(X[2]))

# ---------------------- #
# Model-based clustering #
# ---------------------- #

res1 <- CNmixt(X, model = c("EEI", "VVV"), G = 2, parallel = FALSE)

summary(res1)

agree(res1, givgroup = group)

plot(res1, contours = TRUE, asp = 1, xlab = expression(X[1]), ylab = expression(X[2]))

# -------------------------- #
# Model-based classification #
# -------------------------- #

# Treat 20 randomly chosen observations from the two Gaussian groups as labeled
indlab <- sample(1:400, 20)
lab <- group[indlab]
res2 <- CNmixt(X, G = 2, model = "EEI", ind.label = indlab, label = lab)

agree(res2, givgroup = group)
