MS: Fitting for the Parsimonious Mixtures of Contaminated Gaussian Distributions

Description

Carries out model-based clustering or model-based classification using some or all of the 14 parsimonious mixtures of contaminated Gaussian Distributions by using the ECM algorithm. Likelihood-based model-selection criteria are used to select the best model and the number of mixture components.

Usage

MS(X, k, model = NULL, initialization = "mclust", alphacon = TRUE, alphamin = NULL, alphafix = FALSE, alpha = NULL, etacon = TRUE, etafix = FALSE, eta = NULL, etamax = 200, 
	 start.z = NULL, start.v = NULL, start = 0, ind.label = NULL, label = NULL, iter.max = 1000, threshold = 1.0e-03)

Arguments

A matrix or data frame such that rows correspond to observations and columns correspond to variables. Note that this function currently only works with multivariate data ($p > 1$).

a vector containing the numbers of groups to be tried.

model

vector indicating the models (i.e., the covariance structures: "EII", "VII", "EEI", "VEI", "EVI", "VVI", "EEE", "VEE", "EVE", "EEV", "VVE", "VEV", "EVV", "VVV") to be used. If NULL, then all 14 models are fitted.

initialization

initialization strategy for the ECM-algorithm. It can be:

"mclust": posterior probabilities from mixtures of Gaussian distributions are used for initialization;
"random.soft": initial posterior probabilities are random generated;
"random.hard": initial classification matrix is random generated;
"manual": the user must specify, via the arguments start.z and start.v, posterior probabilities or classification matrix for the mixture components and the 3D array of membership to the ``good'' and ``bad'' groups in each mixture component, respectively.

Default value is "mclust".

alphacon

if TRUE, the vector with proportions of good observations is constrained to be greater than the vector specified by the alphamin argument.

alphamin

when alphacon=TRUE, vector with minimum proportions of good observations in each group.

alphafix

when alphafix=TRUE, the vector of proportions of good observations is fixed to the vector specified in the alpha argument.

alpha

vector of proportions of good observations in each group to be considered when alphafix=TRUE.

etacon

if TRUE, the contaminated parameters are all constrained to be greater than one.

etafix

if TRUE, the vector of contaminated parameters is fixed to the vector specified by the eta argument.

eta

vector of contaminated parameters to be considered when etafix.

etamax

maximum value for the contamination parameters to be considered in the estimation phase when etafix=FALSE.

start.z

matrix of soft or hard classification; it is used only if initialization="manual".

start.v

3D array of soft or hard classification to the good and bad groups in each mixture components. It is used as initialization when initialization="manual".

start

when initialization="manual", initialization used for the gpcm() function of the mixture package (see mixture:gpcm for details).

ind.label

vector of positions (rows) of the labeled observations.

label

vector, of the same dimension as ind.label, with the group of membership of the observations indicated in the ind.label argument.

iter.max

maximum number of iterations in the ECM-algorithm. Default value is 1000.

threshold

threshold for Aitken's acceleration procedure. Default value is 1.0e-03.

Value

call

pmcgdan object of class call

best

a data frame with the best number of mixture components (first column) and the best model (second column) with respect to the three model selection criteria adopted (AIC, BIC, and ICL)

bestAIC,bestBIC,bestICL

for the best AIC, BIC, and ICL models, these are three lists (of the same type) with components:

modelname: the name of the best model.
npar: number of free parameters.
X: matrix of data.
k: number of mixture components.
p: number of variables.
prior: weights for the mixture components.
priorgood: weights for the good observations in each of the k groups.
mu: component means.
Sigma: component covariance matrices for the good observations.
lambda: component volumes for the good observations.
Delta: component shape matrices for the good observations.
Gamma: component orientation matrices for the good observations.
eta: component contamination parameters.
iter.stop: final iteration of the ECM algorithm.
z: matrix with posterior probabilities for the outer groups.
v: matrix with posterior probabilities for the inner groups.
group: vector of integers indicating the maximum a posteriori classifications for the best model.
loglik: log-likelihood value of the best model.
AIC: AIC value
BIC: BIC value
ICL: ICL value
call: an object of class call for the best model.

Details

The multivariate data contained in X are either clustered or classified using parsimonious mixtures of contaminated Gaussian densities with some or all of the 14 parsimonious covariance structures described in Punzo & McNicholas (2013). The algorithms given by Browne & McNicholas (2013) are considered (see also Celeux & Govaert, 1995, for all the models apart from "EVE" and "VVE"). Starting values are very important to the successful operation of these algorithms and so care must be taken in the interpretation of results.

References

Punzo, A., and McNicholas, P. D. (2013). Outlier Detection via Parsimonious Mixtures of Contaminated Gaussian Distributions. arXiv.org e-print 1305.4669, available at: http://arxiv.org/abs/1305.4669.

Browne, R. P. and McNicholas, P. D. (2013). mixture: Mixture Models for Clustering and Classification. R package version 1.0.

Celeux, G. and Govaert, G. (1995). Gaussian Parsimonious Clustering Models. Pattern Recognition. 28(5), 781-793.

Examples

Run this code


# Artificial data from an EEI model with k=2 components

library(mnormt)
p   <- 2
k   <- 2
eta <- c(8,8) # contamination parameters
set.seed(12345)
X1good <- rmnorm(n = 300, mean = rep(3,p), varcov = diag(c(5,0.5)))
X2good <- rmnorm(n = 300, mean = rep(-3,p), varcov = diag(c(5,0.5)))
X1bad  <- rmnorm(n = 30, mean = rep(3,p), varcov = eta[1]*diag(c(5,0.5)))
X2bad  <- rmnorm(n = 30, mean = rep(-3,p), varcov = eta[2]*diag(c(5,0.5)))
X      <- rbind(X1good,X1bad,X2good,X2bad)
plot(X, pch = 16, cex = 0.8)

# model-based clustering with the whole family of 14 
# parsimonious models and number of groups ranging from 1 to 3

overallfit <- MS(X, k = 1:2, model = c("EEI","VVV"), initialization = "mclust")  

# to see the best BIC results

bestBIC <- overallfit$bestBIC

# plot of the best BIC model

plot(X, xlab = expression(X[1]), ylab = expression(X[2]), col = "white")
text(X, labels = bestBIC$detection$innergroup, col = bestBIC$group, cex = 0.7, asp = 1)
box(col = "black")

Run the code above in your browser using DataLab