EmSkew: The EM Algorithm and Skew Mixture Models

Description

EmSkew is the main function of the package. It fits the specified multivariate mixture model to the data via the EM algorithm. The available distributions (univariate and multivariate) are the Normal distribution, the t-distribution, the Skew Normal distribution, and the Skew t-distribution.

Usage

EmSkew(dat, g, distr = "mvn", ncov = 3, clust = NULL, init = NULL, itmax = 1000,
       epsilon = 1e-6, nkmeans = 0, nrandom = 10, nhclust = FALSE, debug = TRUE,
       initloop = 20)

Arguments

dat

The dataset, an n by p numeric matrix, where n is the number of observations and p is the dimension of the data.

g

The number of components of the mixture model.

distr

A three-letter string indicating the type of distribution to be fitted; the default value is "mvn", the Normal distribution. See Details.

ncov

A small integer indicating the type of covariance structure; the default value is 3. See Details.

clust

A vector of integers specifying the initial partition of the data; the default is NULL.

init

A list containing the initial parameters for the mixture model; the default value is NULL. See Details.

itmax

A large integer specifying the maximum number of iterations to apply; the default value is 1000.

epsilon

A small number used to stop the EM algorithm loop when the relative difference between the log-likelihoods of successive iterations becomes sufficiently small; the default value is 1e-6.

nkmeans

An integer specifying the number of k-means partitions to be used to find the best initial values; the default value is 0.

nrandom

An integer specifying the number of random partitions to be used to find the best initial values; the default value is 10.

nhclust

A logical value specifying whether or not to use hierarchical clustering; the default is FALSE. If TRUE, the Complete Linkage method is used.

debug

A logical value. If TRUE, the output is printed; if FALSE, the function runs silently. The default value is TRUE.

initloop

An integer specifying the number of initial loops used when searching for the best initial partition; the default value is 20.
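
The stopping rule controlled by epsilon and itmax can be sketched in base R as follows. This is only an illustration, assuming the relative difference is measured against the absolute value of the current log-likelihood; the helper has_converged is hypothetical and not part of the package, and the log-likelihood values are made up.

```r
# Hypothetical helper (not part of the package): TRUE when the relative
# change in log-likelihood between two successive iterations is below epsilon.
has_converged <- function(loglik_old, loglik_new, epsilon = 1e-6) {
  abs(loglik_new - loglik_old) < epsilon * abs(loglik_new)
}

# Made-up sequence of log-likelihood values, for illustration only
lk    <- c(-3500, -3200, -3150, -3149.9, -3149.8999, -3149.8999)
itmax <- 1000

# Stop at the first iteration that satisfies the rule, or after itmax iterations
for (it in 2:min(length(lk), itmax)) {
  if (has_converged(lk[it - 1], lk[it])) break
}
it  # index of the iteration at which the loop stopped
```

A smaller epsilon demands a flatter log-likelihood before stopping, so itmax acts as the safeguard when that flatness is never reached.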

Value

error

Error code: 0 = normal exit; 1 = did not converge within itmax iterations; 2 = failed to get the initial values; 3 = singularity.

aic

Akaike Information Criterion (AIC)

bic

Bayesian Information Criterion (BIC)

ICL

Integrated Completed Likelihood Criterion (ICL)

pro

A vector of mixing proportions.

mu

A numeric matrix with each column corresponding to the mean of a component.

sigma

An array of dimension (p,p,g) whose first two dimensions hold the covariance matrix of each component.

dof

A vector of degrees of freedom for each component. See Details.

delta

A p by g matrix with each column corresponding to a skew parameter vector.

clust

A vector giving the final partition.

loglik

The log-likelihood at convergence.

A vector of log likelihood at each EM iteration

tau

An n by g matrix of posterior probabilities for each data point.
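
As a rough illustration of how some of these components relate to each other, the sketch below derives a hard partition from a posterior probability matrix and computes AIC and BIC from a log-likelihood. It assumes the usual conventions (each point is assigned to the component with the largest posterior probability; AIC = -2 loglik + 2d and BIC = -2 loglik + d log(n), with d the number of free parameters). The package's internal computation may differ, and all numbers are made up.

```r
# Made-up posterior probabilities for n = 4 points and g = 3 components
tau <- rbind(c(0.90, 0.05, 0.05),
             c(0.10, 0.70, 0.20),
             c(0.20, 0.30, 0.50),
             c(0.60, 0.30, 0.10))

# Hard partition: index of the largest posterior probability in each row
clust_from_tau <- apply(tau, 1, which.max)

# Information criteria under the usual definitions (an assumption, see above)
loglik <- -12.3       # made-up log-likelihood
d      <- 5           # made-up number of free parameters
n      <- nrow(tau)   # number of observations
aic <- -2 * loglik + 2 * d
bic <- -2 * loglik + d * log(n)
```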

Details

The distribution type is determined by the distr parameter, which may take any one of the following values: "mvn" for a multivariate normal distribution, "mvt" for a multivariate t-distribution, "msn" for a multivariate skew normal distribution, and "mst" for a multivariate skew t-distribution.

The covariance matrix type, represented by the ncov parameter, may be any one of the following: ncov=1 for a common variance, ncov=2 for a common diagonal variance, ncov=3 for a general variance, ncov=4 for a diagonal variance, and ncov=5 for sigma(h)*I(p) (a diagonal covariance with identical diagonal elements).
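
To make the five ncov options more concrete, the following base R sketch builds a p by p by g array under each structure. It assumes that "common" means shared by all components and that the other options are component-specific. The helper make_sigma is hypothetical (not part of the package), and the values are arbitrary.

```r
p <- 2; g <- 3

# An arbitrary full covariance matrix per component (symmetric, positive definite)
full <- array(0, c(p, p, g))
for (h in 1:g) full[, , h] <- diag(p) * h + 0.2 * (1 - diag(p))

# Hypothetical helper: impose the structure selected by ncov on each component
make_sigma <- function(full, ncov) {
  p <- dim(full)[1]; g <- dim(full)[3]
  out <- full
  for (h in 1:g) {
    s <- full[, , h]
    out[, , h] <- switch(ncov,
      full[, , 1],                # 1: one matrix shared by all components
      diag(diag(full[, , 1])),    # 2: one diagonal matrix shared by all
      s,                          # 3: unrestricted, component-specific
      diag(diag(s)),              # 4: diagonal, component-specific
      mean(diag(s)) * diag(p))    # 5: sigma(h) * I(p)
  }
  out
}

sigma5 <- make_sigma(full, 5)  # spherical covariance for each component
```

Moving from ncov=3 towards ncov=5 reduces the number of free covariance parameters, which can help when components contain few observations.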

The parameter init requires the following elements: pro, a numeric vector of the mixing proportions of the components; mu, a p by g matrix with each column as the corresponding component mean; sigma, a three-dimensional p by p by g array whose jth matrix (p,p,j) is the covariance matrix of the jth component of the mixture model; dof, a vector of degrees of freedom for each component; delta, a p by g matrix with its columns corresponding to the skew parameter vectors.

Since we treat the list of pro, mu, sigma, dof, and delta as a common structure of parameters for our mixture models, all of them are included in the initial parameter list init by default, although in some cases this does not make sense; for example, dof and delta are not applicable to a normal mixture model. In most cases, however, the user only needs to give the relevant parameters in the list.
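
Because the expected shapes of the elements of init are easy to get wrong, a small check such as the one below can be run before calling EmSkew. The helper check_init is hypothetical (it is not provided by the package) and only verifies the dimensions described above, plus the fact that mixing proportions sum to one.

```r
# Hypothetical helper: stop with an error if any element of init has the wrong shape
check_init <- function(init, p, g) {
  stopifnot(
    length(init$pro) == g,
    isTRUE(all.equal(sum(init$pro), 1)),
    identical(dim(init$mu),    as.integer(c(p, g))),
    identical(dim(init$sigma), as.integer(c(p, p, g))),
    length(init$dof) == g,
    identical(dim(init$delta), as.integer(c(p, g)))
  )
  invisible(TRUE)
}

p <- 2; g <- 3
init <- list(
  pro   = c(0.3, 0.3, 0.4),
  mu    = cbind(c(4, -4), c(3.5, 4), c(0, 0)),
  sigma = array(diag(p), c(p, p, g)),          # identity matrix for each component
  dof   = c(3, 5, 5),
  delta = cbind(c(3, 3), c(1, 5), c(-3, 1))
)
check_init(init, p, g)
```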

When the parameter list init is given, the program ignores both the initial partition clust and the automatic partition methods such as nkmeans. Only when neither init nor clust is available does the program use automatic approaches, such as the k-means partition method, to find the best initial values. All three automatic approaches are used to find the best initial partition and initial values if required.
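
The precedence just described can be summarised in a few lines of base R. The function init_source below is hypothetical and is not the package's code. It additionally assumes that an automatic method is active when its count is positive (nkmeans, nrandom) or its flag is TRUE (nhclust).

```r
# Hypothetical summary of which source of initial values takes effect
init_source <- function(init = NULL, clust = NULL,
                        nkmeans = 0, nrandom = 10, nhclust = FALSE) {
  if (!is.null(init))  return("init")    # explicit parameters take precedence
  if (!is.null(clust)) return("clust")   # otherwise a user-supplied partition
  # otherwise fall back on whichever automatic methods are switched on
  auto <- c(kmeans = nkmeans > 0, random = nrandom > 0, hclust = nhclust)
  names(auto)[auto]
}

init_source(init = list(), clust = 1:3)     # init wins over clust
init_source(nkmeans = 10, nhclust = TRUE)   # automatic methods only
```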

The return values include all potential parameters pro, mu, sigma, dof, and delta, but the user should not use or interpret those that are irrelevant to the fitted model, for example dof and delta for Normal mixture models.

References

Biernacki C., Celeux G., and Govaert G. (2000). Assessing a Mixture Model for Clustering with the Integrated Completed Likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence. 22(7). 719-725.

McLachlan G.J. and Krishnan T. (2008). The EM Algorithm and Extensions (2nd ed.). New Jersey: Wiley.

McLachlan G.J. and Peel D. (2000). Finite Mixture Models. New York: Wiley.

Examples

# NOT RUN {
#define the dimension of dataset

n1=300;n2=300;n3=400;
nn<-c(n1,n2,n3)

p  <- 2
ng <- 3

#define the parameters
sigma<-array(0,c(2,2,3))
for(h in 2:3) sigma[,,h]<-diag(2)
sigma[,,1]<-cbind( c(1,0.2),c(0.2,1))
mu     <- cbind(c(4,-4),c(3.5,4),c( 0, 0))

#and other parameters if required for "mvt","msn","mst"
delta  <- cbind(c(3,3),c(1,5),c(-3,1))
dof    <- c(3,5,5)

pro   <- c(0.3,0.3,0.4)

distr="mvn"
ncov=3

# generate a data set

set.seed(111) #random seed is reset 

dat <- rdemmix(nn,p,ng,distr,mu,sigma)



# the following code can be used to get singular data (commented out)
#	dat[1:300,2] <- -4 
#	dat[300+1:300,1]<-2
##	dat[601:1000,1]<-0
##	dat[601:1000,2]<-0



# fit the data using k-means to get the initial partitions (10 trials)
obj <- EmSkew(dat,ng,distr,ncov,itmax=1000,epsilon=1e-5,nkmeans=10)


# alternatively, if we define initial values like 
initobj<-list()

initobj$pro  <- pro
initobj$mu   <- mu
initobj$sigma<- sigma


initobj$dof  <- dof
initobj$delta<- delta


# then we can fit the data from initial values
obj <- EmSkew(dat,ng,distr,ncov,init=initobj,itmax=1000,epsilon=1e-5)

# finally, if we know the initial partition, such as 
clust       <- rep(1:ng,nn)


# then we can fit the data from given initial partition
obj <- EmSkew(dat,ng,distr,ncov,clust=clust,itmax=1000,epsilon=1e-5)

# plot the 2D contours

colnames(dat)<- paste("x",1:p,sep='')

# dev.new()
EmSkew.flow(dat,obj)

# }
