flexmixedruns: Fitting mixed Gaussian/multinomial mixtures with flexmix

Description

flexmixedruns fits a latent class mixture (clustering) model where some variables are continuous and modelled within the mixture components by Gaussian distributions and some variables are categorical and modelled within components by independent multinomial distributions. The fit is by maximum likelihood estimation computed with the EM-algorithm. The number of components can be estimated by the BIC.

Note that at least one categorical variable is needed, but it is possible to use data without continuous variable.

Usage

flexmixedruns(x,diagonal=TRUE,xvarsorted=TRUE,
                          continuous,discrete,ppdim=NULL,initial.cluster=NULL,
                          simruns=20,n.cluster=1:20,verbose=TRUE,recode=TRUE,
                          allout=TRUE,control=list(minprior=0.001),silent=TRUE)

Arguments

data matrix or data frame. The data need to be organised case-wise, i.e., if there are categorical variables only, and 15 cases with values c(1,1,2) on the 3 variables, the data matrix needs 15 rows with values 1 1 2. (Categorical variables co

diagonal

logical. If TRUE, Gaussian models are fitted restricted to diagonal covariance matrices. Otherwise, covariance matrices are unrestricted. TRUE is consistent with the "within class independence" assumption for the mult

xvarsorted

logical. If TRUE, the continuous variables are assumed to be the first ones, and the categorical variables to be behind them.

continuous

vector of integers giving positions of the continuous variables. If xvarsorted=TRUE, a single integer, number of continuous variables.

discrete

vector of integers giving positions of the categorical variables. If xvarsorted=TRUE, a single integer, number of categorical variables.

ppdim

vector of integers specifying the number of (in the data) existing categories for each categorical variable. If recode=TRUE, this can be omitted and is computed automatically.

initial.cluster

this corresponds to the cluster parameter in flexmix and should only be specified if simruns=1 and n.cluster is a single number. Either a matrix with n.cluster columns of initial

simruns

integer. Number of starts of the EM algorithm with random initialisation in order to find a good global optimum.

n.cluster

vector of integers, numbers of components (the optimum one is found by minimising the BIC).

verbose

logical. If TRUE, some information about the different runs of the EM algorithm is given out.

recode

logical. If TRUE, the function discrete.recode is applied in order to recode categorical data so that the lcmixed-method can use it. Only set this to FALSE if your data already has that forma

allout

logical. If TRUE, the regular flexmix-output is given out for every single number of clusters, which can create a huge output object.

control

list of control parameters for flexmix, for details see the help page of FLXcontrol-class.

silent

logical. This is passed on to the try-function. If FALSE, error messages from failed runs of flexmix are suppressed. (The information that a flexmix-error occu

Value

A list with components
optsummarysummary object for flexmix object with optimal number of components.
optimalkoptimal number of components.
errcountvector with numbers of EM runs for each number of components that led to flexmix errors.
flexoutif allout=TRUE, list of flexmix output objects for all numbers of components, for details see the help page of flexmix-class. Slots that can be used include for example cluster and components. So if fo is the flexmixedruns-output object, fo$flexout[[fo$optimalk]]@cluster gives a component number vector for the observations (maximum posterior rule), and fo$flexout[[fo$optimalk]]@components gives the estimated model parameters, which for lcmixed and therefore flexmixedruns are called [object Object],[object Object],[object Object] If allout=FALSE, only the flexmix output object for the optimal number of components, i.e., the [[fo$optimalk]] indexing above can then be omitted.
bicvalsvector of values of the BIC for each number of components.
ppdimvector of categorical variable-wise numbers of categories.
discretelevelslist of levels of the categorical variables belonging to what is treated by flexmixedruns as category 1, 2, 3 etc.

Details

Sometimes flexmix produces errors because of degenerating covariance matrices, too small clusters etc. flexmixedruns tolerates these and treats them as non-optimal runs. (Higher simruns or different control may be required to get a valid solution.) General documentation on flexmix can be found in Friedrich Leisch's "FlexMix: A General Framework for Finite Mixture Models and Latent Class Regression in R", http://cran.r-project.org/web/packages/flexmix/vignettes/flexmix-intro.pdf

References

Hennig, C. and Liao, T. (2010) Comparing latent class and dissimilarity based clustering for mixed type variables with application to social stratification. Research report no. 308, Department of Statistical Science, UCL. http://www.ucl.ac.uk/Stats/research/reports/psfiles/rr308.pdf Revised version accepted for publication by Journal of the Royal Statistical Society Series C.

Examples

Run this code

set.seed(776655)
  v1 <- rnorm(100)
  v2 <- rnorm(100)
  d1 <- sample(1:5,100,replace=TRUE)
  d2 <- sample(1:4,100,replace=TRUE)
  ldata <- cbind(v1,v2,d1,d2)
  fr <- flexmixedruns(ldata,
    continuous=2,discrete=2,simruns=2,n.cluster=2:3,allout=FALSE)
  print(fr$optimalk)
  print(fr$optsummary)
  print(fr$flexout@cluster)
  print(fr$flexout@components)

Run the code above in your browser using DataLab