VarSelCluster: Variable selection and clustering.

Description

This function performs the model selection and the maximum likelihood estimation. It can be used for clustering only (i.e., all the variables are assumed to be discriminative). In this case, you must specify the data to cluster (arg. x) and the number of clusters (arg. gvals), and the option vbleSelec must be FALSE. This function can also be used for variable selection in clustering. In this case, you must specify the data to analyse (arg. x) and the number of clusters (arg. gvals), and the option vbleSelec must be TRUE. Variable selection can be done with BIC, MICL or AIC.

Usage

VarSelCluster(x, gvals, vbleSelec = TRUE, crit.varsel = "BIC",
  initModel = 50, nbcores = 1, discrim = rep(1, ncol(x)), nbSmall = 250,
  iterSmall = 20, nbKeep = 50, iterKeep = 1000, tolKeep = 10^(-6))

Arguments

x

data.frame/matrix. Rows correspond to observations and columns correspond to variables. Continuous variables must be "numeric", count variables must be "integer" and categorical variables must be "factor".

gvals

numeric. It defines the number of components to consider.

vbleSelec

logical. It indicates whether variable selection is performed.

crit.varsel

character. It defines the information criterion used for the variable selection. Without variable selection, you can use one of the three criteria: "AIC", "BIC" and "ICL". With variable selection, you can use "AIC", "BIC" and "MICL".

initModel

numeric. It gives the number of initializations of the alternated algorithm maximizing the MICL criterion (only used if crit.varsel="MICL")

nbcores

numeric. It defines the number of cores used by the algorithm.

discrim

numeric. It indicates if each variable is discriminative (1) or irrelevant (0) (only used if vbleSelec=FALSE).

nbSmall

numeric. It indicates the number of SmallEM algorithms performed for the ML inference

iterSmall

numeric. It indicates the number of iterations for each SmallEM algorithm

nbKeep

numeric. It indicates the number of chains used for the final EM algorithm

iterKeep

numeric. It indicates the maximal number of iterations for each EM algorithm

tolKeep

numeric. It indicates the maximal gap between two successive iterations of the EM algorithm below which the algorithm stops.
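The class requirements on the columns of x can be met with a few base R coercions before calling the function. The following is a minimal sketch, not taken from the package: the toy data frame, its column names and its values are hypothetical, and the call to VarSelCluster is only attempted when the VarSelLCM package is installed.

```r
# Hypothetical toy data: column names and values are made up for illustration.
set.seed(1)
n <- 60
toy <- data.frame(
  height = c(rnorm(n / 2, 160, 5), rnorm(n / 2, 180, 5)),  # continuous
  visits = c(rpois(n / 2, 2), rpois(n / 2, 8)),            # count
  smoker = sample(c("no", "yes"), n, replace = TRUE),      # categorical
  stringsAsFactors = FALSE
)

# Coerce each column to the class expected for the x argument:
# "numeric" for continuous, "integer" for count, "factor" for categorical.
toy$height <- as.numeric(toy$height)
toy$visits <- as.integer(toy$visits)
toy$smoker <- factor(toy$smoker)

# Check the first class of every column before fitting.
col_classes <- vapply(toy, function(v) class(v)[1], character(1))
print(col_classes)

# Fit only if the package is available (assumption: default tuning values
# from the Usage section are acceptable for such a small data set).
if (requireNamespace("VarSelLCM", quietly = TRUE)) {
  fit <- VarSelLCM::VarSelCluster(toy, gvals = 2, vbleSelec = TRUE,
                                  crit.varsel = "BIC", nbcores = 1)
  print(fit)
}
```

Checking the classes first makes it easier to spot a categorical variable that was read as character, or a count variable that was read as double.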

Value

Returns an instance of VSLCMresults.

References

Marbac, M. and Sedki, M. (2017). Variable selection for model-based clustering using the integrated completed-data likelihood. Statistics and Computing, 27 (4), 1049-1063.

Marbac, M., Patin, E. and Sedki, M. (2018). Variable selection for mixed data clustering: Application in human population genomics. Journal of Classification, to appear.

Examples

# NOT RUN {
# Package loading
require(VarSelLCM)

# Data loading:
# x contains the observed variables
# z contains the known status (i.e. 1: absence and 2: presence of heart disease)
data(heart)
z <- heart[,"Class"]
x <- heart[,-13]

# Cluster analysis without variable selection
res_without <- VarSelCluster(x, 2, vbleSelec = FALSE)

# Cluster analysis with variable selection (with parallelisation)
res_with <- VarSelCluster(x, 2, nbcores = 2, initModel=40)

# Confusion matrices and ARI: variable selection decreases the misclassification error rate
print(table(z, res_without@partitions@zMAP))
print(table(z, res_with@partitions@zMAP))
ARI(z, res_without@partitions@zMAP)
ARI(z, res_with@partitions@zMAP)

# Summary of the best model
summary(res_with)

# Opening Shiny application to easily see the results
VarSelShiny(res_with)

# Parameters of the best model
print(res_with)

# Discriminative power of the variables (here, the most discriminative variable is MaxHeartRate)
plot(res_with, type="bar")
# Boxplot for continuous (or integer) variable
plot(res_with, y="MaxHeartRate", type="boxplot")

# Empirical and theoretical distributions (to check that clustering is pertinent)
plot(res_with, y="MaxHeartRate", type="cdf")

# Summary of categorical variable
plot(res_with, y="Sex")

# Summary of the probabilities of misclassification
plot(res_with, type="probs-class")

# Imputation by posterior mean for the first observation
not.imputed <- heart[1,-13]
imputed <- VarSelImputation(res_with)[1,]
rbind(not.imputed, imputed)

# }
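
The example compares confusion matrices to judge the misclassification error rate. Because cluster labels are arbitrary, the rate has to be computed up to a relabelling of the clusters. The following is a minimal sketch for the two-class case using base R only; misclass_rate is a hypothetical helper and is not part of VarSelLCM, and the small label vectors are made up for illustration.

```r
# Hypothetical helper (not part of VarSelLCM): error rate of a two-cluster
# partition against two known classes, taking the better of the two
# possible matchings between cluster labels and class labels.
misclass_rate <- function(truth, cluster) {
  tab <- table(truth, cluster)
  # The sketch only handles the 2 x 2 case (both clusters must be non-empty).
  stopifnot(nrow(tab) == 2, ncol(tab) == 2)
  agree_direct  <- tab[1, 1] + tab[2, 2]  # cluster 1 <-> class 1
  agree_swapped <- tab[1, 2] + tab[2, 1]  # cluster 1 <-> class 2
  1 - max(agree_direct, agree_swapped) / sum(tab)
}

# Made-up labels: five of six observations agree once the labels are swapped.
truth   <- c(1, 1, 1, 2, 2, 2)
cluster <- c(2, 2, 1, 1, 1, 1)
rate <- misclass_rate(truth, cluster)
print(rate)
```

With the objects from the example above, the same helper could be applied as misclass_rate(z, res_without@partitions@zMAP) and misclass_rate(z, res_with@partitions@zMAP).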
