hdmda: Mixture Discriminant Analysis with HD Gaussians

Description

HD-MDA implements mixture discriminant analysis (MDA, Hastie & Tibshirani, 1996) with HD Gaussians instead of full Gaussians. Each class is assumed to be made of several class-specific groups in which the data live in low-dimensional subspaces. From a technical point of view, a clustering is done using hddc in each class.

Usage

hdmda(X,cls,K=1:10,model='AkjBkQkDk',show=FALSE,...)

Arguments

A matrix or a data frame of observations, assuming the rows are the observations and the columns the variables. Note that NAs are not allowed.

cls

The vector of the class of each observations, its type can be numeric or string.

A vector of integers specifying the number of clusters for which the BIC and the parameters are to be calculated; the function keeps the parameters which maximises the BIC. Note that the length of the vector K can't be larger than 20. Default is 1:10.

model

A character string vector, or an integer vector indicating the models to be used. The available models are: "AkjBkQkDk" (default), "AkBkQkDk", "ABkQkDk", "AkjBQkDk", "AkBQkDk", "ABQkDk", "AkjBkQkD", "AkBkQkD", "ABkQkD", "AkjBQkD", "AkBQkD", "ABQkD", "AjBQD", "ABQD". It is not case sensitive and integers can be used instead of names, see details for more information. Several models can be used, if it is, only the results of the one which maximizes the BIC criterion is kept. To run all models, use model="ALL".

show

Use show = TRUE to display some information related to the clustering.

...

Any argument that can be used by the function hddc.

Value

hdmda returns an 'hdmda' object which is a list containing: returns an 'hdmda' object which is a list containing:

Details

Some information on the signification of the model names:

The model “all” will compute all the models, give their BIC and keep the model with the highest BIC value. Instead of writing the model names, they can also be specified using an integer. 1 represents the most general model (“AkjBkQkDk”) while 14 is the most constrained (“ABQD”), the others number/name matching are given below:

AkjBkQkDk	1		AkjBkQkD
7	AkBkQkDk	2
AkBkQkD	8	ABkQkDk	3
	ABkQkD	9	AkjBQkDk
4		AkjBQkD	10
AkBQkDk	5		AkBQkD
11	ABQkDk	6
ABQkD	12	AkjBkQkDk	1

References

C. Bouveyron and C. Brunet (2014), “Model-based clustering of high-dimensional data: A review”, Computational Statistics and Data Analysis, vol. 71, pp. 52-78.

Bouveyron, C. Girard, S. and Schmid, C. (2007), “High Dimensional Discriminant Analysis”, Communications in Statistics: Theory and Methods, vol. 36 (14), pp. 2607-2623.

Bouveyron, C. Celeux, G. and Girard, S. (2011), “Intrinsic dimension estimation by maximum likelihood in probabilistic PCA”, Pattern Recognition Letters, vol. 32 (14), pp. 1706-1713.

Berge, L. Bouveyron, C. and Girard, S. (2012), “HDclassif: An R Package for Model-Based Clustering and Discriminant Analysis of High-Dimensional Data”, Journal of Statistical Software, 46(6), pp. 1-29, url: http://www.jstatsoft.org/v46/i06/.

Hastie, T., & Tibshirani, R. (1996), “Discriminant analysis by Gaussian mixtures”, Journal of the Royal Statistical Society, Series B (Methodological), pp. 155-176.

Examples

Run this code

# Load the Wine data set
data(wine)
cls = wine[,1]; X = scale(wine[,-1])

# A simple use...
out = hdmda(X[1:100,],cls[1:100])
res = predict(out,X[101:nrow(X),])

# Comparison between hdmda and hdda in a CV setup
set.seed(123); nb = 10; Err = matrix(NA,2,nb)
for (i in 1:nb){
  cat('.')
  test = sample(nrow(X),50)
  out0 = lda(X[-test,],cls[-test])
  res0 = predict(out0,X[test,])
  Err[1,i] = sum(res0$class != cls[test]) / length(test)
  out = hdmda(X[-test,],cls[-test],K=1:3,model="AKJBQKDK")
  res = predict(out,X[test,])
  Err[2,i] = sum(res$class != cls[test]) / length(test)
}
cat('\n')
boxplot(t(Err),names=c('LDA','HD-MDA'),col=2:3,ylab="CV classifciation error",
  main='CV classifciation error on Wine data')

Run the code above in your browser using DataLab