Rmixmod-package: Rmixmod a MIXture MODelling package

Description

Rmixmod is a package based on the existing MIXMOD software. MIXMOD is a tool for fitting a mixture model of multivariate Gaussian or multinomial components to a given data set, from a clustering, density estimation, or discriminant analysis point of view.

Details

The general purpose of the package is to discover, or explain, group structures in multivariate data sets with unknown classes (cluster analysis or clustering) or known classes (discriminant analysis or classification). It is an exploratory data analysis tool for solving clustering and classification problems. It can also be regarded as a semi-parametric tool for estimating densities with Gaussian mixture distributions and multinomial distributions.

Mathematically, a mixture probability density function (pdf) $f$ is a weighted sum of $K$ component densities:

$$ f({\bf x}_i|\theta) = \sum_{k=1}^{K}p_kh({\bf x}_i|\lambda_k) $$

where $h(\cdot|\lambda_k)$ denotes a $d$-dimensional distribution parametrized by $\lambda_k$. The parameters $\theta$ are the mixing proportions $p_k$ and the component parameters $\lambda_k$.

In the Gaussian case, $h$ is the density of a Gaussian distribution with mean $\mu_k$ and variance matrix $\Sigma_k$, and thus $\lambda_k = (\mu_k,\Sigma_k)$.
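
As a small illustration of the formula above, the following base R sketch evaluates a univariate Gaussian mixture density. It does not use Rmixmod; the function name `mixture_density` and the parameter values are hypothetical, chosen only for the example.

```r
# Density of a univariate Gaussian mixture:
#   f(x) = sum_k p_k * h(x | mu_k, sigma_k), with h a normal density.
# `mixture_density` is a hypothetical helper, not part of Rmixmod.
mixture_density <- function(x, p, mu, sigma) {
  stopifnot(length(p) == length(mu), length(mu) == length(sigma),
            abs(sum(p) - 1) < 1e-8)
  # One column per component, each weighted by its mixing proportion
  comp <- vapply(seq_along(p),
                 function(k) p[k] * dnorm(x, mean = mu[k], sd = sigma[k]),
                 numeric(length(x)))
  # vapply returns a plain vector when length(x) == 1; force a matrix
  comp <- matrix(comp, nrow = length(x))
  rowSums(comp)
}

# Hypothetical parameters for K = 2 components
p     <- c(0.3, 0.7)   # mixing proportions, must sum to 1
mu    <- c(-2, 1)      # component means
sigma <- c(0.5, 1.5)   # component standard deviations

x <- seq(-5, 6, length.out = 200)
f <- mixture_density(x, p, mu, sigma)
```

Calling `plot(x, f, type = "l")` shows the two modes; in $d$ dimensions the same sum applies with a multivariate normal density in place of `dnorm`.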

In the qualitative case, $h$ is a multinomial distribution and $\lambda_k=(a_k,\epsilon_k)$ is the parameter of the distribution.

Estimation of the mixture parameters is performed either through maximum likelihood via the EM (Expectation Maximization, Dempster et al. 1977) or SEM (Stochastic EM, Celeux and Diebolt 1985) algorithm, or through classification maximum likelihood via the CEM algorithm (Classification EM, Celeux and Govaert 1992). These three algorithms can be chained to obtain original fitting strategies (e.g. CEM, then EM started from the results of CEM) that combine the advantages of each in the estimation process. As mixture problems usually have multiple relative maxima, the program may produce different results depending on the initial estimates supplied by the user. If the user does not supply their own initial estimates, several initialization procedures are available (random centers, for instance).
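
To make the EM iteration concrete, here is a minimal base R sketch for a univariate Gaussian mixture with random-center initialization. It is an illustration under simplifying assumptions (one dimension, no safeguards against degenerate components), not the MIXMOD implementation; the function name `em_gaussian_mixture` and the simulated data are hypothetical.

```r
# Minimal EM for a univariate Gaussian mixture with K components.
# Illustrative only: hypothetical helper, not the Rmixmod/MIXMOD code.
em_gaussian_mixture <- function(x, K, n_iter = 100, tol = 1e-8) {
  n <- length(x)
  # Initialization: random centers drawn from the data,
  # a common spread, and equal mixing proportions
  mu    <- sample(x, K)
  sigma <- rep(sd(x), K)
  p     <- rep(1 / K, K)
  loglik <- numeric(0)

  for (iter in seq_len(n_iter)) {
    # Weighted component densities p_k * h(x_i | mu_k, sigma_k), n x K
    dens <- vapply(seq_len(K),
                   function(k) p[k] * dnorm(x, mu[k], sigma[k]),
                   numeric(n))
    # Log-likelihood of the current parameters
    loglik <- c(loglik, sum(log(rowSums(dens))))
    # E step: posterior probability that x_i belongs to component k
    t_ik <- dens / rowSums(dens)
    # M step: update proportions, means and standard deviations
    n_k   <- colSums(t_ik)
    p     <- n_k / n
    mu    <- colSums(t_ik * x) / n_k
    sigma <- sqrt(colSums(t_ik * outer(x, mu, "-")^2) / n_k)
    # Stop when the log-likelihood no longer improves
    if (iter > 1 && abs(loglik[iter] - loglik[iter - 1]) < tol) break
  }
  list(proportions = p, means = mu, sds = sigma, loglik = loglik)
}

# Simulated data from two well-separated groups (hypothetical values)
set.seed(1)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1))
fit <- em_gaussian_mixture(x, K = 2)
fit$means
```

Running the function again with another seed changes the random centers and, on harder data, may lead to a different local maximum, which is why several starts are usually tried.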

It is possible to constrain some input parameters. For example, dispersions can be constrained to be equal between classes.

In the Gaussian case, fourteen models are implemented. They are based on the eigenvalue decomposition of the variance matrices and are the most generally used. They depend on constraints on the variance matrix (for example, the same variance matrix between clusters, or a spherical variance matrix), and they are suitable for data sets in any dimension.
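
The sketch below shows the eigenvalue decomposition in base R, assuming the parametrization usual in the model-based clustering literature, $\Sigma_k = \lambda_k D_k A_k D_k'$, where $\lambda_k = |\Sigma_k|^{1/d}$ is the volume, $D_k$ the matrix of eigenvectors (orientation) and $A_k$ a diagonal matrix with determinant 1 (shape). The helper `decompose_covariance` and the example matrix are hypothetical and not part of Rmixmod.

```r
# Decompose a covariance matrix as Sigma = lambda * D %*% A %*% t(D).
# Hypothetical helper for illustration; not an Rmixmod function.
decompose_covariance <- function(Sigma) {
  d   <- nrow(Sigma)
  eig <- eigen(Sigma, symmetric = TRUE)
  lambda <- prod(eig$values)^(1 / d)         # volume: |Sigma|^(1/d)
  A <- diag(eig$values / lambda, nrow = d)   # shape: diagonal, determinant 1
  D <- eig$vectors                           # orientation: eigenvectors
  list(lambda = lambda, D = D, A = A)
}

# Hypothetical 2 x 2 covariance matrix
Sigma <- matrix(c(4, 1.5,
                  1.5, 1), nrow = 2, byrow = TRUE)
parts <- decompose_covariance(Sigma)

# Rebuilding the matrix from its three parts recovers Sigma
rebuilt <- parts$lambda * parts$D %*% parts$A %*% t(parts$D)
all.equal(rebuilt, Sigma)
```

Constraining the volume, shape or orientation to be equal across clusters, or the shape to be the identity (spherical), yields the kind of constrained models described above.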

In the qualitative case, five multinomial models are available. They are based on a reparametrization of the multinomial probabilities.

In both cases, the models and the number of clusters can be chosen by different criteria: BIC (Bayesian Information Criterion), ICL (Integrated Completed Likelihood, a classification version of BIC), NEC (Normalized Entropy Criterion), or Cross-Validation (CV).
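
As an illustration of criterion-based selection, the base R sketch below computes BIC under one common convention, $-2 \log L + \nu \log n$ with $\nu$ the number of free parameters, where lower values are better. The convention, the helper `bic` and the log-likelihood values are assumptions made for the example; the package documentation should be checked for the exact definition it uses.

```r
# BIC under the convention -2 * logLik + (free parameters) * log(n);
# with this sign convention, smaller is better.
# Hypothetical helper, not an Rmixmod function.
bic <- function(loglik, n_free_params, n_obs) {
  -2 * loglik + n_free_params * log(n_obs)
}

# Made-up log-likelihoods of univariate Gaussian mixtures
# fitted with K = 1, ..., 4 components on n = 200 observations
n_obs  <- 200
K      <- 1:4
loglik <- c(-520, -430, -425, -423)

# Free parameters: (K - 1) proportions + K means + K variances
n_free <- 3 * K - 1

bic_values <- bic(loglik, n_free, n_obs)
best_K <- K[which.min(bic_values)]
data.frame(K = K, logLik = loglik, BIC = round(bic_values, 1))
```

The log-likelihood always increases with $K$, so the penalty term is what stops the criterion from selecting ever larger models.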

References

Biernacki C., Celeux G., Govaert G., Langrognet F., 2006. "Model-Based Cluster and Discriminant Analysis with the MIXMOD Software". Computational Statistics and Data Analysis, vol. 51/2, pp. 587-600.

Examples

# NOT RUN {
  library(Rmixmod)

  ## Clustering Analysis
  # load quantitative data set
  data(geyser)
  # Clustering in gaussian case
  xem1<-mixmodCluster(geyser,3)
  summary(xem1)
  plot(xem1)
  hist(xem1)

  # load qualitative data set
  data(birds)
  # Clustering in multinomial case
  xem2<-mixmodCluster(birds, 2)
  summary(xem2)
  barplot(xem2)

  # load heterogeneous data set
  data(finance)
  # Clustering in composite case
  xem3<-mixmodCluster(finance,2:6)
  summary(xem3)

  ## Discriminant Analysis
  # start by setting aside 10 observations from the iris data set
  remaining.obs <- sample(1:nrow(iris), 10)
  # then run a mixmodLearn() analysis without those 10 observations
  learn <- mixmodLearn(iris[-remaining.obs, 1:4], iris$Species[-remaining.obs])
  # create a MixmodPredict object to predict those 10 observations
  prediction <- mixmodPredict(data = iris[remaining.obs, 1:4], classificationRule = learn["bestResult"])
  # show results
  prediction
  # compare the predicted partition with the true species labels
  accuracy <- mean(as.integer(iris$Species[remaining.obs]) == prediction["partition"]) * 100
  paste("accuracy= ", accuracy, "%", sep = "")
  
# }