Learn R Programming

Model based clustering for mixed data: clustMD

Damien McParland March 22, 2017

This R package allows the user to perform model based clustering of mixed data (i.e. data that consist of continuous, binary, ordinal or nominal variables) using a parsimonious mixture of latent Gaussian variable models.

This model based clustering approach assumes that underlying the observed categorical response is a latent continuous variable. A finite mixture model is used to identify sub populations or clusters within the larger population.

Installation

The clustMD package can be easily installed in R as follows.

  install.packages("clustMD")

The Byar data set that is used in the examples is included in the package. This data set contains information on 475 prostate cancer patients. Measurements taken on these patients consist of continuous, binary, ordinal and nominal variables.

Functions

clustMD()

To use clustMD to cluster the Byar data set you may run the following code. The code consists of some simple pre-processing steps followed by the correct usage of the clustMD() function.

    data(Byar)
  # Transformation skewed variables
  Byar$Size.of.primary.tumour <- sqrt(Byar$Size.of.primary.tumour)
  Byar$Serum.prostatic.acid.phosphatase <- log(Byar$Serum.prostatic.acid.phosphatase)

    # Order variables (Continuous, ordinal, nominal)
    Y <- as.matrix(Byar[, c(1, 2, 5, 6, 8, 9, 10, 11, 3, 4, 12, 7)])

    # Start categorical variables at 1 rather than 0
    Y[, 9:12] <- Y[, 9:12] + 1

    # Standardise continuous variables
    Y[, 1:8] <- scale(Y[, 1:8])

    # Merge categories of EKG variable for efficiency
    Yekg <- rep(NA, nrow(Y))
    Yekg[Y[,12]==1] <- 1
    Yekg[(Y[,12]==2)|(Y[,12]==3)|(Y[,12]==4)] <- 2
    Yekg[(Y[,12]==5)|(Y[,12]==6)|(Y[,12]==7)] <- 3
    Y[, 12] <- Yekg

    res <- clustMD(X=Y, G=3, CnsIndx=8, OrdIndx=11, Nnorms=20000, 
    MaxIter=500, model="EVI", store.params=FALSE, scale=TRUE, 
    startCL="kmeans")

The clustMD() function outputs an S3 object of class clustMD. Basic S3 methods are included in the package also. The functions available are

  • print.clustMD()
  • summary.clustMD()
  • plot.clustMD()

The plot.clustMD() function produces a number of useful summary plots of the clustMD object.

clustMDparallel()

Another function is available to run multiple models in parallel called clustMDparallel(). This function takes a range of possible values for the number of clusters as a vector. It also takes a character vector as an input that specifies which of the covariance models are to be fitted.

  data(Byar)

  # Transformation skewed variables
  Byar$Size.of.primary.tumour <- sqrt(Byar$Size.of.primary.tumour)
  Byar$Serum.prostatic.acid.phosphatase <- 
  log(Byar$Serum.prostatic.acid.phosphatase)

  # Order variables (Continuous, ordinal, nominal)
  Y <- as.matrix(Byar[, c(1, 2, 5, 6, 8, 9, 10, 11, 3, 4, 12, 7)])

  # Start categorical variables at 1 rather than 0
  Y[, 9:12] <- Y[, 9:12] + 1

  # Standardise continuous variables
  Y[, 1:8] <- scale(Y[, 1:8])

  # Merge categories of EKG variable for efficiency
  Yekg <- rep(NA, nrow(Y))
  Yekg[Y[,12]==1] <- 1
  Yekg[(Y[,12]==2)|(Y[,12]==3)|(Y[,12]==4)] <- 2
  Yekg[(Y[,12]==5)|(Y[,12]==6)|(Y[,12]==7)] <- 3
  Y[, 12] <- Yekg

  res <- clustMDparallel(X=Y, G=1:3, CnsIndx=8, OrdIndx=11, Nnorms=20000, 
  MaxIter=500, models=c("EVI", "EII", "VII"), store.params=FALSE, 
  scale=TRUE, startCL="kmeans")

The clustMDparallel() function outputs an S3 object of class clustMDparallel. Some S3 methods are also available for this class:

  • print.clustMDparallel()
  • summary.clustMDparallel()
  • plot.clustMDparallel()

The plot.clustMDparallel() function outputs the same plots as plot.clustMD() but for the optimal model according to the approximated BIC criterion. An additional plot is also included that illustrated the approximated BIC values for the fitted models.

Copy Link

Version

Install

install.packages('clustMD')

Monthly Downloads

290

Version

1.2.1

License

GPL-2

Maintainer

Damien McParland

Last Published

May 8th, 2017

Functions in clustMD (1.2.1)

Byar

Byar prostate cancer data set.
E.step

E-step of the (MC)EM algorithm
M.step

M-step of the (MC)EM algorithm
ObsLogLikelihood

Approximates the observed log likelihood.
clustMDlist

Model Based Clustering for Mixed Data
clustMDparallel

Run multiple clustMD models in parallel
clustMDparcoord

Parallel coordinates plot adapted for
dtmvnom

Return the mean and covariance matrix of a truncated multivariate normal
clustMD-package

Model based clustering for mixed data: clustMD
clustMD

Model Based Clustering for Mixed Data
perc.cutoffs

Calculates the threshold parameters for ordinal variables.
plot.clustMD

Plotting method for objects of class
z.moments

Calculates the first and second moments of the latent data
plot.clustMDparallel

Summary plots for a clustMDparallel object
print.clustMD

Print basic details of
npars_clustMD

Calculates the number of free parameters for the
patt.equal

Check if response patterns are equal
stable.probs

Stable computation of the log of a sum
getOutput_clustMDparallel

Extracts relevant output from
modal.value

Calculate the mode of a sample
summary.clustMD

Summarise
z.moments_diag

Calculates the first and second moments of the latent data for diagonal models
print.clustMDparallel

Print basic details of
qfun

Helper internal function for
z.nom.diag

Transforms Monte Carlo simulated data into categorical data. Calculates
summary.clustMDparallel

Prints a summary of a clustMDparallel object to screen.
vec.outer

Calculate the outer product of a vector with itself