tclustfsda: Computes trimmed clustering with scatter restrictions

Description

Partitions the points in the n-by-p data matrix x into k clusters. This partition minimizes the trimmed sum, over all clusters, of the within-cluster sums of point-to-cluster-centroid distances. Rows of x correspond to points, columns correspond to variables. The output object of class tclustfsda.object contains an n-by-1 vector idx with the cluster index of each point. By default, tclustfsda() uses squared Mahalanobis distances, possibly constrained.

Usage

tclustfsda(x, k, alpha, restrfactor = 12, monitoring = FALSE,
  plot = FALSE, nsamp, refsteps = 15, reftol = 1e-13,
  equalweights = FALSE, mixt = 0, msg = TRUE, nocheck = FALSE,
  startv1 = 1, restrtype = c("eigen", "deter"), UnitsSameGroup,
  numpool, cleanpool, trace = FALSE, ...)

Arguments

x

An n x p data matrix (n observations and p variables). Rows of x represent observations, and columns represent variables.

Missing values (NA's) and infinite values (Inf's) are allowed, since observations (rows) with missing or infinite values will automatically be excluded from the computations.
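As a rough illustration of the row exclusion described above, the following base R sketch drops every row that contains a missing or infinite value. It only mimics the documented behaviour and is not the package's internal code; the matrix x_demo is made up for the example.

```r
# Hypothetical data: rows 2 and 4 contain NA / Inf
x_demo <- rbind(c(1.0, 2.0),
                c(NA,  3.0),
                c(4.0, 5.0),
                c(6.0, Inf))

# is.finite() is FALSE for NA, NaN, Inf and -Inf
complete_rows <- apply(is.finite(x_demo), 1, all)

# Keep only the rows in which every entry is finite
x_clean <- x_demo[complete_rows, , drop = FALSE]
nrow(x_clean)
```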

k

Number of groups.

alpha

A scalar between 0 and 0.5 or an integer specifying the number of observations which have to be trimmed. If alpha=0, tclust reduces to traditional model-based or mixture clustering (mclust): see for example the MATLAB function gmdistribution.

In more detail, if 0 < alpha < 1 clustering is based on h = fix(n * (1-alpha)) observations; otherwise, if alpha is an integer greater than 1, clustering is based on h = n - floor(alpha) observations. If monitoring=TRUE, alpha is a vector of trimming levels to consider; its elements must be decreasing and lie in the interval between 0 and 0.5. For example, if alpha=c(0.1, 0.05, 0), the function considers these three trimming levels. The default is alpha=c(0.1, 0.05, 0). The sequence is forced to be monotonically decreasing.
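The rule for the number of untrimmed observations h can be sketched in base R as follows. The helper h_from_alpha() is hypothetical (it is not part of the package), trunc() stands in for MATLAB's fix(), and treating alpha = 0 as "keep all n units" is an assumption consistent with the description above.

```r
# Number of untrimmed observations h implied by alpha (sketch of the rule above).
# trunc() plays the role of MATLAB's fix(), i.e. rounding towards zero.
h_from_alpha <- function(n, alpha) {
  if (alpha > 0 && alpha < 1) {
    trunc(n * (1 - alpha))   # alpha given as a proportion
  } else {
    n - floor(alpha)         # alpha given as a count (alpha = 0 keeps all n units)
  }
}

n <- 200
h_from_alpha(n, 0.25)  # proportion: 25 per cent trimming
h_from_alpha(n, 10)    # count: trim 10 observations
h_from_alpha(n, 0)     # no trimming

# A monitoring sequence sorted into decreasing order
alpha_seq <- sort(c(0, 0.1, 0.05), decreasing = TRUE)
```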

restrfactor

Positive scalar which constrains the allowed differences among group scatters. Larger values imply larger differences of group scatters. On the other hand, a value of 1 specifies the strongest restriction, forcing all eigenvalues/determinants to be equal, so that the method looks for similarly scattered (respectively spherical) clusters. The default is to apply restrfactor to eigenvalues. In order to apply restrfactor to determinants it is necessary to use the optional input argument restrtype.
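To make the eigenvalue restriction more concrete, the sketch below assumes that the constraint bounds the ratio between the largest and the smallest eigenvalue of all group scatter matrices by restrfactor (the formulation used in Garcia-Escudero et al., 2008). The helper eigen_ratio() and the two covariance matrices are hypothetical and only illustrate how a set of scatter matrices could be checked against a given restriction factor.

```r
# Ratio between the largest and smallest eigenvalue over all group scatter matrices
eigen_ratio <- function(sigmas) {
  ev <- unlist(lapply(sigmas, function(S) {
    eigen(S, symmetric = TRUE, only.values = TRUE)$values
  }))
  max(ev) / min(ev)
}

# Hypothetical scatter matrices for two groups
S1 <- diag(c(1, 4))
S2 <- diag(c(2, 8))

ratio <- eigen_ratio(list(S1, S2))   # largest eigenvalue 8, smallest 1

ratio <= 12   # compatible with the default restrfactor = 12
ratio <= 5    # would violate restrfactor = 5
ratio <= 1    # only equal eigenvalues satisfy restrfactor = 1
```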

monitoring

If monitoring=TRUE, TCLUST is performed for a series of values of the trimming factor alpha, given k (number of groups) and c (restriction factor). To speed up the computations, parfor is used.

plot

If plot=FALSE (default) or plot=0 no plot is produced. If plot=TRUE and monitoring=FALSE a plot with the classification is shown (using the spmplot function). The plot can be:

for p = 1, a histogram of the univariate data,
for p = 2, a bivariate scatterplot,
for p > 2, a scatterplot matrix generated by the MATLAB function spmplot().

When p >= 2 the following additional features are offered (for p = 1 the behaviour is forced to be as for plot=TRUE):

plot = 'contourf' adds in the background of the bivariate scatterplots a filled contour plot. The colormap of the filled contour is based on grey levels as default. This argument may also be inserted in a field named type of a list. In the latter case it is possible to specify the additional field cmap, which changes the default colors of the color map used. The field cmap may be a three-column matrix of values in the range [0,1] where each row is an RGB triplet that defines one color. Check the colormap() (MATLAB) function for additional information.
plot = 'contour' adds in the background of the bivariate scatterplots a contour plot. The colormap of the contour is based on grey levels as default. This argument may also be inserted in a field named type of a list. In the latter case it is possible to specify the additional field cmap, which changes the default colors of the color map used. The field cmap may be a three-column matrix of values in the range [0,1] where each row is an RGB triplet that defines one color. Check the colormap() (MATLAB) function for additional information.
plot = 'ellipse' superimposes confidence ellipses on each group in the bivariate scatterplots. The size of the ellipse is qchisq(0.95, 2), i.e. the confidence level used by default is 95 percent. This argument may also be inserted in a field named type of a list. In the latter case it is possible to specify the additional field conflev, which gives the confidence level to use as a value between 0 and 1.
plot = 'boxplotb' superimposes on the bivariate scatterplots the bivariate boxplots for each group, using the boxplotb function. This argument may also be inserted in a field named type of a list.
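The role of qchisq(0.95, 2) as the "size" of a confidence ellipse can be illustrated with a small base R sketch. The centre mu and covariance Sigma are made-up values, and the construction shown (points at a constant squared Mahalanobis distance from the centre) is a standard way to draw such an ellipse, not necessarily the one used internally by the plotting code.

```r
conflev <- 0.95
r2 <- qchisq(conflev, df = 2)   # squared Mahalanobis radius of the ellipse

# Hypothetical group centroid and covariance matrix
mu <- c(0, 0)
Sigma <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)

# Points on the unit circle, mapped through the Cholesky factor of Sigma
theta <- seq(0, 2 * pi, length.out = 100)
circle <- rbind(cos(theta), sin(theta))
ell <- t(mu + sqrt(r2) * t(chol(Sigma)) %*% circle)

# Every point of the ellipse has squared Mahalanobis distance r2 from mu
range(mahalanobis(ell, center = mu, cov = Sigma))

# plot(ell, type = "l", asp = 1)   # uncomment to draw the ellipse
```

A smaller conflev, such as 0.5, gives a smaller r2 and therefore a tighter ellipse around the group centroid.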

The parameter plot can also be a list, in which case its elements are:

type - specifies the type of plot, as when the plot option is a character. Therefore, plot$type can be one of 'contourf', 'contour', 'ellipse' or 'boxplotb'.
cmap - used to set a colormap for the plot type (MATLAB style). For example, plot$cmap = 'autumn'. See the MATLAB help of colormap for a list of colormap possibilities.
conflev - this is the confidence level for the confidence ellipses. It must be a scalar between 0 and 1.

If plot=TRUE and monitoring=TRUE two plots are shown. The first plot (monitor plot) shows three panels with the monitoring between two consecutive values of alpha: (i) the change in classification using ARI index (top panel), (ii) the change in centroids using squared euclidean distances (central panel) and (iii) the change in covariance matrices using squared euclidean distance (bottom panel).
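The quantities shown in the monitor plot can be sketched in base R. The function ari() below is a plain implementation of the adjusted Rand index written for this illustration (it is not taken from the package), the label vectors and centroid matrices are hypothetical, and how trimmed units enter the index in the actual implementation is not specified here.

```r
# Adjusted Rand index between two label vectors of equal length
ari <- function(a, b) {
  tab <- table(a, b)
  n_pairs <- function(m) m * (m - 1) / 2
  s_ij <- sum(n_pairs(tab))
  s_a <- sum(n_pairs(rowSums(tab)))
  s_b <- sum(n_pairs(colSums(tab)))
  expected <- s_a * s_b / n_pairs(length(a))
  (s_ij - expected) / ((s_a + s_b) / 2 - expected)
}

# Hypothetical classifications for two consecutive values of alpha
idx_prev <- c(1, 1, 1, 2, 2, 2)
idx_curr <- c(2, 2, 2, 1, 1, 1)
ari_change <- ari(idx_prev, idx_curr)   # same partition up to a label switch

# Hypothetical centroids (one row per group) for the same two values of alpha
cent_prev <- rbind(c(0, 0), c(5, 5))
cent_curr <- rbind(c(0, 1), c(5, 5))
cent_change <- sum((cent_curr - cent_prev)^2)   # squared Euclidean distance
```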

The second plot (gscatter plot) shows a series of subplots which monitor the classification for each value of alpha. In order to make sure that consistent labels are used for the groups, between two consecutive values of alpha, we assign label r to a group if this group shows the smallest distance with group r for the previous value of alpha. The type of plot which is used to monitor the stability of the classification depends on the data dimensionality p.

for p = 1, a histogram of the univariate data (the MATLAB function histFS() is called),
for p = 2, a bivariate scatterplot (the MATLAB function gscatter() is called),
for p > 2, a scatterplot of the first two principal components (function gscatter() is called and we show on the axes titles the percentage of variance explained by the first two principal components).
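The label-matching rule described above (a group receives label r if it is closest to group r at the previous value of alpha) can be sketched as follows. The centroids are hypothetical, distances are taken between centroids as an assumption, and this simple nearest-centroid rule does not guard against two groups receiving the same label.

```r
# Hypothetical centroids (one row per group) at the previous and current alpha
cent_prev <- rbind(c(0, 0), c(5, 5), c(10, 0))
cent_curr <- rbind(c(9.8, 0.1), c(0.2, -0.1), c(5.1, 4.9))

# d2[i, r]: squared Euclidean distance between current group i and previous group r
d2 <- sapply(seq_len(nrow(cent_prev)), function(r) {
  rowSums(sweep(cent_curr, 2, cent_prev[r, ])^2)
})

# Each current group takes the label of the closest previous group
new_label <- apply(d2, 1, which.min)

# Relabel a hypothetical classification vector accordingly
idx_curr <- c(1, 1, 2, 3, 3)
idx_relabelled <- new_label[idx_curr]
```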

Also in the case of monitoring=TRUE the parameter plot can be a list and its elements are:

name: character vector which specifies which plot to display. name = "gscatter" produces a figure with a series of subplots which show the classification for each value of alpha. name = "monitor" shows a figure with three panels which monitor, between two consecutive values of alpha, the change in classification using the ARI index (top panel), the change in centroids using squared Euclidean distances (central panel) and the change in covariance matrices using squared Euclidean distance (bottom panel). If this field is not specified, by default name=c("gscatter", "monitor") and both figures will be shown.
alphasel: a numeric vector which specifies the values of alpha for which the classification is shown. For example, if alphasel = c(0.05, 0.02), the classification will be shown just for alpha=0.05 and alpha=0.02. If this field is not specified, alphasel=alpha and therefore the classification is shown for each value of alpha.

nsamp

If a scalar, it contains the number of subsamples which will be extracted. If nsamp = 0 all subsets will be extracted. Remark: if the number of all possible subsets is smaller than 300 the default is to extract all subsets, otherwise just 300. If nsamp is a matrix, its rows contain the indexes of the subsets which have to be extracted. In this case nsamp can be conveniently generated by the function subsets(). nsamp can have k columns or k * (p + 1) columns. If nsamp has k columns, the k initial centroids in each iteration i are given by x[nsamp[i, ], ] and the covariance matrices are equal to the identity.

If nsamp has k * (p + 1) columns, the initial centroids and covariance matrices in iteration i are computed as follows:

X1 <- x[nsamp[i, ], ]
colMeans(X1[1:(p + 1), ]) contains the initial centroid for group 1
cov(X1[1:(p + 1), ]) contains the initial cov matrix for group 1
colMeans(X1[(p + 2):(2*p + 2), ]) contains the initial centroid for group 2
cov(X1[(p + 2):(2*p + 2), ]) contains the initial cov matrix for group 2
...
colMeans(X1[((k-1)*(p+1) + 1):(k*(p+1)), ]) contains the initial centroid for group k
cov(X1[((k-1)*(p+1) + 1):(k*(p+1)), ]) contains the initial cov matrix for group k.

REMARK: If nsamp is not a scalar, the option startv1 given below is ignored. More precisely, if nsamp has k columns the function behaves as with startv1=0, whereas if nsamp has k*(p+1) columns it behaves as with startv1=1.
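The initialisation from a matrix nsamp with k * (p + 1) columns can be sketched in base R. The data, the number of subsets and the way the index matrix is drawn (with sample() rather than subsets()) are assumptions made only for this illustration.

```r
set.seed(1)
n <- 50; p <- 2; k <- 3
x_demo <- matrix(rnorm(n * p), ncol = p)   # hypothetical data

# Four hypothetical subsets, each with k * (p + 1) distinct row indexes
nsamp <- t(replicate(4, sample(n, k * (p + 1))))

i <- 1                          # iteration (row of nsamp) to inspect
X1 <- x_demo[nsamp[i, ], ]

# Group g uses rows ((g - 1) * (p + 1) + 1):(g * (p + 1)) of X1
init <- lapply(seq_len(k), function(g) {
  rows <- ((g - 1) * (p + 1) + 1):(g * (p + 1))
  block <- X1[rows, , drop = FALSE]
  list(centroid = colMeans(block), cov = cov(block))
})

init[[1]]$centroid   # initial centroid for group 1
init[[1]]$cov        # initial covariance matrix for group 1
```

Note the parentheses in 1:(p + 1): in R the expression 1:p + 1 evaluates to 2:(p + 1), which would skip the first row.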

refsteps

Number of refining iterations in each subsample. Default is refsteps=15. refsteps = 0 means "raw-subsampling" without iterations.

reftol

Tolerance of the refining steps. The default value is 1e-13.

equalweights

A logical specifying whether cluster weights in the concentration and assignment steps shall be considered. If equalweights=TRUE we are (ideally) assuming equally sized groups, whereas if equalweights=FALSE (default) we allow for different group weights. Please check in the given references which functions are maximized in both cases.

mixt

Specifies whether the mixture modelling or the crisp assignment approach to model-based clustering is used. If mixt >= 1 mixture modelling is assumed, otherwise crisp assignment. The default value is mixt=0, i.e. crisp assignment. In the case of mixture modelling, mixt also controls the criterion used to select the h untrimmed units in each step of the maximization: if mixt=2 the h units are those which give the largest contribution to the likelihood, whereas if mixt=1 the criterion is exactly the same as the one used in crisp assignment. See the provided references for details.

msg

Controls whether or not messages are displayed on the screen. If msg=TRUE (default) messages are displayed on the screen. If msg=2, detailed messages are displayed, for example the information at iteration level.

nocheck

Check input arguments. If nocheck=TRUE no check is performed on matrix x. The default is nocheck=FALSE.

startv1

How to initialize centroids and covariance matrices. Scalar. If startv1=1 then initial centroids and covariance matrices are based on (p+1) randomly chosen observations; otherwise each centroid is initialized taking a random row of the input data matrix and covariance matrices are initialized with identity matrices. The default value is startv1=1.

Remark 1: in order to start with a routine which is in the required parameter space, eigenvalue restrictions are immediately applied.

Remark 2: option startv1 is used only if nsamp is a scalar (see the help associated with nsamp for more details).

restrtype

Type of restriction to be applied on the cluster scatter matrices. Valid values are 'eigen' (default), or 'deter'. "eigen" implies restriction on the eigenvalues while "deter" implies restriction on the determinants.

UnitsSameGroup

List of the units which must (whenever possible) have a particular label. For example, UnitsSameGroup=c(20, 26) means that the group which contains unit 20 is always labelled with number 1. Similarly, the group which contains unit 26 is always labelled with number 2 (unless it is found that unit 26 already belongs to group 1). In general, the group which contains unit UnitsSameGroup[r], where r=2, ..., length(kk)-1, is labelled with number r (unless it is found that unit UnitsSameGroup[r] has already been assigned to groups 1, 2, ..., r-1).
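The relabelling rule can be sketched in base R as a sequence of label swaps. The helper relabel_units() is hypothetical and only mirrors the description above; coding trimmed units with label 0 is an assumption made for this example.

```r
# Swap labels so that the group containing units[r] is labelled r,
# unless that unit already belongs to one of the groups 1, ..., r - 1.
relabel_units <- function(idx, units) {
  for (r in seq_along(units)) {
    g <- idx[units[r]]
    if (g == 0 || g < r) next          # trimmed unit, or group already fixed
    if (g != r) {
      is_g <- idx == g
      is_r <- idx == r
      idx[is_g] <- r                   # group g takes label r ...
      idx[is_r] <- g                   # ... and the old group r takes label g
    }
  }
  idx
}

# Hypothetical classification of 7 units (0 = trimmed)
idx <- c(2, 2, 3, 3, 1, 1, 0)

# The group of unit 1 should be labelled 1, the group of unit 3 labelled 2
relabel_units(idx, units = c(1, 3))
```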

numpool

The number of parallel sessions to open. If numpool is not defined, then it is set equal to the number of physical cores in the computer.

cleanpool

Logical, indicating whether the open pool must be closed or not. It is useful to leave it open if there are subsequent parallel sessions to execute, so as to save the time required to open a new pool.

trace

Whether to print intermediate results. Default is trace=FALSE.

...

potential further arguments passed to lower level functions.

Value

Depending on the input parameter monitoring, one of the following objects will be returned:

tclustfsda.object
tclusteda.object

Details

This iterative algorithm initializes k clusters randomly and performs concentration steps in order to improve the current cluster assignment. The maximum number of concentration steps to be performed is given by the input parameter refsteps. To approximate the global optimum, the system is initialized nsamp times and concentration steps are performed until convergence or until refsteps is reached. When processing more complex data sets, higher values of nsamp and refsteps have to be specified (obviously implying extra computation time). However, if more than 10 per cent of the iterations do not converge, a warning message is issued, indicating that nsamp has to be increased.
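The structure of the procedure (random starts, concentration steps, trimming of the most distant units) can be sketched with a deliberately simplified base R function. This is not the algorithm implemented by tclustfsda(): it uses plain squared Euclidean distances, no scatter restrictions and no group weights, i.e. it is a trimmed k-means sketch. The function name, the convergence tolerance and the coding of trimmed units as 0 are assumptions made for the illustration.

```r
trimmed_kmeans_sketch <- function(x, k, alpha, nsamp = 20, refsteps = 15) {
  n <- nrow(x)
  h <- trunc(n * (1 - alpha))            # number of untrimmed units
  best <- list(obj = Inf)
  for (s in seq_len(nsamp)) {            # random initialisations
    cent <- x[sample(n, k), , drop = FALSE]
    for (it in seq_len(refsteps)) {      # concentration steps
      # squared distances of every unit to every centroid (n x k matrix)
      d2 <- sapply(seq_len(k), function(g) rowSums(sweep(x, 2, cent[g, ])^2))
      lab <- max.col(-d2, ties.method = "first")   # closest centroid
      dmin <- d2[cbind(seq_len(n), lab)]
      keep <- order(dmin)[seq_len(h)]    # the h closest units are retained
      idx <- integer(n)                  # 0 marks trimmed units
      idx[keep] <- lab[keep]
      new_cent <- cent
      for (g in seq_len(k)) {
        if (any(idx == g)) new_cent[g, ] <- colMeans(x[idx == g, , drop = FALSE])
      }
      if (max(abs(new_cent - cent)) < 1e-10) break   # converged
      cent <- new_cent
    }
    obj <- sum(dmin[keep])               # trimmed within-cluster sum of squares
    if (obj < best$obj) best <- list(obj = obj, idx = idx, centroids = cent)
  }
  best
}

# Hypothetical data: two tight groups plus uniform background noise
set.seed(123)
x_demo <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
                matrix(rnorm(100, mean = 6, sd = 0.5), ncol = 2),
                matrix(runif(40, min = -20, max = 20), ncol = 2))
fit <- trimmed_kmeans_sketch(x_demo, k = 2, alpha = 0.25)
table(fit$idx)
```

Keeping the best of several random starts is what makes larger values of nsamp helpful on harder data sets, at the price of extra computation time.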

References

Garcia-Escudero, L.A., Gordaliza, A., Matran, C. and Mayo-Iscar, A. (2008). A General Trimming Approach to Robust Cluster Analysis. Annals of Statistics, Vol. 36, 1324-1345. [Technical Report available at: http://www.eio.uva.es/inves/grupos/representaciones/trTCLUST.pdf]

Examples

# NOT RUN {
 data(hbk)
 (out <- tclustfsda(hbk[, 1:3], k=2))
 class(out)
 summary(out)

 ##  TCLUST of Geyser data with three groups (k=3), 10% trimming (alpha=0.1)
 ##      and restriction factor (c=10000)
 data(geyser2)
 (out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000))

 ## Use the plot options to produce more complex plots ----------

 ##  Plot with all default options
 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=TRUE)

 ##  Default confidence ellipses.
 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot="ellipse")

 ##  Confidence ellipses specified by the user: confidence ellipses set to 0.5
 plots <- list(type="ellipse", conflev=0.5)
 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=plots)

 ##  Confidence ellipses set to 0.9
 plots <- list(type="ellipse", conflev=0.9)
 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=plots)

 ##  Contour plots
 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot="contour")

 ##  Filled contour plots with additional options: contourf plot with autumn colormap
 plots <- list(type="contourf", cmap="autumn")
 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=plots)

 ##  We compare the output using three different values of restriction factor
 ##      nsamp is the number of subsamples which will be extracted
 data(geyser2)
 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, nsamp=500, plot="ellipse")
 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10, nsamp=500, refsteps=10, plot="ellipse")
 out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=1, nsamp=500, refsteps=10, plot="ellipse")

 ##  TCLUST applied to M5 data: A bivariate data set obtained from three normal
 ##  bivariate distributions with different scales and proportions 1:2:2. One of the
 ##  components is very overlapped with another one. A 10 per cent background noise is
 ##  added uniformly distributed in a rectangle containing the three normal components
 ##  and not very overlapped with the three mixture components. A precise description
 ##  of the M5 data set can be found in Garcia-Escudero et al. (2008).
 ##

 data(M5data)
 plot(M5data[, 1:2])

 ##  Scatter plot matrix
 plot(CovClassic(M5data[,1:2]), which="pairs")

 out <- tclustfsda(M5data[,1:2], k=3, alpha=0, restrfactor=1000, nsamp=100, plot=TRUE)
 out <- tclustfsda(M5data[,1:2], k=3, alpha=0, restrfactor=10, nsamp=100, plot=TRUE)
 out <- tclustfsda(M5data[,1:2], k=3, alpha=0.1, restrfactor=1, nsamp=1000,
         plot=TRUE, equalweights=TRUE)
 out <- tclustfsda(M5data[,1:2], k=3, alpha=0.1, restrfactor=1000, nsamp=100, plot=TRUE)

 ##  TCLUST with simulated data: 5 groups and 5 variables
 ##
 n1 <- 100
 n2 <- 80
 n3 <- 50
 n4 <- 80
 n5 <- 70
 p <- 5
 Y1 <- matrix(rnorm(n1 * p) + 5, ncol=p)
 Y2 <- matrix(rnorm(n2 * p) + 3, ncol=p)
 Y3 <- matrix(rnorm(n3 * p) - 2, ncol=p)
 Y4 <- matrix(rnorm(n4 * p) + 2, ncol=p)
 Y5 <- matrix(rnorm(n5 * p), ncol=p)

 group <- c(rep(1, n1), rep(2, n2), rep(3, n3), rep(4, n4), rep(5, n5))
 Y <- Y1
 Y <- rbind(Y, Y2)
 Y <- rbind(Y, Y3)
 Y <- rbind(Y, Y4)
 Y <- rbind(Y, Y5)
 dim(Y)
 table(group)
 out <- tclustfsda(Y, k=5, alpha=0.05, restrfactor=1.3, refsteps=20, plot=TRUE)

 ##  Automatic choice of best number of groups for Geyser data ------------------------
 ##
 data(geyser2)
 maxk <- 6
 CLACLA <- matrix(0, nrow=maxk, ncol=2)
 CLACLA[,1] <- 1:maxk
 MIXCLA <- MIXMIX <- CLACLA

 for(j in 1:maxk) {
     out <- tclustfsda(geyser2, k=j, alpha=0.1, restrfactor=5, msg=FALSE)
     CLACLA[j, 2] <- out$CLACLA
 }

 for(j in 1:maxk) {
     out <- tclustfsda(geyser2, k=j, alpha=0.1, restrfactor=5, mixt=2, msg=FALSE)
     MIXMIX[j ,2] <- out$MIXMIX
     MIXCLA[j, 2] <- out$MIXCLA
 }

 oldpar <- par(mfrow=c(1,3))
 plot(CLACLA[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="CLACLA")
 plot(MIXMIX[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="MIXMIX")
 plot(MIXCLA[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="MIXCLA")
 par(oldpar)


 ##  Monitoring examples ------------------------------------------

 ##  Monitoring using Geyser data

 ##  Monitoring using Geyser data (all default options)
 ##  alpha and restriction factor are not specified therefore alpha=c(0.10, 0.05, 0)
 ##  is used while the restriction factor is set to c=12
 out <- tclustfsda(geyser2, k=3, monitoring=TRUE)

 ##  Monitoring using Geyser data with alpha and c specified.
 out <- tclustfsda(geyser2, k=3, restrfactor=100, alpha=seq(0.10, 0, by=-0.01), monitoring=TRUE)

 ##  Monitoring using Geyser data with plot argument specified as a list.
 ##      The trimming levels to consider in this case are 31 values of alpha
 ##
 out <- tclustfsda(geyser2, k=3, restrfactor=100, alpha=seq(0.30, 0, by=-0.01), monitoring=TRUE,
         plot=list(alphasel=c(0.2, 0.10, 0.05, 0.01)), trace=TRUE)

 ##  Monitoring using Geyser data with argument UnitsSameGroup
 ##
 ##      Make sure that the group containing unit 10 is labelled "group 1"
 ##      and the group containing unit 12 is labelled "group 2"
 ##
 ##      Mixture model is used (mixt=2)
 ##
 out <- tclustfsda(geyser2, k=3, restrfactor=100, alpha=seq(0.30, 0, by=-0.01), monitoring=TRUE,
         mixt=2, UnitsSameGroup=c(10, 12), trace=TRUE)

 ##  Monitoring using M5 data
 data(M5data)

 ##  alphavec=vector which contains the trimming levels to consider
 ##  in this case 6 values of alpha are considered
 alphavec <- seq(0.10, 0, by=-0.02)
 out <- tclustfsda(M5data[, 1:2], 3, alpha=alphavec, restrfactor=1000, monitoring=TRUE,
     nsamp=1000, plot=TRUE)
 
# }
