tclust: General Trimming Approach to Robust Cluster Analysis

Description

tclust searches for k (or fewer) clusters with different covariance structures in a data matrix x. Relative cluster scatter can be restricted by the constant restr.fact. To make the estimation robust, a proportion alpha of the observations may be trimmed. tkmeans implements a robust version of the k-means method based on data-driven trimming. In particular, the trimmed k-means method is obtained from tclust by setting restr = "eigen", restr.fact = 1 and equal.weights = TRUE.

Usage

tclust(x, k = 3, alpha = 0.05, niter = 50, Ksteps = 10, restr = c("eigen", "deter", "sigma"), restr.fact = 2, equal.weights = FALSE, center, scale, store.x = TRUE, drop.empty.clust = TRUE, trace = 0, zero.tol = 1e-16)
tkmeans(x, k = 3, alpha = 0.05, niter = 50, Ksteps = 10, center, scale, store.x = TRUE, drop.empty.clust = TRUE, trace = 0, zero.tol = 1e-16)

Arguments

x

A matrix or data frame of dimension n x p, containing the observations (row-wise).

k

The number of clusters initially searched for.

alpha

The proportion of observations to be trimmed.

niter

The number of random initializations to be performed.

Ksteps

The maximum number of concentration steps to be performed. The concentration steps stop as soon as two consecutive steps lead to the same data partition.

restr

The type of eigenvalue restriction to be applied. Valid values are "eigen" (default), "deter" and "sigma". See details for further explanation.

restr.fact

The constant restr.fact >= 1 constrains the allowed differences among group scatters. A larger value implies larger differences of group scatters. When using restr = "sigma" this parameter is not considered, as all cluster varian

equal.weights

A logical value, specifying whether equal cluster weights (TRUE) or not (FALSE) shall be considered in the concentration and assignment steps.

center, scale

A center and a scale vector, each of length p, which can optionally be specified for centering and scaling x before calculation.

store.x

A logical value, specifying whether the data matrix x shall be included in the result structure. By default this value is set to TRUE, because functions plot.tclust and <

drop.empty.clust

A logical value specifying whether empty clusters shall be omitted from the resulting object. (The result structure then no longer contains center and covariance estimates of empty clusters. Cluster names are reassigned such that the first l cl

trace

Defines the tracing level, which is set to 0 by default. Tracing level 2 gives additional information on the iteratively decreasing objective function's value.

zero.tol

The zero tolerance used. By default set to 1e-16.
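
The center and scale arguments described above are plain numeric vectors of length p. The following base R sketch shows one way such vectors could be built and what column-wise centering and scaling does to a data matrix. The data are made up, and the assumption that centering means subtraction and scaling means division is an illustration, not a statement about the internals of tclust.

```r
# Made-up 4 x 2 data matrix, for illustration only
x <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2)

cen <- colMeans(x)        # length-p center vector
scl <- apply(x, 2, sd)    # length-p scale vector

# Subtract the center from each column, then divide by the scale
x_std <- sweep(sweep(x, 2, cen, "-"), 2, scl, "/")
```

Vectors like cen and scl could then be supplied as center = cen and scale = scl. Robust alternatives such as column medians and MADs can be built the same way with apply.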

Value

The function returns an S3 object of class tclust, containing the following values:

center

A matrix of size p x k containing the centers (column-wise) of each cluster.

cov

An array of size p x p x k containing the covariance matrices of each cluster.

assign

A numerical vector of size n containing the cluster assignment for each observation. Cluster names are integer numbers from 1 to k; 0 indicates trimmed observations.

par

A list containing the parameters the algorithm has been called with (x, if not suppressed by store.x = FALSE, k, alpha, restr.fact, niter, Ksteps, and equal.weights).

k

The (final) resulting number of clusters. Some solutions with a smaller number of clusters might be found when using the option equal.weights = FALSE.

obj

The value of the objective function of the best (returned) solution.

clustsize

An integer vector of size k, giving the number of observations contained in each cluster.

weights

A numerical vector of length k, containing the weights of each cluster.

iter.converged

The number of converged iterations.
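
As a small illustration of the assignment coding described above (0 for trimmed observations, 1 to k for clusters), the following base R sketch separates trimmed from clustered observations. The vector assign_vec is made up and merely stands in for the assign component of a fitted object.

```r
# Made-up assignment vector standing in for the assign component
assign_vec <- c(1, 2, 0, 1, 2, 2, 0, 1)

trimmed   <- assign_vec == 0           # TRUE for trimmed observations
n_trimmed <- sum(trimmed)              # number of trimmed observations
sizes     <- table(assign_vec[!trimmed])  # observations per cluster
```

Indexing the rows of the data matrix with !trimmed would keep only the observations that were assigned to a cluster.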

Details

This iterative algorithm initializes k clusters randomly and performs "concentration steps" in order to improve the current cluster assignment. The maximum number of concentration steps to be performed is given by Ksteps. To approximate the global optimum, the system is initialized niter times, and concentration steps are performed until convergence or until Ksteps is reached. When processing more complex data sets, higher values of niter and Ksteps have to be specified (obviously implying extra computation time). However, if more than half of the iterations do not converge, a warning message is issued, indicating that Ksteps has to be increased.

The parameter restr defines the restrictions on the clusters' shapes, which are applied to all clusters during each iteration. Options "eigen"/"deter" restrict the ratio between the maximum and minimum eigenvalue/determinant of all clusters' covariance structures to the parameter restr.fact. Setting restr.fact to 1 yields the strongest restriction, forcing all eigenvalues/determinants to be equal, so that the method looks for spherical (respectively similarly scattered) clusters. Option "sigma" is a simpler restriction, which averages the covariance structures during each iteration (weighted by cluster sizes) in order to get similar (equal) cluster scatters.
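
The ratios that the "eigen" and "deter" options constrain can be made concrete with a short base R sketch. The helpers eigen_ratio and det_ratio are hypothetical and not part of tclust; they only compute the quantities that would be compared against restr.fact, for a made-up p x p x k array of covariance matrices.

```r
# Hypothetical helpers (not part of tclust): compute the ratios that
# would be compared against restr.fact.
eigen_ratio <- function(covs) {
  # covs: p x p x k array, one covariance matrix per cluster
  ev <- apply(covs, 3, function(s) {
    eigen(s, symmetric = TRUE, only.values = TRUE)$values
  })
  max(ev) / min(ev)   # largest over smallest eigenvalue, across all clusters
}

det_ratio <- function(covs) {
  d <- apply(covs, 3, det)
  max(d) / min(d)     # largest over smallest determinant, across all clusters
}

# Two made-up 2 x 2 covariance matrices
covs <- array(0, dim = c(2, 2, 2))
covs[, , 1] <- diag(c(1, 4))   # elongated cluster
covs[, , 2] <- diag(c(2, 2))   # spherical cluster

r_eigen <- eigen_ratio(covs)   # eigenvalues range from 1 to 4
r_det   <- det_ratio(covs)     # both determinants equal 4
```

In this example the determinant ratio is 1 while the eigenvalue ratio is 4: the two clusters have equal overall scatter but are not both spherical, which is the difference between the two restriction types at restr.fact = 1.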

References

Garcia-Escudero, L.A.; Gordaliza, A.; Matran, C. and Mayo-Iscar, A. (2008), "A General Trimming Approach to Robust Cluster Analysis". Annals of Statistics, Vol. 36, 1324-1345. Technical Report available at www.eio.uva.es/inves/grupos/representaciones/trTCLUST.pdf

Examples

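#--- EXAMPLE 1 ------------------------------------------
library (tclust)
library (mvtnorm)  # provides rmvnorm

# Simulate two Gaussian groups plus a widely scattered third group
sig <- diag (2)
cen <- rep (1,2)
x <- rbind (
	rmvnorm (360, cen * 0,   sig),
	rmvnorm (540, cen * 5,   sig * 6 - 2),
	rmvnorm (100, cen * 2.5, sig * 50)
)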

# Two groups and 10% trimming level
clus <- tclust (x, k=2, alpha=0.1, restr.fact=12)
plot (clus)
plot (clus, labels = "observation")
plot (clus, labels = "cluster")

# Three groups (one of them very scattered) and 0% trimming level
clus <- tclust (x, k=3, alpha=0.0, restr.fact = 50)
plot (clus)

#--- EXAMPLE 2 ------------------------------------------
data (geyser2)
clus <- tkmeans(geyser2,k=3, alpha=0.03)
plot(clus)

#--- EXAMPLE 3 ------------------------------------------
data (M5data)
x <- M5data[,1:2]

clus.a <- tclust (x,k = 3, alpha=0.1, restr.fact = 1, restr= "eigen", equal.weights = TRUE)
clus.b <- tclust (x,k = 3, alpha=0.1, restr.fact = 1, restr= "sigma", equal.weights = TRUE)
clus.c <- tclust (x,k = 3, alpha=0.1, restr.fact = 1, restr= "deter", equal.weights = TRUE)
clus.d <- tclust (x,k = 3, alpha=0.1, restr.fact = 50, restr= "deter", equal.weights = FALSE)
par(mfrow=c(2,2))
plot(clus.a,main="(a) tkmeans")
plot(clus.b,main="(b) Gallegos and Ritter")
plot(clus.c,main="(c) Gallegos")
plot(clus.d,main="(d) tclust")

#--- EXAMPLE 4 ------------------------------------------
data (swissbank)
# Two clusters and 8% trimming level
clus <- tclust(swissbank,k = 2, alpha=0.08, restr.fact = 15)
pairs(swissbank,col=clus$assign+1) # Pairs plot of the clustering solution
# Two coordinates
plot(swissbank[,4],swissbank[,6],col=clus$assign+1,
     xlab="Distance of the inner frame to lower border",
     ylab="Length of the diagonal")
plot(clus)

# Three clusters and 0% trimming level
clus <- tclust(swissbank,k = 3, alpha=0.0, restr.fact = 15)
pairs(swissbank,col=clus$assign+1) # Pairs plot of the clustering solution
# Two coordinates
plot(swissbank[,4],swissbank[,6],col=clus$assign+1,
     xlab="Distance of the inner frame to lower border",
     ylab="Length of the diagonal")
plot(clus)