tclust: General Trimming Approach to Robust Cluster Analysis

Description

tclust searches for k (or fewer) clusters with different covariance structures in a data matrix x. Relative cluster scatter can be restricted by a constant value restr.fact. To make the estimation robust, a proportion alpha of observations may be trimmed. In particular, the trimmed k-means method (tkmeans) is obtained from tclust by setting restr = "eigen", restr.fact = 1 and equal.weights = TRUE.

Usage

tclust (x, k = 3, alpha = 0.05, nstart = 50, iter.max = 20, 
        restr = c ("eigen", "deter", "sigma"), restr.fact = 12, 
        equal.weights = FALSE, center, scale, store.x = TRUE, 
        drop.empty.clust = TRUE, trace = 0, warnings = 3, 
        zero.tol = 1e-16)

Arguments

x

A matrix or data.frame of dimension n x p, containing the observations (row-wise).

k

The number of clusters initially searched for.

alpha

The proportion of observations to be trimmed.

nstart

The number of random initializations to be performed.

iter.max

The maximum number of concentration steps to be performed. Concentration steps stop as soon as two consecutive steps lead to the same data partition.

restr

The type of restriction to be applied to the cluster scatter matrices. Valid values are "eigen" (default), "deter" and "sigma". See the Details section for further explanation.

restr.fact

The constant restr.fact >= 1 constrains the allowed differences among group scatters. Larger values allow larger differences among group scatters; a value of 1 specifies the strongest restriction. With restr = "sigma" this parameter is ignored, because all cluster covariances are averaged, which always implies restr.fact = 1.

equal.weights

A logical value, specifying whether equal cluster weights (TRUE) or not (FALSE) shall be considered in the concentration and assignment steps.

center, scale

A center and a scale vector, each of length p, which can optionally be specified for centering and scaling x before calculation.

store.x

A logical value, specifying whether the data matrix x shall be included in the result structure. By default this value is set to TRUE, because functions plot.tclust and DiscrFact depend on this information. However, when big data matrices are handled, the result structure's size can be decreased noticeably when setting this parameter to FALSE.

drop.empty.clust

A logical value specifying whether empty clusters shall be omitted from the resulting object. (If so, the result structure no longer contains center and covariance estimates for empty clusters. Cluster names are reassigned such that the first l clusters (l <= k) always have at least one observation.)

trace

Defines the tracing level, which is set to 0 by default. Tracing level 2 gives additional information on the iteratively decreasing objective function's value.

warnings

The warning level (0: no warnings; 1: warnings on unexpected behavior; 2: warnings if restr.fact causes artificially restricted results).

zero.tol

The zero tolerance used. By default set to 1e-16.

Value

The function returns an S3 object of class tclust, containing the following values:

centers

A matrix of size p x k containing the centers (column-wise) of each cluster.

cov

An array of size p x p x k containing the covariance matrices of each cluster.

cluster

A numerical vector of size n containing the cluster assignment for each observation. Cluster names are integer numbers from 1 to k; 0 indicates trimmed observations.

par

A list, containing the parameters the algorithm has been called with (x, if not suppressed by store.x = FALSE, k, alpha, restr.fact, nstart, KStep, and equal.weights).

k

The (final) resulting number of clusters. Some solutions with a smaller number of clusters might be found when using the option equal.weights = FALSE.

obj

The value of the objective function of the best (returned) solution.

size

An integer vector of size k, returning the number of observations contained by each cluster.

weights

A numerical vector of length k, containing the weights of each cluster.

int

A list of values used internally by functions related to tclust objects.
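
As a rough illustration of how the components above can be read, the following sketch works on a hand-made list that only mimics the documented layout of cluster and centers. The list res and its values are hypothetical stand-ins, chosen so that the snippet runs without fitting a model; a real object would come from a call to tclust.

```r
# Hypothetical stand-in for a fitted object; a real one would come from tclust ().
res <- list (
  cluster = c (1, 2, 0, 1, 2, 2, 0, 1, 1, 2),   # 0 marks trimmed observations
  centers = matrix (c (0, 0, 5, 5), nrow = 2,   # p x k: one column per cluster
                    dimnames = list (c ("x1", "x2"), NULL))
)

trimmed   <- res$cluster == 0
n_trimmed <- sum (trimmed)                      # number of trimmed observations
sizes     <- table (res$cluster[!trimmed])      # observations per (untrimmed) cluster
center_1  <- res$centers[, 1]                   # center of cluster 1 is a column
```

Because centers are stored column-wise, res$centers[, j] (not res$centers[j, ]) gives the center of cluster j.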

Details

This iterative algorithm initializes k clusters randomly and performs "concentration steps" to improve the current cluster assignment. The maximum number of concentration steps is given by iter.max. To approximate the global optimum, the procedure is initialized nstart times, and concentration steps are performed until convergence or until iter.max is reached. More complex data sets require higher values of nstart and iter.max (at the cost of extra computation time). If more than half of the iterations fail to converge, a warning message is issued, indicating that nstart has to be increased.
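
The idea of a concentration step can be sketched in base R for the simplest case, trimmed k-means (spherical, equally weighted clusters). This is an illustrative sketch and not the package's implementation: the helper concentration_step is hypothetical, it stores centers row-wise (k x p) for convenience, it trims floor (n * alpha) observations by assumption, and it leaves out the scatter and weight updates that the general method performs.

```r
# One concentration step of trimmed k-means (hypothetical helper, not package code).
# x: n x p matrix; centers: k x p matrix (row-wise here); alpha: trimming proportion.
concentration_step <- function (x, centers, alpha) {
  # Squared Euclidean distance from every observation to every center (n x k).
  d2 <- sapply (seq_len (nrow (centers)), function (j) {
    rowSums (sweep (x, 2, centers[j, ])^2)
  })
  nearest <- max.col (-d2, ties.method = "first")     # index of the closest center
  dmin    <- d2[cbind (seq_len (nrow (x)), nearest)]  # distance to that center
  # Trim the observations lying farthest from their closest center (label 0).
  n_trim  <- floor (nrow (x) * alpha)                 # assumed rounding rule
  cluster <- nearest
  if (n_trim > 0) cluster[order (dmin, decreasing = TRUE)[seq_len (n_trim)]] <- 0
  # Recompute each center from its untrimmed members.
  for (j in seq_len (nrow (centers))) {
    if (any (cluster == j))
      centers[j, ] <- colMeans (x[cluster == j, , drop = FALSE])
  }
  list (cluster = cluster, centers = centers)
}

set.seed (1)
x <- rbind (matrix (rnorm (100, mean = 0), ncol = 2),
            matrix (rnorm (100, mean = 6), ncol = 2),
            matrix (c (30, 30, -30, 30), ncol = 2, byrow = TRUE))  # two gross outliers

fit <- list (centers = x[c (1, 51), ])   # a single start; nstart would repeat this
for (i in 1:20) {                        # 20 plays the role of iter.max
  new <- concentration_step (x, fit$centers, alpha = 0.05)
  converged <- identical (new$cluster, fit$cluster)
  fit <- new
  if (converged) break                   # same partition in two consecutive steps
}
```

In this toy data set the two gross outliers end up with label 0. A full run would repeat the loop from several random starts and keep the solution with the best objective value.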

The parameter restr defines the restrictions on cluster shape, which are applied to all clusters during each iteration. Options "eigen"/"deter" restrict the ratio between the maximum and minimum eigenvalue/determinant of all clusters' covariance structures to the parameter restr.fact. Setting restr.fact to 1 yields the strongest restriction, forcing all eigenvalues/determinants to be equal, so the method looks for spherical clusters ("eigen") or similarly scattered clusters ("deter"). Option "sigma" is a simpler restriction, which averages the covariance structures during each iteration (weighted by cluster sizes) in order to get similar (equal) cluster scatters.
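
The quantities that these restrictions refer to can be computed directly in base R. The sketch below only evaluates the ratios and the size-weighted average for two made-up covariance matrices; it does not reproduce how the package enforces a restriction, and the names covs, sizes, eigen_ok and deter_ok are illustrative assumptions.

```r
# Two made-up cluster covariance matrices and cluster sizes (illustration only).
covs  <- list (diag (c (1, 4)), diag (c (2, 16)))
sizes <- c (30, 70)

# "eigen": ratio of the largest to the smallest eigenvalue over all clusters.
ev <- unlist (lapply (covs, function (s)
  eigen (s, symmetric = TRUE, only.values = TRUE)$values))
eigen_ratio <- max (ev) / min (ev)

# "deter": ratio of the largest to the smallest determinant over all clusters.
dets <- sapply (covs, det)
deter_ratio <- max (dets) / min (dets)

restr.fact <- 12
eigen_ok <- eigen_ratio <= restr.fact   # within the "eigen" bound?
deter_ok <- deter_ratio <= restr.fact   # within the "deter" bound?

# "sigma": average of the covariance matrices, weighted by cluster sizes.
pooled <- Reduce (`+`, Map (`*`, covs, sizes)) / sum (sizes)
```

The same set of scatters can violate one bound and satisfy another, so the choice of restr changes what a given restr.fact permits.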

References

Garcia-Escudero, L.A.; Gordaliza, A.; Matran, C. and Mayo-Iscar, A. (2008), "A General Trimming Approach to Robust Cluster Analysis". Annals of Statistics, Vol. 36, 1324-1345. Technical Report available at www.eio.uva.es/inves/grupos/representaciones/trTCLUST.pdf

Fritz, H.; Garcia-Escudero, L.A.; Mayo-Iscar, A. (2012), "tclust: An R Package for a Trimming Approach to Cluster Analysis". Journal of Statistical Software, 47(12), 1-26. URL http://www.jstatsoft.org/v47/i12/

Examples

# NOT RUN {
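#--- EXAMPLE 1 ------------------------------------------
library (tclust)   # the mvtnorm package must also be installed

sig <- diag (2)
cen <- rep (1, 2)
# Two compact groups plus a small, widely scattered third group
x <- rbind (mvtnorm::rmvnorm (360, cen * 0,   sig),
            mvtnorm::rmvnorm (540, cen * 5,   sig * 6 - 2),
            mvtnorm::rmvnorm (100, cen * 2.5, sig * 50))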

# Two groups and 10% trimming level
clus <- tclust (x, k = 2, alpha = 0.1, restr.fact = 8)

plot (clus)
plot (clus, labels = "observation")
plot (clus, labels = "cluster")

# Three groups (one of them very scattered) and 0% trimming level
clus <- tclust (x, k = 3, alpha=0.0, restr.fact = 100)

plot (clus)

# }
# NOT RUN {
#--- EXAMPLE 3 ------------------------------------------
data (M5data)
x <- M5data[, 1:2]

clus.a <- tclust (x, k = 3, alpha = 0.1, restr.fact =  1,
                  restr = "eigen", equal.weights = TRUE, warnings = 1)
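# restr = "sigma" (equal scatter matrices) is assumed here so that (b) differs from (a)
clus.b <- tclust (x, k = 3, alpha = 0.1, restr.fact =  1,
                  restr = "sigma", equal.weights = TRUE, warnings = 1)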
clus.c <- tclust (x, k = 3, alpha = 0.1, restr.fact =  1,
                  restr = "deter", equal.weights = TRUE, iter.max = 100,
                  warnings = 1)
clus.d <- tclust (x, k = 3, alpha = 0.1, restr.fact = 50,
                  restr = "eigen", equal.weights = FALSE)

pa <- par (mfrow = c (2, 2))
plot (clus.a, main = "(a) tkmeans")
plot (clus.b, main = "(b) Gallegos and Ritter")
plot (clus.c, main = "(c) Gallegos")
plot (clus.d, main = "(d) tclust")
par (pa)

#--- EXAMPLE 4 ------------------------------------------
data (swissbank)
# Two clusters and 8% trimming level
clus <- tclust (swissbank, k = 2, alpha = 0.08, restr.fact = 50)

# Pairs plot of the clustering solution
pairs (swissbank, col = clus$cluster + 1)
# Two coordinates
plot (swissbank[, 4], swissbank[, 6], col = clus$cluster + 1,
      xlab = "Distance of the inner frame to lower border",
      ylab = "Length of the diagonal")
plot (clus)

# Three clusters and 0% trimming level
clus <- tclust (swissbank, k = 3, alpha = 0.0, restr.fact = 110)

# Pairs plot of the clustering solution
pairs (swissbank, col = clus$cluster + 1)

# Two coordinates
plot (swissbank[, 4], swissbank[, 6], col = clus$cluster + 1,
      xlab = "Distance of the inner frame to lower border",
      ylab = "Length of the diagonal")

plot (clus)

# }
