tkmeans: Trimmed k-means Cluster Analysis

Description

tkmeans searches for k (or fewer) spherical clusters in a data matrix x, while the ceiling (alpha n) most outlying observations are trimmed.
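As a quick illustration of the trimming count, the following base R sketch computes ceiling (alpha n) for one example. The values of n and alpha are arbitrary and are not package defaults.

```r
# Number of trimmed observations for n rows and trimming level alpha.
# n and alpha are example values chosen for illustration only.
n <- 105
alpha <- 0.1
n_trimmed <- ceiling(alpha * n)   # alpha * n = 10.5, rounded up
n_kept <- n - n_trimmed
c(trimmed = n_trimmed, kept = n_kept)
```

Because the count is rounded up, any alpha greater than 0 trims at least one observation.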

Usage

tkmeans (x, k = 3, alpha = 0.05, nstart = 50, iter.max = 20, 
         equal.weights = FALSE, center = 0, scale = 1, store.x = TRUE,
         drop.empty.clust = TRUE, trace = 0, warnings = 2, zero.tol = 1e-16)

Arguments

x

A matrix or data.frame of dimension n x p, containing the observations (row-wise).

k

The number of clusters initially searched for.

alpha

The proportion of observations to be trimmed.

nstart

The number of random initializations to be performed.

iter.max

The maximum number of concentration steps to be performed. The concentration steps stop as soon as two consecutive steps lead to the same data partition.

equal.weights

A logical value, specifying whether equal cluster weights (TRUE) or not (FALSE) shall be considered in the concentration and assignment steps.

center, scale

A center and a scale vector, each of length p, which can optionally be specified for centering and scaling x before calculation.

store.x

A logical value, specifying whether the data matrix x shall be included in the result structure. By default this value is set to TRUE, because the function plot.tkmeans depends on this information. However, when big data matrices are handled, the size of the result structure can be decreased noticeably by setting this parameter to FALSE.

drop.empty.clust

A logical value, specifying whether empty clusters shall be omitted in the resulting object. (The result structure then no longer contains center and covariance estimates of empty clusters. Cluster names are reassigned such that the first l clusters (l <= k) always have at least one observation.)

trace

Defines the tracing level, which is set to 0 by default. Tracing level 2 gives additional information on the iteratively decreasing objective function's value.

warnings

The warning level (0: no warnings; 1: warnings on unexpected behavior).

zero.tol

The zero tolerance used. By default set to 1e-16.
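The concentration steps described under iter.max can be sketched in base R as follows. This is a simplified illustration for a single random start, not the package's implementation: the function name trimmed_kmeans_sketch is hypothetical, cluster weights are ignored, and the nstart restarts and the objective function comparison are omitted.

```r
# Illustrative sketch only: one random start of a trimmed k-means iteration.
# trimmed_kmeans_sketch is a hypothetical helper, not part of any package.
trimmed_kmeans_sketch <- function(x, k, alpha, iter.max = 20) {
  x <- as.matrix(x)
  n <- nrow(x)
  n_keep <- n - ceiling(alpha * n)
  # Random initial centers drawn from the rows of x
  centers <- x[sample(n, k), , drop = FALSE]
  cluster <- rep(-1L, n)
  for (iter in seq_len(iter.max)) {
    # Squared Euclidean distance of every observation to every center (n x k)
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    nearest <- max.col(-d2, ties.method = "first")
    dmin <- d2[cbind(seq_len(n), nearest)]
    # Trim the observations that lie farthest from their nearest center
    keep <- rank(dmin, ties.method = "first") <= n_keep
    new_cluster <- ifelse(keep, nearest, 0L)
    # Stop when two consecutive steps lead to the same partition
    if (identical(new_cluster, cluster)) break
    cluster <- new_cluster
    # Recompute the center of each non-empty cluster from untrimmed members
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }
  }
  # Centers are returned column-wise (p x k); 0 marks trimmed observations
  list(centers = t(centers), cluster = cluster)
}

set.seed(1)
x_demo <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
                matrix(rnorm(100, mean = 8), ncol = 2))
res <- trimmed_kmeans_sketch(x_demo, k = 2, alpha = 0.1)
table(res$cluster)
```

A single start can end in a poor local solution, which is why several random initializations (nstart) are normally run and the best one is kept.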

Value

The function returns an S3 object of type tkmeans, containing the following values:

centers

A matrix of size p x k containing the centers (column-wise) of each cluster.

cluster

A numerical vector of size n containing the cluster assignment for each observation. Cluster names are integer numbers from 1 to k, 0 indicates trimmed observations.

par

A list, containing the parameters the algorithm has been called with (x, if not suppressed by store.x = FALSE, k, alpha, restr.fact, nstart, KStep, and equal.weights).

k

The (final) resulting number of clusters. Some solutions with a smaller number of clusters might be found when using the option equal.weights = FALSE.

obj

The value of the objective function of the best (returned) solution.

size

An integer vector of size k, giving the number of observations in each cluster.

weights

A numerical vector of length k, containing the weights of each cluster.

int

A list of values used internally by functions related to tkmeans objects.
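The layout of the centers and cluster components described above can be explored with a small mock object. The list below is a hypothetical stand-in with made-up values, not output of tkmeans; only the two fields used here are mocked.

```r
# Hypothetical stand-in for a tkmeans result (values invented for illustration)
clus <- list(
  centers = cbind(c(0, 0), c(5, 5)),     # p x k, one column per cluster
  cluster = c(1, 2, 0, 1, 2, 2, 0, 1)    # 0 marks trimmed observations
)

# Row indices of the trimmed observations
trimmed_idx <- which(clus$cluster == 0)

# Observations per cluster, excluding the trimmed ones
k_found <- ncol(clus$centers)
sizes <- table(factor(clus$cluster[clus$cluster > 0], levels = seq_len(k_found)))

# One center per row (k x p), often more convenient for printing
centers_by_row <- t(clus$centers)

# Plot colors: adding 1 maps trimmed observations (0) to color 1
point_col <- clus$cluster + 1
```

The same cluster + 1 idiom is used for the col argument in the swissbank examples.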

References

Cuesta-Albertos, J. A.; Gordaliza, A. and Matrán, C. (1997), "Trimmed k-means: an attempt to robustify quantizers". Annals of Statistics, Vol. 25 (2), 553-576.

Examples


# NOT RUN {
#--- EXAMPLE 1 ------------------------------------------
sig <- diag (2)
cen <- rep (1,2)
x <- rbind(mvtnorm::rmvnorm(360, cen * 0,   sig),
            mvtnorm::rmvnorm(540, cen * 5,   sig * 6 - 2),
            mvtnorm::rmvnorm(100, cen * 2.5, sig * 50)
            )

# Two groups and 10% trimming level
clus <- tkmeans (x, k = 2, alpha = 0.1)

plot (clus)
plot (clus, labels = "observation")
plot (clus, labels = "cluster")

#--- EXAMPLE 2 ------------------------------------------
data (geyser2)
clus <- tkmeans (geyser2, k = 3, alpha = 0.03)
plot (clus)

#--- EXAMPLE 3 ------------------------------------------
data (swissbank)
# Two clusters and 8% trimming level
clus <- tkmeans (swissbank, k = 2, alpha = 0.08)

# Pairs plot of the clustering solution
pairs (swissbank, col = clus$cluster + 1)
# Two coordinates
plot (swissbank[, 4], swissbank[, 6], col = clus$cluster + 1,
      xlab = "Distance of the inner frame to lower border",
      ylab = "Length of the diagonal")
plot (clus)

# Three clusters and 0% trimming level
clus <- tkmeans (swissbank, k = 3, alpha = 0.0)

# Pairs plot of the clustering solution
pairs (swissbank, col = clus$cluster + 1)

# Two coordinates
plot (swissbank[, 4], swissbank[, 6], col = clus$cluster + 1,
      xlab = "Distance of the inner frame to lower border",
      ylab = "Length of the diagonal")

plot (clus)

# }
