tclust: TCLUST method for robust clustering

Description

This function searches for k (or less) clusters with different covariance structures in a data matrix x. Relative cluster scatter can be restricted when restr="eigen" by constraining the ratio between the largest and the smallest of the scatter matrices eigenvalues by a constant value restr.fact. Relative cluster scatters can be also restricted with restr="deter" by constraining the ratio between the largest and the smallest of the scatter matrices' determinants.

For robustifying the estimation, a proportion alpha of observations is trimmed. In particular, the trimmed k-means method is represented by the tclust() method, by setting parameters restr.fact=1, opt="HARD" and equal.weights=TRUE.

Usage

tclust(
  x,
  k,
  alpha = 0.05,
  nstart = 500,
  niter1 = 3,
  niter2 = 20,
  nkeep = 5,
  iter.max,
  equal.weights = FALSE,
  restr = c("eigen", "deter"),
  restr.fact = 12,
  cshape = 1e+10,
  opt = c("HARD", "MIXT"),
  center = FALSE,
  scale = FALSE,
  store_x = TRUE,
  parallel = FALSE,
  n.cores = -1,
  zero_tol = 1e-16,
  drop.empty.clust = TRUE,
  trace = 0
)

Value

The function returns the following values:

cluster - A numerical vector of size n containing the cluster assignment for each observation. Cluster names are integer numbers from 1 to k, 0 indicates trimmed observations. Note that it could be empty clusters with no observations when equal.weights=FALSE.
obj - The value of the objective function of the best (returned) solution.
NlogL - A value related to the classification log-likelihood of the best (returned) solution. If opt=="HARD", NlogL = -2*obj.
size - An integer vector of size k, returning the number of observations contained by each cluster.
weights - Vector of Cluster weights
centers - A matrix of size p x k containing the centers (column-wise) of each cluster.
cov - An array of size p x p x k containing the covariance matrices of each cluster.
code - A numerical value indicating if the concentration steps have converged for the returned solution (2).
posterior - A matrix with k columns that contains the posterior probabilities of membership of each observation (row-wise) to the k clusters. This posterior probabilities are 0-1 values in the opt="HARD" case. Trimmed observations have 0 membership probabilities to all clusters.
MIXMIX - BIC which based on the parameters estimated through the mixture log-likelihood and the maximized mixture likelihood as goodness of fit measure. This output is present only if opt="MIXT".
MIXMIX - BIC which uses the classification likelihood based on parameters estimated through the mixture likelihood (In some books this quantity is called ICL). This output is present only if opt="MIXT".
CLACLA - BIC which uses the classification likelihood based on parameters estimated using the classification likelihood. This output is present only if opt="HARD".
cluster.ini - A matrix with nstart rows and number of columns equal to the number of observations and where each row shows the final clustering assignments (0 for trimmed observations) obtained after the niter1 iteration of the nstart random initializations.
obj.ini - A numerical vector of length nstart containing the values of the target function obtained after the niter1 iteration of the nstart random initializations.
x - The input data set.
k - The input number of clusters.
alpha - The input trimming level.

Arguments

x: A matrix or data.frame of dimension n x p, containing the observations (row-wise).
k: The number of clusters initially searched for.
alpha: The proportion of observations to be trimmed.
nstart: The number of random initializations to be performed.
niter1: The number of concentration steps to be performed for the nstart initializations.
niter2: The maximum number of concentration steps to be performed for the nkeep solutions kept for further iteration. The concentration steps are stopped, whenever two consecutive steps lead to the same data partition.
nkeep: The number of iterated initializations (after niter1 concentration steps) with the best values in the target function that are kept for further iterations
iter.max: (deprecated, use the combination nkeep, niter1 and niter2) The maximum number of concentration steps to be performed. The concentration steps are stopped, whenever two consecutive steps lead to the same data partition.
equal.weights: A logical value, specifying whether equal cluster weights shall be considered in the concentration and assignment steps.
restr: Restriction type to control relative cluster scatters. The default value is restr="eigen", so that the maximal ratio between the largest and the smallest of the scatter matrices eigenvalues is constrained to be smaller then or equal to restr.fact (Garcia-Escudero, Gordaliza, Matran, and Mayo-Iscar, 2008). Alternatively, restr="deter" imposes that the maximal ratio between the largest and the smallest of the scatter matrices determinants is smaller or equal than restr.fact (see Garcia-Escudero, Mayo-Iscar and Riani, 2020)
restr.fact: The constant restr.fact >= 1 constrains the allowed differences among group scatters in terms of eigenvalues ratio (if restr="eigen") or determinant ratios (if restr="deter"). Larger values imply larger differences of group scatters, a value of 1 specifies the strongest restriction.
cshape: constraint to apply to the shape matrices, cshape >= 1, (see Garcia-Escudero, Mayo-Iscar and Riani, 2020)). This options only works if restr=='deter'. In this case the default value is cshape=1e10 to ensure the procedure is (virtually) affine equivariant. On the other hand, cshape values close to 1 would force the clusters to be almost spherical (without necessarily the same scatters if restr.fact is strictly greater than 1).
opt: Define the target function to be optimized. A classification likelihood target function is considered if opt="HARD" and a mixture classification likelihood if opt="MIXT".
center: Optional centering of the data: a function or a vector of length p which can optionally be specified for centering x before calculation
scale: Optional scaling of the data: a function or a vector of length p which can optionally be specified for scaling x before calculation
store_x: A logical value, specifying whether the data matrix x shall be included in the result object. By default this value is set to TRUE, because some of the plotting functions depend on this information. However, when big data matrices are handled, the result object's size can be decreased noticeably when setting this parameter to FALSE.
parallel: A logical value, specifying whether the nstart initializations should be done in parallel.
n.cores: The number of cores to use when paralellizing, only taken into account if parallel=TRUE.
zero_tol: The zero tolerance used. By default set to 1e-16.
drop.empty.clust: Logical value specifying, whether empty clusters shall be omitted in the resulting object. (The result structure does not contain center and covariance estimates of empty clusters anymore. Cluster names are reassigned such that the first l clusters (l <= k) always have at least one observation.
trace: Defines the tracing level, which is set to 0 by default. Tracing level 1 gives additional information on the stage of the iterative process.

Author

Javier Crespo Guerrero, Luis Angel Garcia Escudero, Agustin Mayo Iscar.

Details

The procedure allows to deal with robust clustering with an alpha proportion of trimming level and searching for k clusters. We are considering classification trimmed likelihood when using opt=”HARD” so that “hard” or “crisp” clustering assignments are done. On the other hand, mixture trimmed likelihood are applied when using opt=”MIXT” so providing a kind of clusters “posterior” probabilities for the observations. Relative cluster scatter can be restricted when restr="eigen" by constraining the ratio between the largest and the smallest of the scatter matrices eigenvalues by a constant value restr.fact. Setting restr.fact=1, yields the strongest restriction, forcing all clusters to be spherical and equally scattered. Relative cluster scatters can be also restricted with restr="deter" by constraining the ratio between the largest and the smallest of the scatter matrices' determinants.

This iterative algorithm performs "concentration steps" to improve the current cluster assignments. For approximately obtaining the global optimum, the procedure is randomly initialized nstart times and niter1 concentration steps are performed for them. The nkeep most “promising” iterations, i.e. the nkeep iterated solutions with the initial best values for the target function, are then iterated until convergence or until niter2 concentration steps are done.

The parameter restr.fact defines the cluster scatter matrices restrictions, which are applied on all clusters during each concentration step. It restricts the ratio between the maximum and minimum eigenvalue of all clusters' covariance structures to that parameter. Setting restr.fact=1, yields the strongest restriction, forcing all clusters to be spherical and equally scattered.

Cluster components with similar sizes are favoured when considering equal.weights=TRUE while equal.weights=FALSE admits possible different prior probabilities for the components and it can easily return empty clusters when the number of clusters is greater than apparently needed.

References

Fritz, H.; Garcia-Escudero, L.A.; Mayo-Iscar, A. (2012), "tclust: An R Package for a Trimming Approach to Cluster Analysis". Journal of Statistical Software, 47(12), 1-26. URL http://www.jstatsoft.org/v47/i12/

Garcia-Escudero, L.A.; Gordaliza, A.; Matran, C. and Mayo-Iscar, A. (2008), "A General Trimming Approach to Robust Cluster Analysis". Annals of Statistics, Vol.36, 1324--1345.

García-Escudero, L. A., Gordaliza, A. and Mayo-Íscar, A. (2014). A constrained robust proposal for mixture modeling avoiding spurious solutions. Advances in Data Analysis and Classification, 27--43.

García-Escudero, L. A., and Mayo-Íscar, A. and Riani, M. (2020). Model-based clustering with determinant-and-shape constraint. Statistics and Computing, 30, 1363--1380.]

Examples

Run this code


 # \dontshow{
     set.seed (0)
 # }
 ##--- EXAMPLE 1 ------------------------------------------
 sig <- diag(2)
 cen <- rep(1,2)
 x <- rbind(MASS::mvrnorm(360, cen * 0,   sig),
            MASS::mvrnorm(540, cen * 5,   sig * 6 - 2),
            MASS::mvrnorm(100, cen * 2.5, sig * 50))
 
 ## Two groups and 10\% trimming level
 clus <- tclust(x, k = 2, alpha = 0.1, restr.fact = 8)
 
 plot(clus)
 plot(clus, labels = "observation")
 plot(clus, labels = "cluster")
 
 ## Three groups (one of them very scattered) and 0\% trimming level
 clus <- tclust(x, k = 3, alpha=0.0, restr.fact = 100)
 
 plot(clus)
 
 ##--- EXAMPLE 2 ------------------------------------------
 data(geyser2)
 (clus <- tclust(geyser2, k = 3, alpha = 0.03))
 
 plot(clus)
 
# \donttest{

 ##--- EXAMPLE 3 ------------------------------------------
 data(M5data)
 x <- M5data[, 1:2]
 
 clus.a <- tclust(x, k = 3, alpha = 0.1, restr.fact =  1,
                   restr = "eigen", equal.weights = TRUE)
 clus.b <- tclust(x, k = 3, alpha = 0.1, restr.fact =  50,
                    restr = "eigen", equal.weights = FALSE)
 clus.c <- tclust(x, k = 3, alpha = 0.1, restr.fact =  1,
                   restr = "deter", equal.weights = TRUE)
 clus.d <- tclust(x, k = 3, alpha = 0.1, restr.fact = 50,
                   restr = "deter", equal.weights = FALSE)
 
 pa <- par(mfrow = c (2, 2))
 plot(clus.a, main = "(a)")
 plot(clus.b, main = "(b)")
 plot(clus.c, main = "(c)")
 plot(clus.d, main = "(d)")
 par(pa)
 
 ##--- EXAMPLE 4 ------------------------------------------

 data (swissbank)
 ## Two clusters and 8\% trimming level
 (clus <- tclust(swissbank, k = 2, alpha = 0.08, restr.fact = 50))
 
 ## Pairs plot of the clustering solution
 pairs(swissbank, col = clus$cluster + 1)
 ## Two coordinates
 plot(swissbank[, 4], swissbank[, 6], col = clus$cluster + 1,
      xlab = "Distance of the inner frame to lower border",
      ylab = "Length of the diagonal")
 plot(clus)
 
 ## Three clusters and 0\% trimming level
 clus<- tclust(swissbank, k = 3, alpha = 0.0, restr.fact = 110)
 
 ## Pairs plot of the clustering solution
 pairs(swissbank, col = clus$cluster + 1)
 
 ## Two coordinates
 plot(swissbank[, 4], swissbank[, 6], col = clus$cluster + 1, 
       xlab = "Distance of the inner frame to lower border", 
       ylab = "Length of the diagonal")
 
 plot(clus)
 
 ##--- EXAMPLE 5 ------------------------------------------
  data(M5data)
  x <- M5data[, 1:2]
  
  ## Classification trimmed likelihood approach
  clus.a <- tclust(x, k = 3, alpha = 0.1, restr.fact =  50,
                     opt="HARD", restr = "eigen", equal.weights = FALSE)
 ## Mixture trimmed likelihood approach
  clus.b <- tclust(x, k = 3, alpha = 0.1, restr.fact =  50,
                     opt="MIXT", restr = "eigen", equal.weights = FALSE)
 
 ## Hard 0-1 cluster assignment (all 0 if trimmed unit)
 head(clus.a$posterior)
 
 ## Posterior probabilities cluster assignment for the
 ##  mixture approach (all 0 if trimmed unit)
 head(clus.b$posterior)
 
# }

Run the code above in your browser using DataLab