clusterbenchstats: Run and validate many clusterings

Description

This runs the methodology explained in Hennig (2019), Akhanli and Hennig (2020). It runs a user-specified set of clustering methods (CBI-functions, see kmeansCBI) with several numbers of clusters on a dataset, and computes many cluster validation indexes. In order to explore the variation of these indexes, random clusterings on the data are generated, and validation indexes are standardised by use of the random clusterings in order to make them comparable and differences between values interpretable.

The function print.valstat can be used to provide weights for the cluster validation statistics, and will then compute a weighted validation index that can be used to compare all clusterings.

See the examples for how to get the indexes A1 and A2 from Akhanli and Hennig (2020).

Usage

clusterbenchstats(data,G,diss = inherits(data, "dist"),
                                  scaling=TRUE, clustermethod,
                                  methodnames=clustermethod,
                              distmethod=rep(TRUE,length(clustermethod)),
                              ncinput=rep(TRUE,length(clustermethod)),
                              clustermethodpars,
                              npstats=FALSE,
                              useboot=FALSE,
                              bootclassif=NULL,
                              bootmethod="nselectboot",
                              bootruns=25,
                              trace=TRUE,
                              pamcrit=TRUE,snnk=2,
                              dnnk=2,
                              nnruns=100,kmruns=100,fnruns=100,avenruns=100,
                              multicore=FALSE,cores=detectCores()-1,
                              useallmethods=TRUE,
                              useallg=FALSE,...)
# S3 method for clusterbenchstats
print(x,...)

Value

The output of clusterbenchstats is a big list of lists comprising lists cm, stat, sim, qstat, sstat

cm: output object of cluster.magazine, see there for details. Clustering of all methods and numbers of clusters on the dataset data.

stat: object of class "valstat", see valstat.object for details. Unstandardised cluster validation statistics.
sim: output object of randomclustersim, see there. validity indexes from random clusterings used for standardisation of validation statistics on data.
qstat: object of class "valstat", see valstat.object for details. Cluster validation statistics standardised by random clusterings, output of cgrestandard based on percentages, i.e., with percentage=TRUE.
sstat: object of class "valstat", see valstat.object for details. Cluster validation statistics standardised by random clusterings, output of cgrestandard based on mean and standard deviation (called Z-score standardisation in Akhanli and Hennig (2020), i.e., with percentage=FALSE.

Arguments

data: data matrix or dist-object.
G: vector of integers. Numbers of clusters to consider.
diss: logical. If TRUE, the data matrix is assumed to be a distance/dissimilariy matrix, otherwise it's observations times variables.
scaling: either a logical or a numeric vector of length equal to the number of columns of data. If FALSE, data won't be scaled, otherwise scaling is passed on to scale as argumentscale.
clustermethod: vector of strings specifying names of CBI-functions (see kmeansCBI). These are the clustering methods to be applied.
methodnames: vector of strings with user-chosen names for clustering methods, one for every method in clustermethod. These can be used to distinguish different methods run by the same CBI-function but with different parameter values such as complete and average linkage for hclustCBI.
distmethod: vector of logicals, of the same length as clustermethod. TRUE means that the clustering method operates on distances. If diss=TRUE, all entries have to be TRUE. Otherwise, if an entry is true, the corresponding method will be applied on dist(data).
ncinput: vector of logicals, of the same length as clustermethod. TRUE indicates that the corresponding clustering method requires the number of clusters as input and will not estimate the number of clusters itself. Only methods for which this is TRUE can be used with useboot=TRUE.
clustermethodpars: list of the same length as clustermethod. Specifies parameters for all involved clustering methods. Its jth entry is passed to clustermethod number k. Can be an empty entry in case all defaults are used for a clustering method. However, the last entry is not allowed to be empty (you may just set a parameter of the last clustering method to its default value if you don't want to specify anything else)! The number of clusters does not need to be specified here.
npstats: logical. If TRUE, distrsimilarity is called and the two validity statistics computed there are added. These require diss=FALSE.
useboot: logical. If TRUE, a stability index (either nselectboot or prediction.strength) will be involved.
bootclassif: If useboot=TRUE, a vector of strings indicating the classification methods to be used with the stability index for the different methods indicated in clustermethods, see the classification argument of nselectboot and prediction.strength.
bootmethod: either "nselectboot" or "prediction.strength"; stability index to be used if useboot=TRUE.
bootruns: integer. Number of resampling runs. If useboot=TRUE, passed on as B to nselectboot or M to prediction.strength. Note that these are applied to all kmruns+nnruns+avenruns+fnruns random clusterings on top of the regular ones, which may take a lot of time if bootruns and these values are chosen large.
trace: logical. If TRUE, some runtime information is printed.
pamcrit: logical. If TRUE, the average distance of points to their respective cluster centroids is computed (criterion of the PAM clustering method, validation criterion pamc); centroids are chosen so that they minimise this criterion for the given clustering. Passed on to cqcluster.stats.
snnk: integer. Number of neighbours used in coefficient of variation of distance to nearest within cluster neighbour, the cvnnd-statistic (clusters with snnk or fewer points are ignored for this). Passed on to cqcluster.stats as argument nnk.
dnnk: integer. Number of nearest neighbors to use for dissimilarity to the uniform in case that npstats=TRUE; nnk-argument to be passed on to distrsimilarity.
nnruns: integer. Number of runs of stupidknn (random clusterings). With useboot=TRUE one may want to choose this lower than the default for reasons of computation time.
kmruns: integer. Number of runs of stupidkcentroids (random clusterings). With useboot=TRUE one may want to choose this lower than the default for reasons of computation time.
fnruns: integer. Number of runs of stupidkfn (random clusterings). With useboot=TRUE one may want to choose this lower than the default for reasons of computation time.
avenruns: integer. Number of runs of stupidkaven (random clusterings). With useboot=TRUE one may want to choose this lower than the default for reasons of computation time.
multicore: logical. If TRUE, parallel computing is used through the function mclapply from package parallel; read warnings there if you intend to use this; it won't work on Windows.
cores: integer. Number of cores for parallelisation.
useallmethods: logical, to be passed on to cgrestandard. If FALSE, only random clustering results are used for standardisation. If TRUE, clustering results from all methods are used.
useallg: logical to be passed on to cgrestandard. If TRUE, standardisation uses results from all numbers of clusters in G. If FALSE, standardisation of results for a specific number of cluster only uses results from that number of clusters.
...: further arguments to be passed on to cqcluster.stats through clustatsum (no effect in print.clusterbenchstats).
x: object of class "clusterbenchstats".

Author

Christian Hennig christian.hennig@unibo.it https://www.unibo.it/sitoweb/christian.hennig/en/

References

Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York 1-24, https://arxiv.org/abs/1703.09282

Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822

Examples

Run this code

  
  set.seed(20000)
  options(digits=3)
  face <- rFace(10,dMoNo=2,dNoEy=0,p=2)
  clustermethod=c("kmeansCBI","hclustCBI")
# A clustering method can be used more than once, with different
# parameters
  clustermethodpars <- list()
  clustermethodpars[[2]] <- list()
  clustermethodpars[[2]]$method <- "average"
# Last element of clustermethodpars needs to have an entry!
  methodname <- c("kmeans","average")
  cbs <-  clusterbenchstats(face,G=2:3,clustermethod=clustermethod,
    methodname=methodname,distmethod=rep(FALSE,2),
    clustermethodpars=clustermethodpars,nnruns=1,kmruns=1,fnruns=1,avenruns=1)
  print(cbs)
  print(cbs$qstat,aggregate=TRUE,weights=c(1,0,0,0,0,1,0,1,0,1,0,1,0,0,1,1))
# The weights are weights for the validation statistics ordered as in
# cbs$qstat$statistics for computation of an aggregated index, see
# ?print.valstat.

# Now using bootstrap stability assessment as in Akhanli and Hennig (2020):
  bootclassif <- c("centroid","averagedist")
  cbsboot <- clusterbenchstats(face,G=2:3,clustermethod=clustermethod,
    methodname=methodname,distmethod=rep(FALSE,2),
    clustermethodpars=clustermethodpars,
    useboot=TRUE,bootclassif=bootclassif,bootmethod="nselectboot",
    bootruns=2,nnruns=1,kmruns=1,fnruns=1,avenruns=1,useallg=TRUE)
  print(cbsboot)
if (FALSE) {
# Index A1 in Akhanli and Hennig (2020) (need these weights choices):
  print(cbsboot$sstat,aggregate=TRUE,weights=c(1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0))
# Index A2 in Akhanli and Hennig (2020) (need these weights choices):
  print(cbsboot$sstat,aggregate=TRUE,weights=c(0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0))
}

# Results from nselectboot:
  plot(cbsboot$stat,cbsboot$sim,statistic="boot")

Run the code above in your browser using DataLab