This function runs the methodology explained in Hennig (2019) and
Akhanli and Hennig (2020). It runs a user-specified set of clustering
methods (CBI-functions, see `kmeansCBI`) with several numbers of
clusters on a dataset, and computes many cluster validation indexes.
To explore the variation of these indexes, random clusterings on the
data are generated, and the validation indexes are standardised by use
of the random clusterings in order to make them comparable and
differences between values interpretable.

The function `print.valstat` can be used to provide weights for the
cluster validation statistics, and will then compute a weighted
validation index that can be used to compare all clusterings.
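
As a rough illustration of what a weighted validation index amounts to, the base R sketch below aggregates a small matrix of standardised statistics by a weighted mean. The matrix `std_stats`, its row and column names and the weights are invented for illustration, and the weighted-mean formula is an assumption; see `?print.valstat` for how the aggregation is actually computed.

```r
# Hypothetical standardised validation statistics: rows are clusterings,
# columns are statistics (larger values are assumed to be better).
std_stats <- rbind(
  method_a = c(stat_1 = 0.5, stat_2 = 1.0, stat_3 = 0.0),
  method_b = c(stat_1 = 1.5, stat_2 = 0.0, stat_3 = 2.0)
)

# One weight per statistic; a weight of 0 leaves that statistic out.
weights <- c(stat_1 = 1, stat_2 = 0, stat_3 = 1)

# Weighted mean of the statistics for every clustering (assumed formula).
aggregate_index <- as.vector(std_stats %*% weights) / sum(weights)
names(aggregate_index) <- rownames(std_stats)
print(aggregate_index)
```

With these invented numbers, `method_b` gets the larger aggregated value because it does better on the two statistics that carry weight.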

See the examples for how to get the indexes A1 and A2 from Akhanli and Hennig (2020).

```
clusterbenchstats(data, G, diss = inherits(data, "dist"),
                  scaling = TRUE, clustermethod,
                  methodnames = clustermethod,
                  distmethod = rep(TRUE, length(clustermethod)),
                  ncinput = rep(TRUE, length(clustermethod)),
                  clustermethodpars,
                  npstats = FALSE,
                  useboot = FALSE,
                  bootclassif = NULL,
                  bootmethod = "nselectboot",
                  bootruns = 25,
                  trace = TRUE,
                  pamcrit = TRUE, snnk = 2,
                  dnnk = 2,
                  nnruns = 100, kmruns = 100, fnruns = 100, avenruns = 100,
                  multicore = FALSE, cores = detectCores() - 1,
                  useallmethods = TRUE,
                  useallg = FALSE, ...)

# S3 method for clusterbenchstats
print(x, ...)
```

data

data matrix or `dist`-object.

G

vector of integers. Numbers of clusters to consider.

diss

logical. If `TRUE`, the data matrix is assumed to be a
distance/dissimilarity matrix, otherwise it is observations times
variables.

scaling

either a logical or a numeric vector of length equal to the number of
columns of `data`. If `FALSE`, data won't be scaled, otherwise
`scaling` is passed on to `scale` as argument `scale`.
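
The effect of the two kinds of `scaling` values can be tried out directly with base R's `scale`. This is only a sketch on an invented matrix `x`, and it assumes that `scale` is called with its default centring, which the text above does not state.

```r
# Invented data: two variables on very different scales.
x <- cbind(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))

# scaling = TRUE: every column is divided by its standard deviation.
scaled_default <- scale(x, scale = TRUE)

# scaling as a numeric vector: one divisor per column.
scaled_custom <- scale(x, scale = c(1, 10))

print(apply(scaled_default, 2, sd))
print(scaled_custom)
```

With the numeric vector, column `a` is only centred and column `b` is centred and divided by 10, so both end up on the same scale.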

clustermethod

vector of strings specifying names of CBI-functions (see
`kmeansCBI`). These are the clustering methods to be applied.

methodnames

vector of strings with user-chosen names for clustering methods, one
for every method in `clustermethod`. These can be used to distinguish
different methods run by the same CBI-function but with different
parameter values, such as complete and average linkage for
`hclustCBI`.

distmethod

vector of logicals, of the same length as `clustermethod`. `TRUE`
means that the clustering method operates on distances. If
`diss=TRUE`, all entries have to be `TRUE`. Otherwise, if an entry is
`TRUE`, the corresponding method will be applied on `dist(data)`.
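
The difference between a data matrix and a `dist`-object, which drives the default `diss = inherits(data, "dist")` and the use of `dist(data)` for distance-based methods, can be checked in base R. The matrix `x` below is invented for illustration.

```r
# Invented data matrix: three observations, two variables.
x <- matrix(c(0, 0,
              3, 4,
              6, 8), ncol = 2, byrow = TRUE)

# Euclidean distances between the rows, as a "dist" object.
d <- dist(x)

inherits(x, "dist")  # FALSE: a data matrix, so diss defaults to FALSE
inherits(d, "dist")  # TRUE: a dist-object, so diss defaults to TRUE

print(as.matrix(d))
```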

ncinput

vector of logicals, of the same length as `clustermethod`. `TRUE`
indicates that the corresponding clustering method requires the
number of clusters as input and will not estimate the number of
clusters itself. Only methods for which this is `TRUE` can be used
with `useboot=TRUE`.

clustermethodpars

list of the same length as `clustermethod`. Specifies parameters for
all involved clustering methods. Its jth entry is passed to
clustering method number j. An entry can be empty if all defaults are
used for a clustering method. However, the last entry is not allowed
to be empty (you may just set a parameter of the last clustering
method to its default value if you don't want to specify anything
else). The number of clusters does not need to be specified here.
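
A parameter list of this shape can be built with plain R list operations. The sketch below follows the pattern used in the examples of this page: the first method keeps all its defaults (empty entry), and the last entry carries one parameter so that it is not empty.

```r
clustermethod <- c("kmeansCBI", "hclustCBI")

# Start with an empty list; assigning to position 2 leaves entry 1 empty.
clustermethodpars <- list()
clustermethodpars[[2]] <- list(method = "average")

# The list must have one entry per clustering method.
stopifnot(length(clustermethodpars) == length(clustermethod))
str(clustermethodpars)
```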

npstats

logical. If `TRUE`, `distrsimilarity` is called and the two validity
statistics computed there are added. These require `diss=FALSE`.

useboot

logical. If `TRUE`, a stability index (either `nselectboot` or
`prediction.strength`) is included.

bootclassif

If `useboot=TRUE`, a vector of strings indicating the classification
methods to be used with the stability index for the different methods
indicated in `clustermethod`, see the `classification` argument of
`nselectboot` and `prediction.strength`.

bootmethod

either `"nselectboot"` or `"prediction.strength"`; stability index to
be used if `useboot=TRUE`.

bootruns

integer. Number of resampling runs. If `useboot=TRUE`, passed on as
`B` to `nselectboot` or `M` to `prediction.strength`. Note that these
are applied to all `kmruns+nnruns+avenruns+fnruns` random clusterings
on top of the regular ones, which may take a lot of time if
`bootruns` and these values are chosen large.

trace

logical. If `TRUE`, some runtime information is printed.

pamcrit

logical. If `TRUE`, the average distance of points to their
respective cluster centroids is computed (criterion of the PAM
clustering method, validation criterion `pamc`); centroids are chosen
so that they minimise this criterion for the given clustering. Passed
on to `cqcluster.stats`.

snnk

integer. Number of neighbours used in coefficient of variation of
distance to nearest within cluster neighbour, the `cvnnd`-statistic
(clusters with `snnk` or fewer points are ignored for this). Passed
on to `cqcluster.stats` as argument `nnk`.

dnnk

integer. Number of nearest neighbours to use for dissimilarity to the
uniform in case that `npstats=TRUE`; `nnk`-argument to be passed on
to `distrsimilarity`.

nnruns

integer. Number of runs of `stupidknn` (random clusterings). With
`useboot=TRUE` one may want to choose this lower than the default for
reasons of computation time.

kmruns

integer. Number of runs of `stupidkcentroids` (random clusterings).
With `useboot=TRUE` one may want to choose this lower than the
default for reasons of computation time.

fnruns

integer. Number of runs of `stupidkfn` (random clusterings). With
`useboot=TRUE` one may want to choose this lower than the default for
reasons of computation time.

avenruns

integer. Number of runs of `stupidkaven` (random clusterings). With
`useboot=TRUE` one may want to choose this lower than the default for
reasons of computation time.

multicore

logical. If `TRUE`, parallel computing is used through the function
`mclapply` from package `parallel`; read warnings there if you intend
to use this; it won't work on Windows.
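
For readers unfamiliar with `mclapply`, the following minimal sketch shows how it is called on its own. It is not taken from `clusterbenchstats`; the squaring task and the variable names are invented, and the fallback to one core on Windows is a choice made for this sketch so that it runs everywhere.

```r
library(parallel)

# mclapply relies on forking, which is unavailable on Windows, so the
# sketch falls back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Apply a function to every element, spreading the work over n_cores.
squares <- mclapply(1:4, function(i) i^2, mc.cores = n_cores)
print(unlist(squares))
```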

cores

integer. Number of cores for parallelisation.

useallmethods

logical, to be passed on to `cgrestandard`. If `FALSE`, only random
clustering results are used for standardisation. If `TRUE`,
clustering results from all methods are used.

useallg

logical, to be passed on to `cgrestandard`. If `TRUE`,
standardisation uses results from all numbers of clusters in `G`. If
`FALSE`, standardisation of results for a specific number of clusters
only uses results from that number of clusters.

...

further arguments to be passed on to `cqcluster.stats` through
`clustatsum` (no effect in `print.clusterbenchstats`).

x

object of class `"clusterbenchstats"`.

The output of `clusterbenchstats` is a big list of lists comprising
the lists `cm`, `stat`, `sim`, `qstat` and `sstat`.

cm

output object of `cluster.magazine`, see there for details.
Clusterings of all methods and numbers of clusters on the dataset
`data`.

stat

object of class `"valstat"`, see `valstat.object` for details.
Unstandardised cluster validation statistics.

sim

output object of `randomclustersim`, see there. Validity indexes from
random clusterings used for standardisation of validation statistics
on `data`.

qstat

object of class `"valstat"`, see `valstat.object` for details.
Cluster validation statistics standardised by random clusterings,
output of `cgrestandard` based on percentages, i.e., with
`percentage=TRUE`.

sstat

object of class `"valstat"`, see `valstat.object` for details.
Cluster validation statistics standardised by random clusterings,
output of `cgrestandard` based on mean and standard deviation (called
Z-score standardisation in Akhanli and Hennig (2020)), i.e., with
`percentage=FALSE`.
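
The two kinds of standardisation can be illustrated with a toy calculation in base R. This is a sketch under assumptions, not the code in `cgrestandard`: the values are invented, the Z-score uses the mean and standard deviation of the random-clustering values only, and the percentage is taken here as the share of random-clustering values below the observed one, which may differ in detail from what `cgrestandard` computes.

```r
# Invented values of one validation statistic on five random clusterings.
random_values <- c(0.20, 0.35, 0.40, 0.50, 0.55)

# Invented value of the same statistic for a clustering of interest.
observed <- 0.45

# Z-score style: distance from the random-clustering mean in units of
# the random-clustering standard deviation.
z_score <- (observed - mean(random_values)) / sd(random_values)

# Percentage style (assumed form): share of random values below the
# observed value.
share_below <- mean(random_values < observed)

print(c(z_score = z_score, share_below = share_below))
```

Both versions put different validation statistics on a common footing, which is what makes weighted aggregation across statistics meaningful.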

Hennig, C. (2019) Cluster validation by measurement of clustering
characteristics relevant to the user. In C. H. Skiadas (ed.)
*Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2*,
Wiley, New York, 1-24,
https://arxiv.org/abs/1703.09282

Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster
validity indexes for context-adapted comparison of clusterings.
*Statistics and Computing*, 30, 1523-1544,
https://link.springer.com/article/10.1007/s11222-020-09958-2,
https://arxiv.org/abs/2002.01822

`valstat.object`, `cluster.magazine`, `kmeansCBI`, `cqcluster.stats`,
`clustatsum`, `cgrestandard`

```
# NOT RUN {
set.seed(20000)
options(digits=3)
face <- rFace(10,dMoNo=2,dNoEy=0,p=2)
clustermethod=c("kmeansCBI","hclustCBI")
# A clustering method can be used more than once, with different
# parameters
clustermethodpars <- list()
clustermethodpars[[2]] <- list()
clustermethodpars[[2]]$method <- "average"
# Last element of clustermethodpars needs to have an entry!
methodname <- c("kmeans","average")
cbs <- clusterbenchstats(face,G=2:3,clustermethod=clustermethod,
  methodname=methodname,distmethod=rep(FALSE,2),
  clustermethodpars=clustermethodpars,nnruns=1,kmruns=1,fnruns=1,avenruns=1)
print(cbs)
print(cbs$qstat,aggregate=TRUE,weights=c(1,0,0,0,0,1,0,1,0,1,0,1,0,0,1,1))
# The weights are weights for the validation statistics ordered as in
# cbs$qstat$statistics for computation of an aggregated index, see
# ?print.valstat.

# Now using bootstrap stability assessment as in Akhanli and Hennig (2020):
bootclassif <- c("centroid","averagedist")
cbsboot <- clusterbenchstats(face,G=2:3,clustermethod=clustermethod,
  methodname=methodname,distmethod=rep(FALSE,2),
  clustermethodpars=clustermethodpars,
  useboot=TRUE,bootclassif=bootclassif,bootmethod="nselectboot",
  bootruns=2,nnruns=1,kmruns=1,fnruns=1,avenruns=1,useallg=TRUE)
print(cbsboot)
# }
# NOT RUN {
# Index A1 in Akhanli and Hennig (2020) (need these weights choices):
print(cbsboot$sstat,aggregate=TRUE,weights=c(1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0))
# Index A2 in Akhanli and Hennig (2020) (need these weights choices):
print(cbsboot$sstat,aggregate=TRUE,weights=c(0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0))
# }
# NOT RUN {
# Results from nselectboot:
plot(cbsboot$stat,cbsboot$sim,statistic="boot")
# }
```