`clustatsum` computes cluster validation statistics by running `cqcluster.stats`, and potentially `distrsimilarity`, and collecting some key statistics values with a somewhat different nomenclature.

This was implemented as a helper function for use inside of `clusterbenchstats` and `cgrestandard`.

```
clustatsum(datadist=NULL,clustering,noisecluster=FALSE,
datanp=NULL,npstats=FALSE,useboot=FALSE,
bootclassif=NULL,
bootmethod="nselectboot",
bootruns=25, cbmethod=NULL,methodpars=NULL,
distmethod=NULL,dnnk=2,
pamcrit=TRUE,...)
```

datadist

distances on which validation measures are based: a `dist` object or a distance matrix. If `NULL`, this is computed from `datanp`; at least one of `datadist` and `datanp` must be specified.

clustering

an integer vector whose length equals the number of cases, indicating a clustering. The clusters have to be numbered from 1 to the number of clusters.

noisecluster

logical. If `TRUE`, it is assumed that the largest cluster number in `clustering` denotes a 'noise class', i.e., points that do not belong to any cluster. These points are not taken into account for the computation of all functions of within- and between-cluster distances, including the validation indexes.

datanp

optional observations times variables data matrix, see `npstats`.

npstats

logical. If `TRUE`, `distrsimilarity` is called and the two statistics computed there are added to the output. These are based on `datanp` and require `datanp` to be specified.

useboot

logical. If `TRUE`, a stability index (either `nselectboot` or `prediction.strength`) will be involved.

bootclassif

if `useboot=TRUE`, a string indicating the classification method to be used with the stability index; see the `classification` argument of `nselectboot` and `prediction.strength`.

bootmethod

either `"nselectboot"` or `"prediction.strength"`; the stability index to be used if `useboot=TRUE`.

bootruns

integer. Number of resampling runs. If `useboot=TRUE`, passed on as `B` to `nselectboot` or as `M` to `prediction.strength`.

cbmethod

CBI-function (see `kmeansCBI`); clustering method to be used for stability assessment if `useboot=TRUE`.

methodpars

parameters to be passed on to `cbmethod`.

distmethod

logical. If `useboot=TRUE`, indicates whether `cbmethod` will interpret the data as distances.

dnnk

`nnk`-argument to be passed on to `distrsimilarity`.

pamcrit

`pamcrit`-argument to be passed on to `cqcluster.stats`.

...

further arguments to be passed on to `cqcluster.stats`.

`clustatsum` returns a list. The components, as listed below, are outputs of `summary.cquality` with default parameters, which means that they are partly transformed versions of those given out by `cqcluster.stats`, i.e., their range is between 0 and 1 and large values are good. Those from `distrsimilarity` are computed with `largeisgood=TRUE`, correspondingly.

- average distance within clusters (reweighted so that every observation, rather than every distance, has the same weight).
- average distance to the `nnk`th nearest neighbour within cluster.
- coefficient of variation of dissimilarities to the `nnk`th nearest within-cluster neighbour, measuring uniformity of within-cluster densities, weighted over all clusters; see Sec. 3.7 of Hennig (2019).
- maximum cluster diameter.
- widest within-cluster gap or average of cluster-wise widest within-cluster gaps, depending on the parameter `averagegap`.
- separation index, see argument `sepindex`.
- minimum cluster separation.
- average silhouette width; see `silhouette`.
- an index measuring to what extent the density decreases from the cluster mode to the outskirts; I-densdec in Sec. 3.6 of Hennig (2019); low values are good.
- an index measuring whether cluster boundaries run through density valleys; I-densbound in Sec. 3.6 of Hennig (2019); low values are good.
- a measure of whether there is a large within-cluster gap with high density on both sides; I-highdgap in Sec. 3.6 of Hennig (2019); low values are good.
- correlation between distances and a 0-1 vector, where 0 means same cluster and 1 means different clusters; "normalized gamma" in Halkidi et al. (2001).
- a generalisation of the within-cluster sum of squares (the k-means objective function), which is obtained if `d` is a Euclidean distance matrix. For general distance measures, this is half the sum of the within-cluster squared dissimilarities divided by the cluster size.
- entropy of the distribution of cluster memberships, see Meila (2007).
- average distance to cluster centroid.
- Kolmogorov distance between the distribution of within-cluster Mahalanobis distances and an appropriate chi-squared distribution, aggregated over clusters (I am grateful to Agustin Mayo-Iscar for the idea).
- Kolmogorov distance between the distribution of distances to the `nnk`th nearest within-cluster neighbour and an appropriate Gamma distribution, see Byers and Raftery (1998), aggregated over clusters.
- if `useboot=TRUE`, the stability value: `stabk` for method `nselectboot`, `mean.pred` for method `prediction.strength`.

Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster
validity indexes for context-adapted comparison of clusterings.
*Statistics and Computing*, 30, 1523-1544,
https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822

Byers, S. and Raftery, A. E. (1998) Nearest-Neighbor Clutter Removal for Estimating Features in Spatial Point Processes, *Journal of the American Statistical Association*, 93, 577-584.

Halkidi, M., Batistakis, Y., Vazirgiannis, M. (2001) On Clustering
Validation Techniques, *Journal of Intelligent Information
Systems*, 17, 107-145.

Hennig, C. (2019) Cluster validation by measurement of clustering
characteristics relevant to the user. In C. H. Skiadas (ed.)
*Data Analysis and Applications 1: Clustering and Regression,
Modeling-estimating, Forecasting and Data Mining, Volume 2*, Wiley,
New York 1-24,
https://arxiv.org/abs/1703.09282

Kaufman, L. and Rousseeuw, P.J. (1990) *Finding Groups in Data: An Introduction to Cluster Analysis*. Wiley, New York.

Meila, M. (2007) Comparing clusterings - an information based distance,
*Journal of Multivariate Analysis*, 98, 873-895.

```
set.seed(20000)
options(digits=3)
face <- rFace(20,dMoNo=2,dNoEy=0,p=2)
dface <- dist(face)
complete3 <- cutree(hclust(dface),3)
clustatsum(dface,complete3)
```
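As a hedged follow-up to the example above (again requiring `fpc`): since the result is a list of scalar values in the range 0 to 1, it can be flattened for a quick overview.

```
library(fpc)

set.seed(20000)
face <- rFace(20, dMoNo = 2, dNoEy = 0, p = 2)
dface <- dist(face)
complete3 <- cutree(hclust(dface), 3)

# Flatten the result list into a named numeric vector for easy comparison
round(unlist(clustatsum(dface, complete3)), 3)
```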