Computes a number of distance-based statistics, which can be used for cluster validation, comparison between clusterings and decisions about the number of clusters: cluster sizes, cluster diameters, average distances within and between clusters, cluster separation, the widest within-cluster gap, average silhouette widths, the Calinski and Harabasz index, a Pearson version of Hubert's gamma coefficient, the Dunn index and two indexes that assess the similarity of two clusterings, namely the corrected Rand index and Meila's VI.

```
cluster.stats(d = NULL, clustering, alt.clustering = NULL,
noisecluster=FALSE,
silhouette = TRUE, G2 = FALSE, G3 = FALSE,
wgap=TRUE, sepindex=TRUE, sepprob=0.1,
sepwithnoise=TRUE,
compareonly = FALSE,
aggregateonly = FALSE)
```

d

a distance object (as generated by `dist`) or a distance matrix between cases.

clustering

an integer vector of length equal to the number of cases, which indicates a clustering. The clusters have to be numbered from 1 to the number of clusters.

alt.clustering

an integer vector such as for `clustering`, indicating an alternative clustering. If provided, the corrected Rand index and Meila's VI for `clustering` vs. `alt.clustering` are computed.

noisecluster

logical. If `TRUE`, it is assumed that the largest cluster number in `clustering` denotes a 'noise class', i.e. points that do not belong to any cluster. These points are not taken into account for the computation of all functions of within and between cluster distances, including the validation indexes.

silhouette

logical. If `TRUE`, the silhouette statistics are computed, which requires package `cluster`.

G2

logical. If `TRUE`, Goodman and Kruskal's index G2 (cf. Gordon (1999), p. 62) is computed. This involves a lot of sorting and can be very slow (it has been improved by R. Francois - thanks!).

G3

logical. If `TRUE`, the index G3 (cf. Gordon (1999), p. 62) is computed. This executes `sort` on all distances and can be extremely slow.

wgap

logical. If `TRUE`, the widest within-cluster gaps (largest link in the within-cluster minimum spanning tree) are computed. This is used for finding a good number of clusters in Hennig (2013).

sepindex

logical. If `TRUE`, a separation index is computed, based on the distance of every point to the closest point not in the same cluster. The separation index is then the mean of the smallest proportion `sepprob` of these distances. This formalises separation in a way that is less sensitive to a single or a few ambiguous points. (The output component corresponding to this is `sindex`, not `separation`!) This is used for finding a good number of clusters in Hennig (2013).

sepprob

numeric between 0 and 1, see `sepindex`.

sepwithnoise

logical. If `TRUE` and `sepindex` and `noisecluster` are both `TRUE`, the noise points are incorporated as a cluster in the separation index (`sepindex`) computation. They are also taken into account in the computation of the minimum cluster separation.

compareonly

logical. If `TRUE`, only the corrected Rand index and Meila's VI are computed and returned (this requires `alt.clustering` to be specified).

aggregateonly

logical. If `TRUE` (and not `compareonly`), only aggregated (not clusterwise) information is returned (this reduces the size of the output somewhat).
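For intuition, the separation index controlled by `sepindex` and `sepprob` can be sketched in base R. This is a simplified illustration of the idea described above, not the package's actual implementation; `sep_index` is a hypothetical helper name:

```r
# Sketch of the separation index idea: for every point, the distance to
# the closest point not in the same cluster; the index is the mean of the
# smallest proportion 'sepprob' of these distances.
sep_index <- function(d, clustering, sepprob = 0.1) {
  m <- as.matrix(d)
  n <- nrow(m)
  nearest_other <- sapply(seq_len(n), function(i)
    min(m[i, clustering != clustering[i]]))
  k <- max(1, round(sepprob * n))
  mean(sort(nearest_other)[seq_len(k)])
}

# Two well-separated pairs on the real line:
sep_index(dist(c(0, 1, 10, 11)), c(1, 1, 2, 2), sepprob = 0.5)
```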

`cluster.stats` returns a list containing the components
```
n, cluster.number, cluster.size, min.cluster.size, noisen,
diameter, average.distance, median.distance, separation,
average.toother, separation.matrix, ave.between.matrix,
average.between, average.within, n.between, n.within,
max.diameter, min.separation, within.cluster.ss,
clus.avg.silwidths, avg.silwidth, g2, g3, pearsongamma,
dunn, dunn2, entropy, wb.ratio, ch, cwidegap, widestgap,
sindex, corrected.rand, vi
```

except if `compareonly=TRUE`, in which case only the last two components are computed.

n

number of cases.

cluster.number

number of clusters.

cluster.size

vector of cluster sizes (number of points).

min.cluster.size

size of smallest cluster.

noisen

number of noise points, see argument `noisecluster` (`noisen=0` if `noisecluster=FALSE`).

diameter

vector of cluster diameters (maximum within cluster distances).

average.distance

vector of clusterwise within cluster average distances.

median.distance

vector of clusterwise within cluster distance medians.

separation

vector of clusterwise minimum distances of a point in the cluster to a point of another cluster.

average.toother

vector of clusterwise average distances of a point in the cluster to the points of other clusters.

separation.matrix

matrix of separation values between all pairs of clusters.

ave.between.matrix

matrix of mean dissimilarities between points of every pair of clusters.

average.between

average distance between clusters.

average.within

average distance within clusters (reweighted so that every observation, rather than every distance, has the same weight).

n.between

number of distances between clusters.

n.within

number of distances within clusters.

max.diameter

maximum cluster diameter.

min.separation

minimum cluster separation.

within.cluster.ss

a generalisation of the within clusters sum of squares (the k-means objective function), which is obtained if `d` is a Euclidean distance matrix. For general distance measures, this is half the sum of the within cluster squared dissimilarities divided by the cluster size.
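The generalised within-cluster sum of squares described above can be written as a short base-R sketch (an illustration of the formula, not the package's code; `within_ss` is a hypothetical name). For a Euclidean distance matrix it reproduces the k-means objective:

```r
# Half the sum of within-cluster squared dissimilarities, divided by the
# cluster size, summed over clusters. The full square submatrix counts
# each pair twice, hence the division by 2 * cluster size.
within_ss <- function(d, clustering) {
  m <- as.matrix(d)
  sum(sapply(unique(clustering), function(cl) {
    idx <- clustering == cl
    sum(m[idx, idx]^2) / (2 * sum(idx))
  }))
}

# For Euclidean distances this equals sum of squared deviations from
# the cluster means: clusters {0,2} and {10,14} give 2 + 8 = 10.
within_ss(dist(c(0, 2, 10, 14)), c(1, 1, 2, 2))
```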

clus.avg.silwidths

vector of cluster average silhouette widths. See `silhouette`.

avg.silwidth

average silhouette width. See `silhouette`.

g2

Goodman and Kruskal's Gamma coefficient. See Milligan and Cooper (1985), Gordon (1999, p. 62).

g3

G3 coefficient. See Gordon (1999, p. 62).

pearsongamma

correlation between distances and a 0-1-vector where 0 means same cluster, 1 means different clusters. "Normalized gamma" in Halkidi et al. (2001).
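This correlation can be sketched directly in base R (an illustration of the definition, not fpc's implementation; `pearson_gamma` is a hypothetical name):

```r
# Pearson correlation between the vector of pairwise distances and a
# 0/1 vector that is 0 for pairs in the same cluster, 1 otherwise.
pearson_gamma <- function(d, clustering) {
  dvec <- as.vector(as.dist(d))
  different <- outer(clustering, clustering, "!=")  # TRUE if clusters differ
  y <- as.vector(as.dist(different * 1))
  cor(dvec, y)
}

# Well-separated clusters give a value close to 1:
pearson_gamma(dist(c(0, 1, 10, 11)), c(1, 1, 2, 2))
```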

dunn

minimum separation / maximum diameter. Dunn index, see Halkidi et al. (2001).
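The Dunn index above has a direct base-R transcription (an illustrative sketch; `dunn_index` is a hypothetical name, not part of the package):

```r
# Dunn index: smallest between-cluster distance divided by the largest
# within-cluster diameter.
dunn_index <- function(d, clustering) {
  m <- as.matrix(d)
  cls <- unique(clustering)
  seps <- sapply(cls, function(a)
    min(m[clustering == a, clustering != a]))
  diams <- sapply(cls, function(cl)
    max(m[clustering == cl, clustering == cl]))
  min(seps) / max(diams)
}

# Separation 9, diameter 1:
dunn_index(dist(c(0, 1, 10, 11)), c(1, 1, 2, 2))
```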

dunn2

minimum average dissimilarity between two clusters / maximum average within cluster dissimilarity, another version of the family of Dunn indexes.

entropy

entropy of the distribution of cluster memberships, see Meila (2007).
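The membership entropy is simply the Shannon entropy of the cluster proportions; a minimal base-R sketch (hypothetical helper name, for illustration only):

```r
# Shannon entropy (natural log) of the cluster membership proportions.
clustering_entropy <- function(clustering) {
  p <- prop.table(table(clustering))
  -sum(p * log(p))
}

# Two equally sized clusters give log(2):
clustering_entropy(rep(1:2, each = 5))
```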

wb.ratio

`average.within/average.between`.

ch

Calinski and Harabasz index (Calinski and Harabasz 1974, optimal in Milligan and Cooper 1985; generalised for dissimilarities in Hennig and Liao 2013).

cwidegap

vector of widest within-cluster gaps.

widestgap

widest within-cluster gap.

sindex

separation index, see argument `sepindex`.

corrected.rand

corrected Rand index (if `alt.clustering` has been specified), see Gordon (1999, p. 198).
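The corrected (adjusted) Rand index can be computed from the contingency table of the two clusterings; the following base-R sketch illustrates the standard formula (hypothetical helper name, not fpc's implementation):

```r
# Adjusted Rand index: (agreement - expected agreement) /
# (maximum agreement - expected agreement), via pair counts.
corrected_rand <- function(c1, c2) {
  tab <- table(c1, c2)
  n <- sum(tab)
  a <- sum(choose(tab, 2))             # pairs agreeing in both clusterings
  b1 <- sum(choose(rowSums(tab), 2))   # pairs together in c1
  b2 <- sum(choose(colSums(tab), 2))   # pairs together in c2
  e <- b1 * b2 / choose(n, 2)          # expected agreement under chance
  (a - e) / ((b1 + b2) / 2 - e)
}

# Identical clusterings (even with permuted labels) give 1:
corrected_rand(c(1, 1, 2, 2), c(2, 2, 1, 1))
```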

vi

variation of information (VI) index (if `alt.clustering` has been specified), see Meila (2007).

Calinski, T., and Harabasz, J. (1974) A Dendrite Method for Cluster
Analysis, *Communications in Statistics*, 3, 1-27.

Gordon, A. D. (1999) *Classification*, 2nd ed. Chapman and Hall.

Halkidi, M., Batistakis, Y., Vazirgiannis, M. (2001) On Clustering
Validation Techniques, *Journal of Intelligent Information
Systems*, 17, 107-145.

Hennig, C. and Liao, T. (2013) How to find an appropriate clustering
for mixed-type variables with application to socio-economic
stratification, *Journal of the Royal Statistical Society, Series
C Applied Statistics*, 62, 309-369.

Hennig, C. (2013) How many bee species? A case study in determining the number of clusters. In: M. Spiliopoulou, L. Schmidt-Thieme, R. Janning (eds.): "Data Analysis, Machine Learning and Knowledge Discovery", Springer, Berlin, 41-49.

Kaufman, L. and Rousseeuw, P.J. (1990). "Finding Groups in Data: An Introduction to Cluster Analysis". Wiley, New York.

Meila, M. (2007) Comparing clusterings - an information based distance,
*Journal of Multivariate Analysis*, 98, 873-895.

Milligan, G. W. and Cooper, M. C. (1985) An examination of procedures
for determining the number of clusters. *Psychometrika*, 50, 159-179.

`cqcluster.stats` is a more sophisticated version of `cluster.stats` with more options.

`silhouette`, `dist`, `calinhara`, `distcritmulti`.

`clusterboot` computes clusterwise stability statistics by resampling.

```
set.seed(20000)
options(digits=3)
face <- rFace(200,dMoNo=2,dNoEy=0,p=2)
dface <- dist(face)
complete3 <- cutree(hclust(dface),3)
cluster.stats(dface,complete3,
              alt.clustering=as.integer(attr(face,"grouping")))
```