cluster.stats: Cluster validation statistics

Description

Computes a number of distance-based statistics that can be used for cluster validation, for comparing clusterings and for deciding on the number of clusters: cluster sizes, cluster diameters, average distances within and between clusters, cluster separation, biggest within-cluster gap, average silhouette widths, the Calinski and Harabasz index, a Pearson version of Hubert's gamma coefficient, the Dunn index, and two indexes to assess the similarity of two clusterings, namely the corrected Rand index and Meila's VI.

Usage

cluster.stats(d = NULL, clustering, alt.clustering = NULL,
              noisecluster = FALSE,
              silhouette = TRUE, G2 = FALSE, G3 = FALSE,
              wgap = TRUE, sepindex = TRUE, sepprob = 0.1,
              sepwithnoise = TRUE,
              compareonly = FALSE,
              aggregateonly = FALSE)

Arguments

d

a distance object (as generated by dist) or a distance matrix between cases.

clustering

an integer vector whose length equals the number of cases, which indicates a clustering. The clusters have to be numbered from 1 to the number of clusters.

alt.clustering

an integer vector such as for clustering, indicating an alternative clustering. If provided, the corrected Rand index and Meila's VI for clustering vs. alt.clustering are computed.

noisecluster

logical. If TRUE, it is assumed that the largest cluster number in clustering denotes a 'noise class', i.e. points that do not belong to any cluster. These points are not taken into account for the computation of all

silhouette

logical. If TRUE, the silhouette statistics are computed, which requires package cluster.

G2

logical. If TRUE, Goodman and Kruskal's index G2 (cf. Gordon (1999), p. 62) is computed. This performs many sort operations and can be very slow (it has been improved by R. Francois - thanks!).

G3

logical. If TRUE, the index G3 (cf. Gordon (1999), p. 62) is computed. This executes sort on all distances and can be extremely slow.

wgap

logical. If TRUE, the widest within-cluster gaps (largest link in within-cluster minimum spanning tree) are computed. This is used for finding a good number of clusters in Hennig (2013).

sepindex

logical. If TRUE, a separation index is computed, defined based on the distances for every point to the closest point not in the same cluster. The separation index is then the mean of the smallest proportion sepprob o

sepprob

numerical between 0 and 1, see sepindex.

sepwithnoise

logical. If TRUE and sepindex and noisecluster are both TRUE, the noise points are incorporated as cluster in the separation index computation. Also they are taken into account for the comput

compareonly

logical. If TRUE, only the corrected Rand index and Meila's VI are computed and returned (this requires alt.clustering to be specified).

aggregateonly

logical. If TRUE (and compareonly is FALSE), only aggregated information is returned, without the clusterwise components (this reduces the size of the output a bit).
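
The clustering, sepindex and sepprob arguments can be illustrated with a minimal base-R sketch. It renumbers arbitrary labels to 1..k and then computes a separation index in the spirit of the description above, taken here to mean the mean of the smallest proportion sepprob of the nearest-other-cluster distances. The data, the names raw_labels, nearest_other and sindex_sketch, and the rounding rule are assumptions made for illustration; the package's own computation may differ in detail.

```r
set.seed(1)
# Hypothetical data: two well separated groups of 20 points each.
x <- rbind(matrix(rnorm(40), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))
raw_labels <- rep(c("left", "right"), each = 20)

# Renumber arbitrary labels to integers 1..k, as the clustering argument requires.
clustering <- as.integer(factor(raw_labels))

sepprob <- 0.1
dm <- as.matrix(dist(x))

# For every point, the distance to the closest point in a different cluster.
nearest_other <- sapply(seq_len(nrow(dm)), function(i) {
  min(dm[i, clustering != clustering[i]])
})

# Mean of the smallest proportion sepprob of these distances
# (the rounding rule used here is an assumption).
m <- max(1, floor(sepprob * length(nearest_other)))
sindex_sketch <- mean(sort(nearest_other)[seq_len(m)])
sindex_sketch
```

Averaging over a small proportion of points, rather than taking only the single minimum, makes the value less sensitive to one pair of unusually close points.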

Value

cluster.stats returns a list containing the components n, cluster.number, cluster.size, min.cluster.size, noisen, diameter, average.distance, median.distance, separation, average.toother, separation.matrix, average.between, average.within, n.between, n.within, within.cluster.ss, clus.avg.silwidths, avg.silwidth, g2, g3, pearsongamma, dunn, entropy, wb.ratio, ch, corrected.rand, vi except if compareonly=TRUE, in which case only the last two components are computed.

n

number of cases.

cluster.number

number of clusters.

cluster.size

vector of cluster sizes (number of points).

min.cluster.size

size of smallest cluster.

noisen

number of noise points, see argument noisecluster (noisen=0 if noisecluster=FALSE).

diameter

vector of cluster diameters (maximum within-cluster distances).

average.distance

vector of clusterwise within-cluster average distances.

median.distance

vector of clusterwise within-cluster distance medians.

separation

vector of clusterwise minimum distances of a point in the cluster to a point of another cluster.

average.toother

vector of clusterwise average distances of a point in the cluster to the points of other clusters.

separation.matrix

matrix of separation values between all pairs of clusters.

ave.between.matrix

matrix of mean dissimilarities between points of every pair of clusters.

average.between

average distance between clusters.

average.within

average distance within clusters.

n.between

number of distances between clusters.

n.within

number of distances within clusters.

max.diameter

maximum cluster diameter.

min.separation

minimum cluster separation.

within.cluster.ss

a generalisation of the within-clusters sum of squares (k-means objective function), which is obtained if d is a Euclidean distance matrix. For general distance measures, this is half the sum of the within-cluster squared dissimilarities divided by the cluster size.

clus.avg.silwidths

vector of cluster average silhouette widths. See silhouette.

avg.silwidth

average silhouette width. See silhouette.

g2

Goodman and Kruskal's Gamma coefficient. See Milligan and Cooper (1985), Gordon (1999, p. 62).

g3

G3 coefficient. See Gordon (1999, p. 62).

pearsongamma

correlation between distances and a 0-1-vector where 0 means same cluster, 1 means different clusters. "Normalized gamma" in Halkidi et al. (2001).

dunn

minimum separation / maximum diameter. Dunn index, see Halkidi et al. (2001).

dunn2

minimum average dissimilarity between two clusters / maximum average within-cluster dissimilarity, another version of the family of Dunn indexes.

entropy

entropy of the distribution of cluster memberships, see Meila (2007).

wb.ratio

average.within/average.between.

ch

Calinski and Harabasz index (Calinski and Harabasz 1974, optimal in Milligan and Cooper 1985; generalised for dissimilarities in Hennig and Liao 2010).

cwidegap

vector of widest within-cluster gaps.

widestgap

widest within-cluster gap.

sindex

separation index, see argument sepindex.

corrected.rand

corrected Rand index (if alt.clustering has been specified), see Gordon (1999, p. 198).

vi

variation of information (VI) index (if alt.clustering has been specified), see Meila (2007).
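
To make some of these definitions concrete, the following base-R sketch recomputes dunn, pearsongamma and within.cluster.ss directly from a distance object. It is an illustration under stated assumptions, not the package's implementation: the data and all names ending in _sketch are hypothetical, and "half the sum of the within-cluster squared dissimilarities" is read here as a sum over the full within-cluster distance matrix (each pair counted twice), which is the reading that reproduces the k-means objective for Euclidean distances.

```r
set.seed(1)
# Hypothetical data: two well separated groups of 20 points each.
x <- rbind(matrix(rnorm(40), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))
clustering <- rep(1:2, each = 20)

d <- dist(x)
dm <- as.matrix(d)
dvec <- as.vector(d)  # pairwise distances, lower triangle in column order

# TRUE where both points of a pair share a cluster, in the same order as dvec.
same_mat <- outer(clustering, clustering, "==")
same <- same_mat[lower.tri(same_mat)]

# dunn: minimum separation / maximum diameter.
dunn_sketch <- min(dvec[!same]) / max(dvec[same])

# pearsongamma: correlation between the distances and a 0-1 vector
# (0 = same cluster, 1 = different clusters).
pearsongamma_sketch <- cor(dvec, as.numeric(!same))

# within.cluster.ss: per cluster, half the sum of squared dissimilarities
# over the full within-cluster matrix, divided by the cluster size.
within_ss_sketch <- sum(sapply(unique(clustering), function(k) {
  idx <- clustering == k
  sum(dm[idx, idx]^2) / (2 * sum(idx))
}))

# For Euclidean distances this matches the sum of squares around the centroids.
direct_ss <- sum(sapply(unique(clustering), function(k) {
  xk <- x[clustering == k, , drop = FALSE]
  sum(sweep(xk, 2, colMeans(xk))^2)
}))
all.equal(within_ss_sketch, direct_ss)
```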

References

Calinski, T., and Harabasz, J. (1974) A Dendrite Method for Cluster Analysis, Communications in Statistics, 3, 1-27.

Gordon, A. D. (1999) Classification, 2nd ed. Chapman and Hall.

Halkidi, M., Batistakis, Y., Vazirgiannis, M. (2001) On Clustering Validation Techniques, Journal of Intelligent Information Systems, 17, 107-145.

Hennig, C. and Liao, T. (2010) Comparing latent class and dissimilarity based clustering for mixed type variables with application to social stratification. Research report no. 308, Department of Statistical Science, UCL. http://www.ucl.ac.uk/Stats/research/reports/psfiles/rr308.pdf. Revised version accepted for publication by Journal of the Royal Statistical Society Series C.

Hennig, C. (2013) How many bee species? A case study in determining the number of clusters. To appear in Proceedings of GfKl-2012, Hildesheim.

Kaufman, L. and Rousseeuw, P. J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.

Meila, M. (2007) Comparing clusterings - an information based distance, Journal of Multivariate Analysis, 98, 873-895.

Milligan, G. W. and Cooper, M. C. (1985) An examination of procedures for determining the number of clusters. Psychometrika, 50, 159-179.

Examples


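library(fpc)  # provides rFace and cluster.stats

set.seed(20000)
# Simulate 200 points from the two-dimensional "face" benchmark data
face <- rFace(200, dMoNo = 2, dNoEy = 0, p = 2)
dface <- dist(face)
# Complete-linkage hierarchical clustering, cut into 3 clusters
complete3 <- cutree(hclust(dface), 3)
# Validation statistics, compared against the grouping stored with the data
cluster.stats(dface, complete3,
              alt.clustering = as.integer(attr(face, "grouping")))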
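
The comparison of two clusterings through alt.clustering can also be sketched in base R from a contingency table. The block below uses the standard pair-counting formula for the corrected (adjusted) Rand index and the entropy-based definition of the variation of information with natural logarithms. The function names and the toy label vectors are hypothetical, and the values may differ from the package's output if it uses other conventions, for example a different logarithm base.

```r
# Sum over counts of the number of point pairs each count contains.
comb2 <- function(v) sum(v * (v - 1) / 2)

corrected_rand_sketch <- function(c1, c2) {
  tab <- table(c1, c2)
  n <- sum(tab)
  sum_ij <- comb2(tab)            # pairs together in both clusterings
  sum_a <- comb2(rowSums(tab))    # pairs together in c1
  sum_b <- comb2(colSums(tab))    # pairs together in c2
  expected <- sum_a * sum_b / (n * (n - 1) / 2)
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}

entropy_sketch <- function(p) {
  p <- p[p > 0]                   # empty cells contribute nothing
  -sum(p * log(p))
}

vi_sketch <- function(c1, c2) {
  p12 <- table(c1, c2) / length(c1)
  # VI = 2 H(C, C') - H(C) - H(C')
  2 * entropy_sketch(p12) -
    entropy_sketch(rowSums(p12)) - entropy_sketch(colSums(p12))
}

a <- c(1, 1, 1, 2, 2, 2, 3, 3)
b <- c(1, 1, 2, 2, 2, 3, 3, 3)
corrected_rand_sketch(a, a)  # identical clusterings
vi_sketch(a, a)
corrected_rand_sketch(a, b)  # partially agreeing clusterings
vi_sketch(a, b)
```

For identical clusterings the corrected Rand index reaches its maximum of 1 and the VI is 0; larger VI values indicate clusterings that are further apart.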
