Usage:

cluster.stats(d = NULL, clustering, alt.clustering = NULL,
noisecluster=FALSE,
silhouette = TRUE, G2 = FALSE, G3 = FALSE,
wgap=TRUE, sepindex=TRUE, sepprob=0.1,
sepwithnoise=TRUE,
compareonly = FALSE,
aggregateonly = FALSE)
Arguments:

d: a distance object (as generated by dist) or a distance matrix between cases.

clustering: an integer vector of length equal to the number of cases, which indicates a clustering. The clusters have to be numbered from 1 to the number of clusters.

alt.clustering: an integer vector such as for clustering, indicating an alternative clustering. If provided, the corrected Rand index and Meila's VI for clustering vs. alt.clustering are computed.

noisecluster: logical. If TRUE, it is assumed that the largest cluster number in clustering denotes a 'noise class', i.e. points that do not belong to any cluster. These points are not taken into account for the computation of the statistics based on within- and between-cluster distances.

silhouette: logical. If TRUE, the silhouette statistics are computed, which requires package cluster.

G2: logical. If TRUE, Goodman and Kruskal's index G2 (cf. Gordon (1999), p. 62) is computed. This executes lots of sorting algorithms and can be very slow (it has been improved by R. Francois - thanks!).

G3: logical. If TRUE, the index G3 (cf. Gordon (1999), p. 62) is computed. This executes sort on all distances and can be extremely slow.

wgap: logical. If TRUE, the widest within-cluster gaps (largest link in the within-cluster minimum spanning tree) are computed. This is used for finding a good number of clusters in Hennig (2013).

sepindex: logical. If TRUE, a separation index is computed, based on the distance of every point to the closest point not in the same cluster. The separation index is then the mean of the smallest proportion sepprob of these distances (see the sketch following this list).

sepprob: numerical between 0 and 1, see sepindex.

sepwithnoise: logical. If TRUE and sepindex and noisecluster are both TRUE, the noise points are incorporated as a cluster in the separation index computation, and they are taken into account for the computation of the minimum cluster separation.

compareonly: logical. If TRUE, only the corrected Rand index and Meila's VI are computed and given out (this requires alt.clustering to be specified).

aggregateonly: logical. If TRUE (and not compareonly), no clusterwise but only aggregated information is given out (this cuts the size of the output down a bit).
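To make the sepindex definition concrete, here is a minimal hand-rolled sketch, not fpc's internal implementation; the names sep.index and dmat are assumptions, with dmat a full symmetric distance matrix such as as.matrix(dist(x)) and clustering containing at least two clusters:

# Sketch of the separation index described under 'sepindex'
# (illustration only, not fpc's internal code)
sep.index <- function(dmat, clustering, sepprob = 0.1) {
  n <- nrow(dmat)
  # distance of every point to the closest point not in its own cluster
  mindist <- sapply(seq_len(n), function(i)
    min(dmat[i, clustering != clustering[i]]))
  # mean of the smallest proportion 'sepprob' of these distances
  k <- max(1, round(sepprob * n))
  mean(sort(mindist)[seq_len(k)])
}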
Value:

cluster.stats returns a list containing the components n, cluster.number, cluster.size, min.cluster.size, noisen, diameter, average.distance, median.distance, separation, average.toother, separation.matrix, average.between, average.within, n.between, n.within, within.cluster.ss, clus.avg.silwidths, avg.silwidth, g2, g3, pearsongamma, dunn, entropy, wb.ratio, ch, corrected.rand and vi, except if compareonly=TRUE, in which case only the last two components are computed.

noisen: number of noise points, see argument noisecluster (noisen=0 if noisecluster=FALSE).

within.cluster.ss: a generalisation of the within clusters sum of squares (the k-means objective function), which is obtained if d
is a Euclidean distance matrix. For general distance
measures, this is half
the sum of the within cluster squared dissimilarities divided by the
cluster size.

clus.avg.silwidths: vector of cluster average silhouette widths, see silhouette.

avg.silwidth: average silhouette width, see silhouette.

wb.ratio: average.within/average.between.

sepindex: separation index, see argument sepindex.

corrected.rand: corrected Rand index (only if alt.clustering has been specified), see Gordon (1999, p. 198).

vi: variation of information (VI) index (only if alt.clustering has been specified), see Meila (2007).
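As a cross-check of the within.cluster.ss definition above, a minimal sketch under stated assumptions, not fpc's internal code; the names within.ss and dmat are hypothetical, with dmat a full symmetric dissimilarity matrix, e.g. as.matrix(d):

# Sketch of the generalised within-cluster sum of squares
within.ss <- function(dmat, clustering) {
  sum(sapply(unique(clustering), function(cl) {
    ink <- clustering == cl
    # the full matrix counts every pair twice, hence the factor 2;
    # for Euclidean d this reproduces the k-means within-cluster SS
    sum(dmat[ink, ink]^2) / (2 * sum(ink))
  }))
}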
References:

Gordon, A. D. (1999) Classification, 2nd ed. Chapman and Hall.
Halkidi, M., Batistakis, Y., Vazirgiannis, M. (2001) On Clustering Validation Techniques, Journal of Intelligent Information Systems, 17, 107-145.
Hennig, C. and Liao, T. (2010) Comparing latent class and
dissimilarity based clustering for mixed type variables with
application to social stratification. Research report no. 308,
Department of Statistical Science, UCL.
Hennig, C. (2013) How many bee species? A case study in determining the number of clusters. To appear in Proceedings of GfKl-2012, Hildesheim.
Kaufman, L. and Rousseeuw, P.J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
Meila, M. (2007) Comparing clusterings - an information based distance, Journal of Multivariate Analysis, 98, 873-895.
Milligan, G. W. and Cooper, M. C. (1985) An examination of procedures for determining the number of clusters. Psychometrika, 50, 159-179.
See Also:

silhouette, dist, calinhara, distcritmulti.

clusterboot computes clusterwise stability statistics by resampling.

Examples:

set.seed(20000)
face <- rFace(200,dMoNo=2,dNoEy=0,p=2)
dface <- dist(face)
complete3 <- cutree(hclust(dface),3)
cluster.stats(dface,complete3,
alt.clustering=as.integer(attr(face,"grouping")))
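If only the comparison indexes are needed, compareonly=TRUE restricts the output to the corrected Rand index and Meila's VI (alt.clustering must then be specified); for example, with the objects from above:

# corrected Rand index and Meila's VI only
cluster.stats(dface, complete3,
              alt.clustering=as.integer(attr(face,"grouping")),
              compareonly=TRUE)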