silhouette
Compute or Extract Silhouette Information from Clustering
Compute silhouette information according to a given clustering in $k$ clusters.
- Keywords
- cluster
Usage
silhouette(x, ...)
## S3 method for class 'default':
silhouette(x, dist, dmatrix, ...)
## S3 method for class 'partition':
silhouette(x, ...)
## S3 method for class 'clara':
silhouette(x, full = FALSE, ...)

sortSilhouette(object, ...)
## S3 method for class 'silhouette':
summary(object, FUN = mean, ...)
## S3 method for class 'silhouette':
plot(x, nmax.lab = 40, max.strlen = 5,
main = NULL, sub = NULL, xlab = expression("Silhouette width "* s[i]),
col = "gray", do.col.sort = length(col) > 1, border = 0,
cex.names = par("cex.axis"), do.n.k = TRUE, do.clus.stat = TRUE, ...)
Arguments
- x
- an object of appropriate class; for the default method, an integer vector with $k$ different integer cluster codes, or a list with such an x$clustering component. Note that silhouette statistics are only defined if $2 \le k \le n-1$.
- dist
- a dissimilarity object inheriting from class dist or coercible to one. If not specified, dmatrix must be.
- dmatrix
- a symmetric dissimilarity matrix ($n \times n$), specified instead of dist, which can be more efficient.
- full
- logical specifying if a full silhouette should be computed for a clara object. Note that this requires $O(n^2)$ memory, since the full dissimilarity (see daisy) is needed internally.
- object
- an object of class silhouette.
- ...
- further arguments passed to and from methods.
- FUN
- function used to summarize silhouette widths.
- nmax.lab
- integer indicating the number of labels which is considered too large for single-name labeling the silhouette plot.
- max.strlen
- positive integer giving the length to which strings are truncated in silhouette plot labeling.
- main, sub, xlab
- arguments to title; have a sensible non-NULL default here.
- col, border, cex.names
- arguments passed to barplot(); note that the default used to be col = heat.colors(n), border = par("fg") instead. col can also be a color vector of length $k$ for clusterwise coloring.
- do.col.sort
- logical indicating if the colors col should be sorted along the silhouette; this is useful for casewise or clusterwise coloring.
- do.n.k
- logical indicating if $n$ and $k$ title text should be written.
- do.clus.stat
- logical indicating if cluster sizes and averages should be written to the right of the silhouettes.
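The dist and dmatrix arguments are two ways of handing the same dissimilarities to the default method. The following is a minimal sketch, assuming the cluster package is installed; the object names cl, d, s1 and s2 are only illustrative.

```r
library(cluster)
data(ruspini)

cl <- pam(ruspini, 4)$clustering              # integer cluster codes
d  <- dist(ruspini)                           # a "dist" object

s1 <- silhouette(cl, dist = d)                # dissimilarities as "dist"
s2 <- silhouette(cl, dmatrix = as.matrix(d))  # same values as full n x n matrix

## the silhouette widths should agree between the two calls
all.equal(s1[, "sil_width"], s2[, "sil_width"])
```

Passing dmatrix can be convenient when a full symmetric matrix is already at hand, since no conversion to a dist object is needed first.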
Details
For each observation i, the silhouette width $s(i)$ is defined as follows:
Put $a(i)$ = average dissimilarity between i and all other points of the cluster to which i belongs (if i is the only observation in its cluster, $s(i) := 0$ without further calculations).
For all other clusters C, put $d(i,C)$ = average dissimilarity of i to all observations of C. The smallest of these $d(i,C)$ is $b(i) := \min_C d(i,C)$, and can be seen as the dissimilarity between i and its neighbor cluster, i.e., the nearest one to which it does not belong. Finally,
$$s(i) := \frac{b(i) - a(i)}{\max(a(i), b(i))}.$$
silhouette.default() is now based on C code donated by Romain Francois (the R version being still available as cluster:::silhouette.default.R).
Observations with a large $s(i)$ (almost 1) are very well clustered, a small $s(i)$ (around 0) means that the observation lies between two clusters, and observations with a negative $s(i)$ are probably placed in the wrong cluster.
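To make the definition concrete, here is a small base R sketch that computes $a(i)$, $b(i)$ and the usual silhouette ratio $s(i) = (b(i) - a(i)) / \max(a(i), b(i))$ by hand. The helper name sil_by_hand and the toy data are made up for illustration; this is not the package's implementation, which is in C.

```r
## Hand-rolled silhouette widths, following the definition term by term.
## `sil_by_hand` is a hypothetical helper, not part of the cluster package.
sil_by_hand <- function(cl, d) {
  dm <- as.matrix(d)                 # full n x n dissimilarity matrix
  n  <- length(cl)
  s  <- numeric(n)                   # s(i) stays 0 for singleton clusters
  for (i in seq_len(n)) {
    own <- cl == cl[i]
    own[i] <- FALSE                  # "all other points" of i's cluster
    if (!any(own)) next              # i is alone in its cluster: s(i) := 0
    a <- mean(dm[i, own])            # a(i)
    others <- setdiff(unique(cl), cl[i])
    ## d(i, C) for every other cluster C; b(i) is the smallest of these
    b <- min(vapply(others, function(C) mean(dm[i, cl == C]), numeric(1)))
    s[i] <- (b - a) / max(a, b)
  }
  s
}

## Toy data (invented): two well-separated groups on a line
x  <- c(1, 2, 3, 10, 11, 12)
cl <- c(1L, 1L, 1L, 2L, 2L, 2L)
s  <- sil_by_hand(cl, dist(x))
s

## Optional cross-check against the package, if it is installed
if (requireNamespace("cluster", quietly = TRUE)) {
  ref <- cluster::silhouette(cl, dist(x))[, "sil_width"]
  print(all.equal(unname(ref), s))
}
```

For the first point, $a(1) = (1 + 2)/2 = 1.5$ and $b(1) = (9 + 10 + 11)/3 = 10$, so $s(1) = 8.5/10 = 0.85$, i.e. a well-clustered observation.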
Value
silhouette() returns an object, sil, of class silhouette which is an [n x 3] matrix with attributes. For each observation i, sil[i,] contains the cluster to which i belongs as well as the neighbor cluster of i (the cluster, not containing i, for which the average dissimilarity between its observations and i is minimal), and the silhouette width $s(i)$ of the observation. The colnames correspondingly are c("cluster", "neighbor", "sil_width").
summary(sil) returns an object of class summary.silhouette, a list with components
- si.summary
- numerical summary of the individual silhouette widths $s(i)$.
- clus.avg.widths
- numeric (rank 1) array of clusterwise means of silhouette widths where mean = FUN is used.
- avg.width
- the total mean FUN(s) where s are the individual silhouette widths.
- clus.sizes
- table of the $k$ cluster sizes.
- call
- if available, the call creating sil.
- Ordered
- logical identical to attr(sil, "Ordered"), see below.
sortSilhouette(sil) orders the rows of sil as in the silhouette plot, by cluster (increasingly) and decreasing silhouette width $s(i)$.
attr(sil, "Ordered") is a logical indicating if sil is ordered as by sortSilhouette(). In that case, rownames(sil) will contain case labels or numbers, and attr(sil, "iOrd") the ordering index vector.
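A short sketch of how the returned objects might be inspected, assuming the cluster package is installed. The variable names (si, ssi, so) are illustrative only, and the default method is used here so that the result starts out unordered.

```r
library(cluster)
data(ruspini)

## default method: cluster codes plus a dissimilarity object
si <- silhouette(pam(ruspini, 4)$clustering, dist(ruspini))
colnames(si)                   # the three columns described above
head(si[, "sil_width"])        # individual silhouette widths s(i)

ssi <- summary(si)
ssi$avg.width                  # FUN (default: mean) over all s(i)
ssi$clus.avg.widths            # clusterwise means
ssi$clus.sizes                 # table of the k cluster sizes

so <- sortSilhouette(si)       # rows ordered as in the silhouette plot
attr(so, "Ordered")            # logical flag set by sortSilhouette()
head(attr(so, "iOrd"))         # ordering index into the original rows
```

The avg.width component is often what is compared across different values of $k$ when choosing the number of clusters.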
Note
While silhouette() is intrinsic to the partition clusterings, and hence has a (trivial) method for these, it is straightforward to get silhouettes from hierarchical clusterings from silhouette.default() with cutree() and distance as input.
By default, for clara() partitions, the silhouette is just for the best random subset used. Use full = TRUE to compute (and later possibly plot) the full silhouette.
References
Rousseeuw, P.J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20, 53--65.
Chapter 2 of Kaufman, L. and Rousseeuw, P.J. (1990), see the references in plot.agnes.
Examples
library(cluster)
data(ruspini)
pr4 <- pam(ruspini, 4)
str(si <- silhouette(pr4))
(ssi <- summary(si))
plot(si) # silhouette plot
plot(si, col = c("red", "green", "blue", "purple"))# with cluster-wise coloring
si2 <- silhouette(pr4$clustering, dist(ruspini, "canberra"))
summary(si2) # has small values: "canberra"'s fault
plot(si2, nmax= 80, cex.names=0.6)
op <- par(mfrow = c(3,2), oma = c(0,0, 3, 0),
          mgp = c(1.6,.8,0), mar = .1+c(4,2,2,2))
for(k in 2:6)
    plot(silhouette(pam(ruspini, k=k)), main = paste("k = ",k), do.n.k=FALSE)
mtext("PAM(Ruspini) as in Kaufman & Rousseeuw, p.101",
      outer = TRUE, font = par("font.main"), cex = par("cex.main")); frame()
## the same with cluster-wise colours:
c6 <- c("tomato", "forest green", "dark blue", "purple2", "goldenrod4", "gray20")
for(k in 2:6)
    plot(silhouette(pam(ruspini, k=k)), main = paste("k = ",k), do.n.k=FALSE,
         col = c6[1:k])
par(op)
## clara(): standard silhouette is just for the best random subset
data(xclara)
set.seed(7)
str(xc1k <- xclara[sample(nrow(xclara), size = 1000) ,])
cl3 <- clara(xc1k, 3)
plot(silhouette(cl3))# only of the "best" subset of 46
## The full silhouette: internally needs large (36 MB) dist object:
sf <- silhouette(cl3, full = TRUE) ## this is the same as
s.full <- silhouette(cl3$clustering, daisy(xc1k))
if(getRversion() >= "2.3.0") # compares versions numerically, not as strings
    stopifnot(all.equal(sf, s.full, check.attributes = FALSE, tolerance = 0))
## color dependent on original "3 groups of each 1000":
plot(sf, col = 2+ as.integer(names(cl3$clustering) ) %/% 1000,
main ="plot(silhouette(clara(.), full = TRUE))")
## Silhouette for a hierarchical clustering:
ar <- agnes(ruspini)
si3 <- silhouette(cutree(ar, k = 5), # k = 4 gave the same as pam() above
daisy(ruspini))
plot(si3, nmax = 80, cex.names = 0.5)
## 2 groups: Agnes() wasn't too good:
si4 <- silhouette(cutree(ar, k = 2), daisy(ruspini))
plot(si4, nmax = 80, cex.names = 0.5)