# silhouette

##### Compute or Extract Silhouette Information from Clustering

Compute silhouette information according to a given clustering in \(k\) clusters.

- Keywords
- cluster

##### Usage

```
silhouette(x, ...)
# S3 method for default
silhouette(x, dist, dmatrix, ...)
# S3 method for partition
silhouette(x, ...)
# S3 method for clara
silhouette(x, full = FALSE, ...)

sortSilhouette(object, ...)
# S3 method for silhouette
summary(object, FUN = mean, ...)
# S3 method for silhouette
plot(x, nmax.lab = 40, max.strlen = 5,
     main = NULL, sub = NULL, xlab = expression("Silhouette width "* s[i]),
     col = "gray", do.col.sort = length(col) > 1, border = 0,
     cex.names = par("cex.axis"), do.n.k = TRUE, do.clus.stat = TRUE, ...)
```

##### Arguments

- x
an object of appropriate class; for the `default` method an integer vector with \(k\) different integer cluster codes or a list with such an `x$clustering` component. Note that silhouette statistics are only defined if \(2 \le k \le n-1\).

- dist
a dissimilarity object inheriting from class `dist` or coercible to one. If not specified, `dmatrix` must be.

- dmatrix
a symmetric dissimilarity matrix (\(n \times n\)), specified instead of `dist`, which can be more efficient.

- full
logical specifying if a *full* silhouette should be computed for a `clara` object. Note that this requires \(O(n^2)\) memory, since the full dissimilarity (see `daisy`) is needed internally.

- object
an object of class `silhouette`.

- ...
further arguments passed to and from methods.

- FUN
function used to summarize silhouette widths.

- nmax.lab
integer indicating the number of labels which is considered too large for single-name labeling the silhouette plot.

- max.strlen
positive integer giving the length to which strings are truncated in silhouette plot labeling.

- main, sub, xlab
arguments to `title`; have a sensible non-NULL default here.

- col, border, cex.names
arguments passed to `barplot()`; note that the default used to be `col = heat.colors(n), border = par("fg")` instead. `col` can also be a color vector of length \(k\) for clusterwise coloring, see also `do.col.sort`.

- do.col.sort
logical indicating if the colors `col` should be sorted “along” the silhouette; this is useful for casewise or clusterwise coloring.

- do.n.k
logical indicating if \(n\) and \(k\) “title text” should be written.

- do.clus.stat
logical indicating if cluster size and averages should be written right to the silhouettes.
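
As a small sketch of the `dist` and `dmatrix` arguments of the default method, the following passes the same dissimilarities in both forms and checks that the resulting silhouette widths agree. The use of `ruspini`, `pam()` and four clusters is only an illustrative assumption.

```r
library(cluster)   # provides silhouette(), pam() and the ruspini data

data(ruspini)
cl <- pam(ruspini, 4)$clustering   # integer cluster codes, one per observation
d  <- dist(ruspini)                # Euclidean dissimilarities of class "dist"

# Same dissimilarities, supplied once as a "dist" object and once as a full matrix
sil_dist <- silhouette(cl, dist = d)
sil_dmat <- silhouette(cl, dmatrix = as.matrix(d))

# Compare the silhouette widths from both calls
same_widths <- all.equal(unname(sil_dist[, "sil_width"]),
                         unname(sil_dmat[, "sil_width"]))
same_widths
```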

##### Details

For each observation i, the *silhouette width* \(s(i)\) is
defined as follows:
Put \(a(i)\) = average dissimilarity between i and all other points of the
cluster to which i belongs (if i is the *only* observation in
its cluster, \(s(i) := 0\) without further calculations).
For all *other* clusters C, put \(d(i,C)\) = average
dissimilarity of i to all observations of C. The smallest of these
\(d(i,C)\) is \(b(i) := \min_C d(i,C)\),
and can be seen as the dissimilarity between i and its “neighbor”
cluster, i.e., the nearest one to which it does *not* belong.
Finally, $$s(i) := \frac{b(i) - a(i)}{\max(a(i), b(i))}.$$
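
The definition above can be sketched directly in R. The helper `sil_by_hand()` is hypothetical (it is not part of the `cluster` package) and is meant only to mirror the formula step by step; it does not handle the degenerate case where both \(a(i)\) and \(b(i)\) are zero. The result is compared with `silhouette()` on the `ruspini` data.

```r
library(cluster)   # provides silhouette(), pam() and the ruspini data

# Hypothetical helper, for illustration only: silhouette widths from the definition.
# cl: vector of cluster codes; d: a "dist" object or symmetric dissimilarity matrix.
sil_by_hand <- function(cl, d) {
  d  <- as.matrix(d)
  ks <- sort(unique(cl))
  s  <- numeric(length(cl))
  for (i in seq_along(cl)) {
    own <- cl == cl[i]
    if (sum(own) == 1) {          # i is alone in its cluster: s(i) := 0
      s[i] <- 0
      next
    }
    # a(i): average dissimilarity to the *other* members of i's own cluster
    # (d[i, i] is 0, so only the divisor needs to exclude i)
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b(i): smallest average dissimilarity d(i, C) over all other clusters C
    b <- min(sapply(setdiff(ks, cl[i]), function(k) mean(d[i, cl == k])))
    s[i] <- (b - a) / max(a, b)
  }
  s
}

data(ruspini)
cl <- pam(ruspini, 4)$clustering
d  <- dist(ruspini)

s_manual <- sil_by_hand(cl, d)
s_pkg    <- unname(silhouette(cl, d)[, "sil_width"])
all.equal(s_manual, s_pkg)
```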

`silhouette.default()` is now based on C code donated by Romain Francois (the R version being still available as `cluster:::silhouette.default.R`).

Observations with a large \(s(i)\) (almost 1) are very well clustered, a small \(s(i)\) (around 0) means that the observation lies between two clusters, and observations with a negative \(s(i)\) are probably placed in the wrong cluster.
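
A common heuristic that follows from this interpretation, though not something this page prescribes, is to compare the average silhouette width across several values of \(k\). The sketch below does so for `pam()` on the `ruspini` data; the range `2:6` and the variable names are illustrative assumptions.

```r
library(cluster)   # provides silhouette(), pam() and the ruspini data

data(ruspini)
ks <- 2:6

# Average silhouette width of the PAM clustering for each candidate k
avg_width <- sapply(ks, function(k) {
  summary(silhouette(pam(ruspini, k = k)))$avg.width
})
names(avg_width) <- ks
avg_width

# The k with the largest average width is one candidate for a "natural" number of clusters
best_k <- ks[which.max(avg_width)]
best_k
```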

##### Value

`silhouette()` returns an object, `sil`, of class `silhouette` which is an \(n \times 3\) matrix with attributes. For each observation i, `sil[i,]` contains the cluster to which i belongs as well as the neighbor cluster of i (the cluster, not containing i, for which the average dissimilarity between its observations and i is minimal), and the silhouette width \(s(i)\) of the observation. The `colnames` correspondingly are `c("cluster", "neighbor", "sil_width")`.
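
A short sketch of working with the returned matrix, assuming a `pam()` clustering of `ruspini` with four clusters: the three columns are accessed by name, and observations with a negative width (if any) are listed as candidates for being in the wrong cluster.

```r
library(cluster)   # provides silhouette(), pam() and the ruspini data

data(ruspini)
sil <- silhouette(pam(ruspini, 4))

colnames(sil)            # the three documented column names
head(sil[, "cluster"])   # cluster of the first few rows
head(sil[, "neighbor"])  # nearest other cluster for the same rows

# Row indices of observations with negative silhouette width (possibly none)
negative_rows <- which(sil[, "sil_width"] < 0)
negative_rows
```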

`summary(sil)` returns an object of class `summary.silhouette`, a list with components

- `si.summary`
numerical `summary` of the individual silhouette widths \(s(i)\).

- `clus.avg.widths`
numeric (rank 1) array of clusterwise *means* of silhouette widths where `mean = FUN` is used.

- `avg.width`
the total mean `FUN(s)` where `s` are the individual silhouette widths.

- `clus.sizes`
`table` of the \(k\) cluster sizes.

- `call`
if available, the `call` creating `sil`.

- `Ordered`
logical identical to `attr(sil, "Ordered")`, see below.
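
The components listed above can be read off a summary object as in the following sketch. The data set, the number of clusters, and the use of `median` as an alternative `FUN` are illustrative assumptions.

```r
library(cluster)   # provides silhouette(), pam() and the ruspini data

data(ruspini)
sil <- silhouette(pam(ruspini, 4))

ssi <- summary(sil)        # FUN = mean by default
ssi$avg.width              # overall average silhouette width
ssi$clus.avg.widths        # one average per cluster
ssi$clus.sizes             # table of cluster sizes

# A different summary function, here the median instead of the mean
ssi_med <- summary(sil, FUN = median)
ssi_med$avg.width
```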

`sortSilhouette(sil)` orders the rows of `sil` as in the silhouette plot, by cluster (increasingly) and decreasing silhouette width \(s(i)\). `attr(sil, "Ordered")` is a logical indicating if `sil` *is* ordered as by `sortSilhouette()`. In that case, `rownames(sil)` will contain case labels or numbers, and `attr(sil, "iOrd")` the ordering index vector.
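
The ordering can be inspected as in this sketch, which starts from a silhouette computed by the default method (rows in input order) and then sorts it. The data and cluster count are again only illustrative.

```r
library(cluster)   # provides silhouette(), sortSilhouette(), pam(), ruspini

data(ruspini)
cl <- pam(ruspini, 4)$clustering
si <- silhouette(cl, dist(ruspini))   # default method: rows follow the input order

so <- sortSilhouette(si)
attr(so, "Ordered")                   # flag indicating the rows are sorted
iOrd <- attr(so, "iOrd")              # index vector mapping sorted rows to original rows

# Sorted rows: clusters increase, and widths decrease within each cluster
head(so[, "cluster"])
head(so[, "sil_width"])
```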

##### Note

While `silhouette()` is *intrinsic* to the `partition` clusterings, and hence has a (trivial) method for these, it is straightforward to get silhouettes from hierarchical clusterings from `silhouette.default()` with `cutree()` and distance as input.
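
As a minimal sketch of this route, the following uses `hclust()` from the `stats` package rather than a `cluster` function; the average linkage and the cut at four clusters are assumptions made for illustration.

```r
library(cluster)   # provides silhouette() and the ruspini data

data(ruspini)
d  <- dist(ruspini)
hc <- hclust(d, method = "average")   # hierarchical clustering from 'stats'

# Cut the tree into 4 groups and feed codes plus dissimilarities to the default method
si_hc <- silhouette(cutree(hc, k = 4), d)
summary(si_hc)$avg.width
```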

By default, for `clara()` partitions, the silhouette is just for the best random *subset* used. Use `full = TRUE` to compute (and later possibly plot) the full silhouette.

##### References

Rousseeuw, P.J. (1987)
Silhouettes: A graphical aid to the interpretation and validation of
cluster analysis. *J. Comput. Appl. Math.*, **20**, 53--65.

Chapter 2 of Kaufman and Rousseeuw (1990), see the references in `plot.agnes`.

##### See Also

##### Examples

```
library(cluster)  # pam(), clara(), agnes(), daisy(), silhouette(), and the data sets

data(ruspini)
pr4 <- pam(ruspini, 4)
str(si <- silhouette(pr4))
(ssi <- summary(si))
plot(si) # silhouette plot
plot(si, col = c("red", "green", "blue", "purple")) # with cluster-wise coloring

si2 <- silhouette(pr4$clustering, dist(ruspini, "canberra"))
summary(si2) # has small values: "canberra"'s fault
plot(si2, nmax.lab = 80, cex.names = 0.6)

op <- par(mfrow = c(3,2), oma = c(0,0, 3, 0),
          mgp = c(1.6,.8,0), mar = .1+c(4,2,2,2))
for(k in 2:6)
   plot(silhouette(pam(ruspini, k=k)), main = paste("k = ",k), do.n.k=FALSE)
mtext("PAM(Ruspini) as in Kaufman & Rousseeuw, p.101",
      outer = TRUE, font = par("font.main"), cex = par("cex.main")); frame()

## the same with cluster-wise colours:
c6 <- c("tomato", "forest green", "dark blue", "purple2", "goldenrod4", "gray20")
for(k in 2:6)
   plot(silhouette(pam(ruspini, k=k)), main = paste("k = ",k), do.n.k=FALSE,
        col = c6[1:k])
par(op)

## clara(): standard silhouette is just for the best random subset
data(xclara)
set.seed(7)
str(xc1k <- xclara[ sample(nrow(xclara), size = 1000) ,]) # rownames == indices
cl3 <- clara(xc1k, 3)
plot(silhouette(cl3)) # only of the "best" subset of 46
## The full silhouette: internally needs large (36 MB) dist object:
sf <- silhouette(cl3, full = TRUE) ## this is the same as
s.full <- silhouette(cl3$clustering, daisy(xc1k))
stopifnot(all.equal(sf, s.full, check.attributes = FALSE, tolerance = 0))
## color dependent on original "3 groups of each 1000":
plot(sf, col = 2+ as.integer(names(cl3$clustering) ) %/% 1000,
     main ="plot(silhouette(clara(.), full = TRUE))")

## Silhouette for a hierarchical clustering:
ar <- agnes(ruspini)
si3 <- silhouette(cutree(ar, k = 5), # k = 4 gave the same as pam() above
                  daisy(ruspini))
plot(si3, nmax.lab = 80, cex.names = 0.5)
## 2 groups: Agnes() wasn't too good:
si4 <- silhouette(cutree(ar, k = 2), daisy(ruspini))
plot(si4, nmax.lab = 80, cex.names = 0.5)
```

*Documentation reproduced from package cluster, version 2.0.7-1, License: GPL (>= 2)*