# silhouette

##### Compute or Extract Silhouette Information from Clustering

Compute silhouette information according to a given clustering in $k$ clusters.

- Keywords
- cluster

##### Usage

```
silhouette(x, ...)
## S3 method for class 'default':
silhouette(x, dist, dmatrix, \dots)
## S3 method for class 'partition':
silhouette(x, \dots)
## S3 method for class 'clara':
silhouette(x, full = FALSE, \dots)
```sortSilhouette(object, ...)
## S3 method for class 'silhouette':
summary(object, FUN = mean, \dots)
## S3 method for class 'silhouette':
plot(x, nmax.lab = 40, max.strlen = 5,
main = NULL, sub = NULL, xlab = expression("Silhouette width "* s[i]),
col = "gray", do.col.sort = length(col) > 1, border = 0,
cex.names = par("cex.axis"), do.n.k = TRUE, do.clus.stat = TRUE, ...)

##### Arguments

- x
- an object of appropriate class; for the
`default`

method an integer vector with $k$ different integer cluster codes or a list with such an`x$clustering`

component. Note that silhouette statistics are only defined if - dist
- a dissimilarity object inheriting from class
`dist`

or coercible to one. If not specified,`dmatrix`

must be. - dmatrix
- a symmetric dissimilarity matrix ($n \times n$),
specified instead of
`dist`

, which can be more efficient. - full
- logical specifying if a
*full*silhouette should be computed for`clara`

object. Note that this requires $O(n^2)$ memory, since the full dissimilarity (see - object
- an object of class
`silhouette`

. - ...
- further arguments passed to and from methods.
- FUN
- function used to summarize silhouette widths.
- nmax.lab
- integer indicating the number of labels which is considered too large for single-name labeling the silhouette plot.
- max.strlen
- positive integer giving the length to which strings are truncated in silhouette plot labeling.
- main, sub, xlab
- arguments to
`title`

; have a sensible non-NULL default here. - col, border, cex.names
- arguments passed
`barplot()`

; note that the default used to be`col = heat.colors(n), border = par("fg")`

instead.`col`

can also be a color vector of length $k$ for cluste - do.col.sort
- logical indicating if the colors
`col`

should be sortedalong the silhouette; this is useful for casewise or clusterwise coloring. - do.n.k
- logical indicating if $n$ and $k$
title text should be written. - do.clus.stat
- logical indicating if cluster size and averages should be written right to the silhouettes.

##### Details

For each observation i, the *silhouette width* $s(i)$ is
defined as follows:
Put a(i) = average dissimilarity between i and all other points of the
cluster to which i belongs (if i is the *only* observation in
its cluster, $s(i) := 0$ without further calculations).
For all *other* clusters C, put $d(i,C)$ = average
dissimilarity of i to all observations of C. The smallest of these
$d(i,C)$ is $b(i) := \min_C d(i,C)$,
and can be seen as the dissimilarity between i and its *not* belong.
Finally, $$s(i) := \frac{b(i) - a(i) }{max(a(i), b(i))}.$$

`silhouette.default()`

is now based on C code donated by Romain
Francois (the R version being still available as
`cluster:::silhouette.default.R`

).

Observations with a large $s(i)$ (almost 1) are very well clustered, a small $s(i)$ (around 0) means that the observation lies between two clusters, and observations with a negative $s(i)$ are probably placed in the wrong cluster.

##### Value

`silhouette()`

returns an object,`sil`

, of class`silhouette`

which is an [n x 3] matrix with attributes. For each observation i,`sil[i,]`

contains the cluster to which i belongs as well as the neighbor cluster of i (the cluster, not containing i, for which the average dissimilarity between its observations and i is minimal), and the silhouette width $s(i)$ of the observation. The`colnames`

correspondingly are`c("cluster", "neighbor", "sil_width")`

.`summary(sil)`

returns an object of class`summary.silhouette`

, a list with componentssi.summary numerical `summary`

of the individual silhouette widths $s(i)$.clus.avg.widths numeric (rank 1) array of clusterwise *means*of silhouette widths where`mean = FUN`

is used.avg.width the total mean `FUN(s)`

where`s`

are the individual silhouette widths.clus.sizes `table`

of the $k$ cluster sizes.call if available, the call creating `sil`

.Ordered logical identical to `attr(sil, "Ordered")`

, see below.`sortSilhouette(sil)`

orders the rows of`sil`

as in the silhouette plot, by cluster (increasingly) and decreasing silhouette width $s(i)$.`attr(sil, "Ordered")`

is a logical indicating if`sil`

*is*ordered as by`sortSilhouette()`

. In that case,`rownames(sil)`

will contain case labels or numbers, and`attr(sil, "iOrd")`

the ordering index vector.

##### Note

While `silhouette()`

is *intrinsic* to the
`partition`

clusterings, and hence has a (trivial) method
for these, it is straightforward to get silhouettes from hierarchical
clusterings from `silhouette.default()`

with
`cutree()`

and distance as input.

By default, for `clara()`

partitions, the silhouette is
just for the best random *subset* used. Use `full = TRUE`

to compute (and later possibly plot) the full silhouette.

##### References

Rousseeuw, P.J. (1987)
Silhouettes: A graphical aid to the interpretation and validation of
cluster analysis. *J. Comput. Appl. Math.*, **20**, 53--65.

chapter 2 of Kaufman, L. and Rousseeuw, P.J. (1990), see
the references in `plot.agnes`

.

##### See Also

##### Examples

```
data(ruspini)
pr4 <- pam(ruspini, 4)
str(si <- silhouette(pr4))
(ssi <- summary(si))
plot(si) # silhouette plot
plot(si, col = c("red", "green", "blue", "purple"))# with cluster-wise coloring
si2 <- silhouette(pr4$clustering, dist(ruspini, "canberra"))
summary(si2) # has small values: "canberra"'s fault
plot(si2, nmax= 80, cex.names=0.6)
op <- par(mfrow= c(3,2), oma= c(0,0, 3, 0),
mgp= c(1.6,.8,0), mar= .1+c(4,2,2,2))
for(k in 2:6)
plot(silhouette(pam(ruspini, k=k)), main = paste("k = ",k), do.n.k=FALSE)
mtext("PAM(Ruspini) as in Kaufman & Rousseeuw, p.101",
outer = TRUE, font = par("font.main"), cex = par("cex.main")); frame()
## the same with cluster-wise colours:
c6 <- c("tomato", "forest green", "dark blue", "purple2", "goldenrod4", "gray20")
for(k in 2:6)
plot(silhouette(pam(ruspini, k=k)), main = paste("k = ",k), do.n.k=FALSE,
col = c6[1:k])
par(op)
## clara(): standard silhouette is just for the best random subset
data(xclara)
set.seed(7)
str(xc1k <- xclara[sample(nrow(xclara), size = 1000) ,])
cl3 <- clara(xc1k, 3)
plot(silhouette(cl3))# only of the "best" subset of 46
## The full silhouette: internally needs large (36 MB) dist object:
sf <- silhouette(cl3, full = TRUE) ## this is the same as
s.full <- silhouette(cl3$clustering, daisy(xc1k))
if(paste(R.version$major, R.version$minor, sep=".") >= "2.3.0")
stopifnot(all.equal(sf, s.full, check.attributes = FALSE, tol = 0))
## color dependent on original "3 groups of each 1000":
plot(sf, col = 2+ as.integer(names(cl3$clustering) ) %/% 1000,
main ="plot(silhouette(clara(.), full = TRUE))")
## Silhouette for a hierarchical clustering:
ar <- agnes(ruspini)
si3 <- silhouette(cutree(ar, k = 5), # k = 4 gave the same as pam() above
daisy(ruspini))
plot(si3, nmax = 80, cex.names = 0.5)
## 2 groups: Agnes() wasn't too good:
si4 <- silhouette(cutree(ar, k = 2), daisy(ruspini))
plot(si4, nmax = 80, cex.names = 0.5)
```

*Documentation reproduced from package cluster, version 1.14.4, License: GPL (>= 2)*