Agglomerative hierarchical clustering of a matrix of dissimilarities.
linkage(prox, method = "arithmetic", weighted = FALSE,
par.method = 0, digits = NULL)
Object of class "dist"
containing the lower triangle
of a proximity matrix in the form of distances.
Character string specifying the linkage method to be used. This should be
one of: "versatile"
, "single"
, "complete"
,
"arithmetic"
(default), "geometric"
, "harmonic"
,
"ward"
, "centroid"
or "flexible"
. See the
Details section.
Logical to choose between the weighted and the unweighted (default)
versions of some clustering methods. Weighted clustering gives merging
branches in a hierarchical tree equal weight regardless of the number of
individuals carried on each branch. Such a procedure weights the
individuals unequally, contrasting with unweighted clustering that gives
equal weight to each individual in the clusters. This parameter has no
effect on the "single"
, "complete"
and "ward"
linkages.
A real value in the range [-1, 1] required as parameter for the methods
"versatile"
and "flexible"
. See the Details section.
Integer specifying the precision, i.e. the number of significant decimal
digits of the data and for the calculations. This is a very important
parameter, since equal proximity values at a certain precision may become
different by increasing its value. Thus, it may be responsible of the
existence of tied distances. The rule should be not to use a precision
larger than the resolution given by the experimental setup that has
generated the data. If digits=NULL
(default), then the precision is
set to that of the data value with the largest number of significant
decimal digits.
Returns an object of class "dendrogram"
.
Starting from a matrix of dissimilarities, linkage()
calculates
its dendrogram with the most commonly used agglomerative hierarchical
clustering methods, e.g. single linkage, complete linkage, arithmetic
linkage (also known as average linkage) and Ward's method. You can also
choose between the weighted and the unweighted versions of some clustering
methods, e.g. weighted centroid (WPGMC) and unweighted centroid (UPGMC).
Importantly, it contains a new parameterized method named versatile linkage,
which includes single linkage, complete linkage and average linkage as
particular cases, and which naturally defines two new methods, geometric
linkage and harmonic linkage (hence the convenience to rename average linkage
as arithmetic linkage, to emphasize the existence of different types of
averages).
The difference between the available hierarchical clustering methods rests in the way the distance between clusters is defined. During the agglomeration process, the data items are iteratively joined to form clusters, merging first the clusters that are at the minimum distance. However, given two clusters, each one formed by several data observations, there exist many ways of defining the distance between the clusters from the dissimilarities between their constituent individuals. Among these linkage methods, we have the following ones:
"single"
: the distance between clusters equals the minimum
distance between individuals.
"complete"
: the distance between clusters equals the maximum
distance between individuals.
"arithmetic"
: the distance between clusters equals the
arithmetic mean distance between individuals. Also known as average
linkage, WPGMA (weighted version) or UPGMA (unweighted version).
"geometric"
: the distance between clusters equals the
geometric mean distance between individuals.
"harmonic"
: the distance between clusters equals the
harmonic mean distance between individuals.
"versatile"
: the distance between clusters equals the
generalized power mean distance between individuals. It depends on the
value of par.method
, with the following linkage methods as
particular cases: "complete"
(par.method=+1
),
"arithmetic"
(par.method=+0.1
), "geometric"
(par.method=0
), "harmonic"
(par.method=-0.1
) and
"single"
(par.method=-1
).
"ward"
: the distance between clusters is a weighted squared
Euclidean distance between the centroids of each cluster (Ward, 1963).
"centroid"
: the distance between clusters equals the square
of the Euclidean distance between the centroids of each cluster. Also
known as WPGMC (weighted version) or UPGMC (unweighted version).
"flexible"
: the distance between clusters is a weighted sum
of the distances between clusters in the previous iteration (Lance and
Williams, 1967; Belbin et al., 1992). It depends on the value of
par.method
, and it is equivalent to "arithmetic"
linkage
when par.method=0
.
Except for the cases containing ties in proximity values as described in the
next paragraph, the following equivalences hold between the
linkage()
function in this package, the
hclust()
function in the stats package, and the
agnes()
function in the cluster package. When
relevant, weighted (W
) or unweighted (U
) versions of the
linkage methods and the values for par.method
(\(\beta\)) are
indicated:
linkage |
hclust |
agnes |
================== |
============ |
=================== |
"single" |
"single" |
"single" |
"complete" |
"complete" |
"complete" |
"arithmetic", U |
"average" |
"average" |
"arithmetic", W |
"mcquitty" |
"weighted" |
"ward" |
"ward.D2" |
"ward" |
"centroid", U |
"centroid" |
-------- |
"centroid", W |
"median" |
-------- |
"flexible", U, \(\beta\) |
-------- |
"gaverage", \(\beta\) |
"flexible", W, \(\beta\) |
-------- |
"flexible", \((1-\beta)/2\) |
linkage()
implements the variable-group approach introduced in
Fernandez and Gomez (2008) to solve the non-uniqueness problem found in the
pair-group implementations. This problem arises when two or more minimum
distances between different clusters are equal during the amalgamation
process. The pair-group approach consists in choosing a pair, breaking the
ties between distances, and proceeds in the same way until the final
hierarchical classification is obtained. However, different dendrograms are
possible depending on the criterion used to break the ties (usually a pair is
just chosen at random). The variable-group approach groups more than two
clusters at the same time when ties occur, what always produces a uniquely
determined solution. When there are no ties, the variable-group approach
gives the same results as the pair-group one.
L. Belbin, D.P. Faith and G.W. Milligan (1992). A comparison of two approaches to beta-flexible clustering. Multivariate Behavioral Research, 27(3):417-433.
A. Fern<U+00E1>ndez and S. G<U+00F3>mez (2008). Solving non-uniqueness in agglomerative hierarchical clustering using multidendrograms. Journal of Classification, 25(1):43-65.
G.N. Lance and W.T. Williams (1967). A general theory of classificatory sorting strategies: 1. Hierarchical systems. The Computer Journal, 9(4):373-380.
J.H. Ward (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236-244.
dendesc
for descriptive measures to analyze dendrograms.
# NOT RUN {
## distances between 10 cities in the US
data(UScitiesD)
## unweighted arithmetic linkage (UPGMA)
lnk1 <- linkage(UScitiesD, method="arithmetic", weighted=FALSE)
plot(lnk1, main="linkage(arithmetic, U)")
## weighted arithmetic linkage (WPGMA)
lnk2 <- linkage(UScitiesD, method="arithmetic", weighted=TRUE)
## equivalence with hclust, except for the ordering of the leaves
hcl2 <- as.dendrogram(hclust(UScitiesD, method="mcquitty"))
sum(abs(ultrametric(lnk2) - ultrametric(hcl2))) # 0
opar <- par(mfrow=c(1, 2))
plot(lnk2, main="linkage(arithmetic, W)")
plot(hcl2, main="hclust(mcquitty)")
par(opar)
## unweighted versatile linkage, with par.method=-0.6
lnk3 <- linkage(UScitiesD, method="versatile", weighted=FALSE,
par.method=-0.6)
plot(lnk3, main="linkage(versatile, -0.6, U)")
## cophenetic correlation coefficient
cor(UScitiesD, ultrametric(lnk1)) # 0.8101937
cor(UScitiesD, ultrametric(lnk2)) # 0.8076422
cor(UScitiesD, ultrametric(lnk3)) # 0.8163286
# }
Run the code above in your browser using DataLab