fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python

Description

The fastcluster package provides efficient algorithms for hierarchical, agglomerative clustering. In addition to the R interface, the source distribution contains a Python interface to the underlying C++ library.

Author

Daniel Müllner

Details

The function hclust performs clustering when the input is a dissimilarity matrix. A dissimilarity matrix can be computed from vector data with dist. The hclust function is a drop-in replacement for the existing routines stats::hclust and flashClust::hclust (alias flashClust::flashClust). Once the fastcluster library is loaded at the beginning of the code, every program that uses hierarchical clustering benefits from the performance gain without further changes.

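When the package is loaded, its hclust masks stats::hclust, so unqualified calls to hclust run the new code. The original routine remains available under its qualified name, stats::hclust.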
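
As a minimal sketch of the drop-in behaviour described above, the following checks which package the unqualified name hclust resolves to after loading fastcluster, and compares the merge heights with those of stats::hclust on the same dissimilarity matrix. The variable names (hclust_origin, hc_fast, hc_base, heights_agree) are illustrative only, and the sketch assumes no other package attached later also masks hclust.

```r
library(fastcluster)

# After loading, the unqualified name should resolve to the fastcluster version.
hclust_origin <- environmentName(environment(hclust))

# Same dissimilarity matrix, clustered by both implementations.
d <- dist(USArrests)
hc_fast <- hclust(d, method = "average")         # fastcluster
hc_base <- stats::hclust(d, method = "average")  # original routine

# The merge heights should agree up to floating-point tolerance.
heights_agree <- isTRUE(all.equal(hc_fast$height, hc_base$height))

print(hclust_origin)
print(heights_agree)
```

Calling stats::hclust with its qualified name, as done here, is a simple way to compare the two implementations within one session.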

The function hclust.vector provides memory-saving routines when the input is vector data.
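
To illustrate the difference between the two entry points, the sketch below clusters the same vector data once directly with hclust.vector and once through an explicit dissimilarity matrix. Single linkage with the Euclidean metric is chosen here only as an example, and the variable names are illustrative. The memory saving comes from hclust.vector never materializing the full dist object, which holds n(n-1)/2 entries for n observations.

```r
library(fastcluster)

X <- as.matrix(USArrests)

# Cluster directly from the vector data; no dissimilarity matrix is stored.
hc_vec <- hclust.vector(X, method = "single", metric = "euclidean")

# Equivalent route through an explicit dissimilarity matrix.
hc_dis <- hclust(dist(X, method = "euclidean"), method = "single")

# Both routes should give the same merge heights up to floating-point tolerance.
same_heights <- isTRUE(all.equal(hc_vec$height, hc_dis$height))
print(same_heights)
```

For a data set as small as USArrests the difference is negligible; the direct route is aimed at inputs where the dissimilarity matrix would be too large to hold comfortably in memory.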

Further information:

R documentation pages: hclust, hclust.vector
A comprehensive User's manual: fastcluster.pdf. Get this from the R command line with vignette('fastcluster').
JSS paper: https://www.jstatsoft.org/v53/i09/.
See the author's home page for a performance comparison: http://danifold.net/fastcluster.html.

References

http://danifold.net/fastcluster.html

Examples

# Taken and modified from stats::hclust
#
# hclust(...)        # new method
# hclust.vector(...) # new method
# stats::hclust(...) # old method

require(fastcluster)
require(graphics)

hc <- hclust(dist(USArrests), "ave")
plot(hc)
plot(hc, hang = -1)

## Do the same with centroid clustering and squared Euclidean distance,
## cut the tree into ten clusters and reconstruct the upper part of the
## tree from the cluster centers.
hc <- hclust.vector(USArrests, "cen")
# squared Euclidean distances
hc$height <- hc$height^2
memb <- cutree(hc, k = 10)
cent <- NULL
for(k in 1:10){
  cent <- rbind(cent, colMeans(USArrests[memb == k, , drop = FALSE]))
}
hc1 <- hclust.vector(cent, method = "cen", members = table(memb))
# squared Euclidean distances
hc1$height <- hc1$height^2
opar <- par(mfrow = c(1, 2))
plot(hc,  labels = FALSE, hang = -1, main = "Original Tree")
plot(hc1, labels = FALSE, hang = -1, main = "Re-start from 10 clusters")
par(opar)
