
Hierarchical cluster analysis.
hcluster(x, method = "euclidean", diag = FALSE, upper = FALSE,
         link = "complete", members = NULL, nbproc = 2,
         doubleprecision = TRUE)
An object of class hclust which describes the tree produced by the clustering process. The object is a list with components:
merge: an n-1 by 2 matrix; row i of merge describes the merging of clusters at step i of the clustering. Negative entries in merge indicate agglomerations of singletons, and positive entries indicate agglomerations of non-singletons.
height: a set of n-1 non-decreasing real values, the value of the criterion associated with the clustering method for the particular agglomeration.
order: a vector giving the permutation of the original observations suitable for plotting, in the sense that a cluster plot using this ordering and matrix merge will not have crossings of the branches.
labels: labels for each of the objects being clustered.
call: the call which produced the result.
method: the cluster method that has been used.
dist.method: the distance that has been used to create d (only returned if the distance object has a "method" attribute).
There is a print and a plot method for hclust objects. The plclust() function is basically the same as the plot method, plot.hclust, provided primarily for back compatibility with S-PLUS. Its extra arguments are not yet implemented.
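A quick sketch of how these components and methods can be inspected on the returned object (assuming the amap package is installed and loaded; it uses the USArrests data set from the examples below):

library(amap)
data(USArrests)
hc <- hcluster(USArrests, link = "complete")
hc$merge[1:3, ]   ## first three agglomeration steps (negative entries are singletons)
hc$height[1:3]    ## criterion values for those steps
head(hc$order)    ## permutation of the observations used for plotting
head(hc$labels)   ## labels of the clustered objects
print(hc)         ## print method for hclust objects
plot(hc)          ## plot method (dendrogram)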
x: a numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns), or an object of class "exprSet".
method: the distance measure to be used. This must be one of "euclidean", "maximum", "manhattan", "canberra", "binary", "pearson", "abspearson", "correlation", "abscorrelation", "spearman" or "kendall". Any unambiguous substring can be given.
diag: logical value indicating whether the diagonal of the distance matrix should be printed by print.dist.
upper: logical value indicating whether the upper triangle of the distance matrix should be printed by print.dist.
link: the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward", "single", "complete", "average", "mcquitty", "median", "centroid" or "centroid2".
members: NULL, or a vector with length the size of d.
nbproc: integer, the number of subprocesses used for parallelization (Linux and Mac only).
doubleprecision: TRUE: use double precision for the distance matrix computation; FALSE: use single precision.
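As an illustrative sketch of these arguments (the values shown are the defaults from the usage line except for link and nbproc), a fully spelled-out call might look like this:

library(amap)
data(USArrests)
hc <- hcluster(USArrests,
               method = "euclidean",    ## distance measure
               diag = FALSE,            ## printing of the diagonal by print.dist
               upper = FALSE,           ## printing of the upper triangle by print.dist
               link = "average",        ## agglomeration method
               members = NULL,
               nbproc = 1,              ## number of subprocesses (Linux and Mac only)
               doubleprecision = TRUE)  ## double precision for the distance computation
class(hc)                               ## "hclust"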
The hcluster function is based on C code adapted from a CRAN Fortran routine by Antoine Lucas. The function is a mix of hclust and Dist: hcluster(x, method = "euclidean", link = "complete") is equivalent to hclust(dist(x, method = "euclidean"), method = "complete"). It uses about half as much memory, as it does not store the distance matrix. For more details, see the documentation of hclust and Dist.
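The equivalence stated above can be checked directly; a minimal sketch (assuming the amap package is loaded) compares the two trees:

library(amap)
data(USArrests)
hc1 <- hcluster(USArrests, method = "euclidean", link = "complete")
hc2 <- hclust(dist(USArrests, method = "euclidean"), method = "complete")
all.equal(hc1$height, hc2$height)   ## merge heights should agree
all.equal(hc1$merge, hc2$merge)     ## agglomeration steps should agree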
Antoine Lucas and Sylvain Jasson, Using amap and ctc Packages for Huge Clustering, R News, 2006, vol. 6, issue 5, pages 58-60.
data(USArrests)
hc <- hcluster(USArrests, link = "ave")
plot(hc)
plot(hc, hang = -1)
## Do the same with centroid clustering and squared Euclidean distance,
## cut the tree into ten clusters and reconstruct the upper part of the
## tree from the cluster centers.
hc <- hclust(dist(USArrests)^2, "cen")
memb <- cutree(hc, k = 10)
cent <- NULL
for(k in 1:10){
  cent <- rbind(cent, colMeans(USArrests[memb == k, , drop = FALSE]))
}
hc1 <- hclust(dist(cent)^2, method = "cen", members = table(memb))
opar <- par(mfrow = c(1, 2))
plot(hc, labels = FALSE, hang = -1, main = "Original Tree")
plot(hc1, labels = FALSE, hang = -1, main = "Re-start from 10 clusters")
par(opar)
## other combinations are possible
hc <- hcluster(USArrests, method = "euc", link = "ward", nbproc = 1,
               doubleprecision = TRUE)
hc <- hcluster(USArrests, method = "max", link = "single", nbproc = 2,
               doubleprecision = TRUE)
hc <- hcluster(USArrests, method = "man", link = "complete", nbproc = 1,
               doubleprecision = TRUE)
hc <- hcluster(USArrests, method = "can", link = "average", nbproc = 2,
               doubleprecision = TRUE)
hc <- hcluster(USArrests, method = "bin", link = "mcquitty", nbproc = 1,
               doubleprecision = FALSE)
hc <- hcluster(USArrests, method = "pea", link = "median", nbproc = 2,
               doubleprecision = FALSE)
hc <- hcluster(USArrests, method = "abspea", link = "median", nbproc = 2,
               doubleprecision = FALSE)
hc <- hcluster(USArrests, method = "cor", link = "centroid", nbproc = 1,
               doubleprecision = FALSE)
hc <- hcluster(USArrests, method = "abscor", link = "centroid", nbproc = 1,
               doubleprecision = FALSE)
hc <- hcluster(USArrests, method = "spe", link = "complete", nbproc = 2,
               doubleprecision = FALSE)
hc <- hcluster(USArrests, method = "ken", link = "complete", nbproc = 2,
               doubleprecision = FALSE)