energy (version 1.6.2)

energy.hclust: Hierarchical Clustering by Minimum (Energy) E-distance

Description

Performs agglomerative hierarchical clustering by the minimum (energy) E-distance method.

Usage

energy.hclust(dst, alpha = 1)

Arguments

dst
Euclidean distances in a dist object, or a distance matrix produced by dist, or lower triangle of distance matrix as vector in column order. If dst is a square matrix, the lower triangle is inter
alpha
distance exponent
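
The accepted forms of dst can be related to one another in base R. The following is a minimal sketch, not taken from the package: the toy matrix x and the names d_dist, d_mat, and d_vec are hypothetical, and only base functions (dist, as.matrix, as.vector, lower.tri) are used.

```r
# Toy data: 4 observations in 2 dimensions (values are arbitrary).
x <- matrix(c(0, 0,
              3, 4,
              6, 8,
              0, 1), ncol = 2, byrow = TRUE)

d_dist <- dist(x)              # "dist" object (Euclidean by default)
d_mat  <- as.matrix(d_dist)    # full symmetric n-by-n distance matrix
d_vec  <- as.vector(d_dist)    # lower triangle stacked column by column

# The vector form holds n * (n - 1) / 2 entries, in the same order as
# the lower triangle of the square matrix read column by column.
n <- nrow(x)
length(d_vec) == n * (n - 1) / 2
all.equal(d_vec, d_mat[lower.tri(d_mat)])
```

Under these assumptions, each of the three objects should describe the same set of pairwise distances.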

Value

An object of class hclust which describes the tree produced by the clustering process. The object is a list with components:

  • merge: an n-1 by 2 matrix, where row i of merge describes the merging of clusters at step i of the clustering. If an element j in the row is negative, then observation -j was merged at this stage. If j is positive, then the merge was with the cluster formed at the (earlier) stage j of the algorithm.
  • height: the clustering height: a vector of n-1 non-decreasing real numbers (the e-distance between merging clusters).
  • order: a vector giving a permutation of the indices of the original observations suitable for plotting, in the sense that a cluster plot using this ordering and the matrix merge will not have crossings of the branches.
  • labels: labels for each of the objects being clustered.
  • call: the call which produced the result.
  • method: the cluster method that has been used (e-distance).
  • dist.method: the distance that has been used to create dst.
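
The components listed above can be inspected on any hclust object. The sketch below uses stats::hclust with the group average method as a stand-in, on the assumption that only the structure of the returned list matters here; the toy data and the name hc are hypothetical, and the heights are average-linkage heights, not e-distances.

```r
# Stand-in example: five points on a line, clustered with stats::hclust.
# The returned list has the same "hclust" class and component names.
x  <- matrix(c(0, 1, 5, 6, 20), ncol = 1)
hc <- hclust(dist(x), method = "average")

hc$merge           # (n - 1) x 2 matrix: negative = observation, positive = earlier step
hc$height          # n - 1 non-decreasing merge heights
hc$order           # leaf ordering used when plotting the tree
cutree(hc, k = 2)  # cluster labels after cutting the tree into two groups
```

A negative entry such as -3 in a row of merge refers to observation 3, while a positive entry such as 1 refers to the cluster formed at step 1.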

concept

energy statistics

Details

Dissimilarities are $d(x,y) = \|x-y\|^\alpha$, where the exponent $\alpha$ is in the interval (0,2].

This function performs agglomerative hierarchical clustering. Initially, each of the n singletons is a cluster. At each of n-1 steps, the procedure merges the pair of clusters with minimum e-distance. The e-distance between two clusters $C_i, C_j$ of sizes $n_i, n_j$ is given by

$$e(C_i, C_j)=\frac{n_i n_j}{n_i+n_j}[2M_{ij}-M_{ii}-M_{jj}],$$

where

$$M_{ij}=\frac{1}{n_i n_j}\sum_{p=1}^{n_i} \sum_{q=1}^{n_j} \|X_{ip}-X_{jq}\|^\alpha,$$

$\|\cdot\|$ denotes the Euclidean norm, and $X_{ip}$ denotes the p-th observation in the i-th cluster.

The return value is an object of class hclust, so hclust methods such as print or plot methods, plclust, and cutree are available. See the documentation for hclust.

The e-distance measures both the heterogeneity between clusters and the homogeneity within clusters. $\mathcal E$-clustering ($\alpha=1$) is particularly effective in high dimension, and is more effective than some standard hierarchical methods when clusters have equal means (see example below). For other advantages see the references.
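
The e-distance formula above can be evaluated directly from a distance matrix. The following is a minimal base-R sketch of that definition, not the package's implementation: the helper edist_pair and its arguments are hypothetical names introduced for illustration.

```r
# e-distance between two clusters, computed from the definition.
# D: full n-by-n matrix of Euclidean distances.
# i, j: index vectors of the observations in clusters C_i and C_j.
edist_pair <- function(D, i, j, alpha = 1) {
  stopifnot(alpha > 0, alpha <= 2)
  Da  <- D^alpha
  ni  <- length(i)
  nj  <- length(j)
  Mij <- mean(Da[i, j, drop = FALSE])  # between-cluster average
  Mii <- mean(Da[i, i, drop = FALSE])  # within-cluster average (zero diagonal included)
  Mjj <- mean(Da[j, j, drop = FALSE])
  (ni * nj / (ni + nj)) * (2 * Mij - Mii - Mjj)
}

x <- matrix(c(0, 1, 5, 6), ncol = 1)   # four points on a line
D <- as.matrix(dist(x))
edist_pair(D, 1, 2)        # two singletons: reduces to the distance^alpha, here 1
edist_pair(D, 1:2, 3:4)    # clusters {0, 1} and {5, 6}: (2*2/4) * (2*5 - 0.5 - 0.5) = 9
```

For two singletons the within-cluster terms vanish, so the e-distance equals $\|x-y\|^\alpha$, which is why the initial merge heights coincide with the input dissimilarities.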

References

Szekely, G. J. and Rizzo, M. L. (2005) Hierarchical Clustering via Joint Between-Within Distances: Extending Ward's Minimum Variance Method, Journal of Classification 22(2) 151-183. http://dx.doi.org/10.1007/s00357-005-0012-9

Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal Distributions in High Dimension, InterStat, November (5).

Szekely, G. J. (2000) Technical Report 03-05: $\mathcal{E}$-statistics: Energy of Statistical Samples, Department of Mathematics and Statistics, Bowling Green State University.

See Also

edist, ksample.e, eqdist.etest, hclust

Examples

library(energy)    # provides energy.hclust
library(cluster)   # provides the animals data set
data(animals)
plot(energy.hclust(dist(animals)))

data(USArrests)
ecl <- energy.hclust(dist(USArrests))
print(ecl)
plot(ecl)
cutree(ecl, k=3)     # cut the tree into three clusters
cutree(ecl, h=150)   # cut the tree at height 150

## compare performance of e-clustering, Ward's method, group average method
## when sampled populations have equal means: n=200 per group, d=5, two groups
z <- rbind(matrix(rnorm(1000), nrow=200), matrix(rnorm(1000, 0, 5), nrow=200))
g <- c(rep(1, 200), rep(2, 200))
d <- dist(z)
e <- energy.hclust(d)
a <- hclust(d, method="average")
w <- hclust(d^2, method="ward.D")   # "ward" was renamed "ward.D" in R 3.1.0
list("E" = table(cutree(e, k=2) == g), "Ward" = table(cutree(w, k=2) == g),
     "Avg" = table(cutree(a, k=2) == g))