phyloregion (version 1.0.2)

optimal_phyloregion: Determine optimal number of clusters

Description

This function divides the hierarchical dendrogram into meaningful clusters ("phyloregions"), based on the ‘elbow’ or ‘knee’ of an evaluation graph that corresponds to the point of optimal curvature.

Usage

optimal_phyloregion(x, method = "average", k = 20)

Arguments

x

a numeric matrix, data frame or “dist” object.

method

the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of “ward.D”, “ward.D2”, “single”, “complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC).

k

numeric, the maximum number of clusters to evaluate. Defaults to 20, or to the number of observations if that is smaller.

Value

a list containing the following as returned from the GMD package (Zhao et al. 2011):

  • k: optimal number of clusters (bioregions)

  • totbss: total between-cluster sum of squares

  • tss: total sum of squares of the data

  • ev: explained variance given k
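The quantities listed above can be illustrated with a minimal base R sketch. It is not the phyloregion or GMD implementation: the toy data, the helper names `ss_from_dist` and `explained_variance`, and the chord-based elbow rule are all assumptions made for illustration. Sums of squares are derived from pairwise distances, which matches the centroid-based definition only for Euclidean distances.

```r
# Sketch only: base R, not the phyloregion/GMD implementation.
set.seed(1)

# Hypothetical toy data: three well-separated groups of 2-D points.
pts <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2)
)
d  <- dist(pts)
hc <- hclust(d, method = "average")

# Sum of squares of a group, computed from pairwise distances:
# (sum over pairs i < j of d_ij^2) / n
ss_from_dist <- function(dm, idx) {
  sub <- dm[idx, idx, drop = FALSE]
  sum(sub[upper.tri(sub)]^2) / length(idx)
}

# Explained variance (ev = totbss / tss) when the tree is cut into k clusters.
explained_variance <- function(d, hc, k) {
  dm     <- as.matrix(d)
  groups <- cutree(hc, k = k)
  tss    <- ss_from_dist(dm, seq_len(nrow(dm)))
  wss    <- sum(vapply(split(seq_along(groups), groups),
                       function(idx) ss_from_dist(dm, idx), numeric(1)))
  totbss <- tss - wss
  c(k = k, totbss = totbss, tss = tss, ev = totbss / tss)
}

ks  <- 1:10
res <- as.data.frame(t(sapply(ks, function(k) explained_variance(d, hc, k))))

# Assumed elbow heuristic: after rescaling both axes to [0, 1], pick the k
# whose point lies farthest above the chord joining the first and last points.
x <- (res$k  - min(res$k))  / diff(range(res$k))
y <- (res$ev - min(res$ev)) / diff(range(res$ev))
elbow_k <- res$k[which.max(y - x)]

print(res)
print(elbow_k)
```

The chord rule is only one of several ways to locate a knee; the package may use a different criterion, so `elbow_k` here need not match the value returned by `optimal_phyloregion` on the same data.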

References

Salvador, S. & Chan, P. (2004) Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. Proceedings of the Sixteenth IEEE International Conference on Tools with Artificial Intelligence, pp. 576–584. Institute of Electrical and Electronics Engineers, Piscataway, New Jersey, USA.

Zhao, X., Valen, E., Parker, B.J. & Sandelin, A. (2011) Systematic clustering of transcription start site landscapes. PLoS ONE 6: e23409.

Examples

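# NOT RUN {
library(phyloregion)

# Load the example data set and compute beta diversity from the community matrix
data(africa)
bc <- beta_diss(africa$comm)

# Find the optimal number of clusters from the first dissimilarity matrix
(d <- optimal_phyloregion(bc[[1]]))

# Plot explained variance against the number of clusters, in order of k
ord <- order(d$df$k)
plot(d$df$k, d$df$ev, ylab = "Explained variance",
  xlab = "Number of clusters")
lines(d$df$k[ord], d$df$ev[ord])

# Highlight the optimal k with a red marker and a vertical drop line
points(d$optimal$k, d$optimal$ev, pch = 21, bg = "red", cex = 3)
points(d$optimal$k, d$optimal$ev, pch = 21, bg = "red", type = "h")
# }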
