dendextend (version 0.14.2)

Bk: Calculating the Fowlkes-Mallows Index for two dendrograms

Description

Bk calculates the Fowlkes-Mallows index between two dendrograms for a series of k cuts.

Usage

Bk(tree1, tree2, k, include_EV = TRUE, warn = TRUE, ...)

Arguments

tree1
a dendrogram/hclust/phylo object.
tree2
a dendrogram/hclust/phylo object.
k
an integer scalar or vector with the desired number of clusters. If missing, Bk is calculated for the default range k = 2:(nleaves-1). There is no point in checking k=1 or k=n, since both give Bk=1.
include_EV
logical (TRUE). Should the expectation and variance of the FM index be calculated under the null hypothesis of no relation between the clusterings? If TRUE (default), the FM_index_R function is used; else (FALSE) we
warn
logical (TRUE). Should a warning be issued in case of problems? If set to TRUE, extra checks are made to verify that the two trees have the same size and the same labels.
...
Ignored (passed to FM_index_R/FM_index_profdpm).
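
The size and label checks described for `warn` can be pictured with a small base R sketch. The helper name `same_leaves_sketch` is hypothetical and is not part of dendextend; it only illustrates the kind of comparison involved, assuming both inputs are hclust objects.

```r
# Hypothetical helper (NOT a dendextend function): do two hclust objects
# have the same number of leaves and the same set of labels?
same_leaves_sketch <- function(hc1, hc2) {
  n1 <- length(hc1$order)
  n2 <- length(hc2$order)
  if (n1 != n2) return(FALSE)          # different tree sizes
  l1 <- hc1$labels
  l2 <- hc2$labels
  # With no labels on either tree, only the sizes can be compared.
  if (is.null(l1) || is.null(l2)) return(TRUE)
  setequal(l1, l2)                     # same labels, in any order
}

hc_full  <- hclust(dist(iris[, -5]), "complete")
hc_other <- hclust(dist(iris[, -5]), "single")
hc_small <- hclust(dist(iris[1:20, -5]), "complete")

same_leaves_sketch(hc_full, hc_other)  # same data, different linkage
same_leaves_sketch(hc_full, hc_small)  # different number of leaves
```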

Value

  • A list (with one item per value of k) of the Fowlkes-Mallows index between the two dendrograms, for a scalar/vector of k values. Each item is named after the k for which it was calculated.

Details

From Wikipedia: the Fowlkes-Mallows index (see references) is an external evaluation method that is used to determine the similarity between two clusterings (clusters obtained after a clustering algorithm). This measure of similarity could be either between two hierarchical clusterings or between a clustering and a benchmark classification. A higher value for the Fowlkes-Mallows index indicates a greater similarity between the clusters and the benchmark classifications.
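
To make the definition concrete, here is a minimal base R sketch of the index for a single k, following the formula in Fowlkes and Mallows (1983): with m the k-by-k table of co-memberships and n the number of leaves, Bk = Tk / sqrt(Pk * Qk), where Tk = sum(m^2) - n, Pk = sum(rowSums(m)^2) - n and Qk = sum(colSums(m)^2) - n. The function name `fm_index_sketch` is hypothetical; this is not the code dendextend uses internally, and it does not compute the expectation or variance.

```r
# Hypothetical helper (NOT the dendextend implementation): Fowlkes-Mallows
# index between two hclust objects cut into k clusters each.
fm_index_sketch <- function(hc1, hc2, k) {
  cl1 <- cutree(hc1, k = k)
  cl2 <- cutree(hc2, k = k)
  # Align the second clustering to the first by leaf label, when labels exist.
  if (!is.null(names(cl1)) && !is.null(names(cl2))) {
    cl2 <- cl2[names(cl1)]
  }
  n <- length(cl1)
  m <- table(cl1, cl2)             # contingency table of cluster memberships
  Tk <- sum(m^2) - n
  Pk <- sum(rowSums(m)^2) - n
  Qk <- sum(colSums(m)^2) - n
  Tk / sqrt(Pk * Qk)
}

hc_a <- hclust(dist(iris[, -5]), "complete")
hc_b <- hclust(dist(iris[, -5]), "single")

fm_index_sketch(hc_a, hc_a, k = 3)  # a tree compared with itself should give 1
fm_index_sketch(hc_a, hc_b, k = 3)  # complete vs. single linkage
```

The sketch assumes 1 < k < n; at k = n every cluster is a singleton and the ratio above is 0/0, which is one reason the default range for k stops short of the extremes.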

References

Fowlkes, E. B.; Mallows, C. L. (1 September 1983). "A Method for Comparing Two Hierarchical Clusterings". Journal of the American Statistical Association 78 (383): 553. http://en.wikipedia.org/wiki/Fowlkes-Mallows_index

See Also

FM_index, cor_bakers_gamma

Examples

set.seed(23235)
ss <- TRUE # use all rows; for a faster run try: sample(1:150, 10)
hc1 <- hclust(dist(iris[ss,-5]), "complete")
hc2 <- hclust(dist(iris[ss,-5]), "single")
tree1 <- as.dendrogram(hc1)
tree2 <- as.dendrogram(hc2)
#    cutree(tree1)

Bk(hc1, hc2, k = 3)
Bk(hc1, hc2, k = 2:10)
Bk(hc1, hc2)

Bk(tree1, tree2, k = 3)
Bk(tree1, tree2, k = 2:5)

system.time(Bk(hc1, hc2, k = 2:5)) # 0.01
system.time(Bk(hc1, hc2)) # 1.28
system.time(Bk(tree1, tree2, k = 2:5)) # 0.24 # after fixes.
system.time(Bk(tree1, tree2, k = 2:10)) # 0.31 # after fixes.
system.time(Bk(tree1, tree2)) # 7.85
Bk(tree1, tree2, k= 99:101)

y <- Bk(hc1, hc2, k = 2:10)
plot(unlist(y)~c(2:10), type = "b", ylim = c(0,1))

# can take a few seconds
y <- Bk(hc1, hc2)
plot(unlist(y)~as.numeric(names(y)),
     main = "Bk plot", pch = 20,
     xlab = "k", ylab = "FM Index",
     type = "b", ylim = c(0,1))
# we are still missing some hypothesis testing here.
# for this we'll have the Bk_plot function.