Clustering entropy addresses the question "how much information is contained
in the splits within a tree?" Its approach is complementary to the
phylogenetic information content used in SplitwiseInfo().
In essence, it asks: given a split that subdivides the leaves of a tree into
two partitions, how easy is it to predict which partition a randomly drawn
leaf belongs to?
Formally, the entropy of a split S that divides n leaves into two
partitions of sizes a and b is given by

H(S) = -(a/n) log(a/n) - (b/n) log(b/n).

Base 2 logarithms are conventionally used, such that entropy is measured in
bits.
Entropy denotes the number of bits necessary to encode the outcome
of a random variable: here, the random variable is "which partition does a
randomly selected leaf belong to?"
An even split has an entropy of 1 bit: there is no better way of encoding
an outcome than using one bit to specify which of the two partitions the
randomly selected leaf belongs to.
An uneven split has a lower entropy: membership of the larger partition is
common, and thus less surprising; it can be signified using fewer bits in an
optimal compression system.
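As a concrete check, here is a minimal base-R sketch of the formula above;
the SplitEntropy() helper is illustrative only, not part of any package:

# Entropy, in bits, of a split dividing n = a + b leaves
# into partitions of sizes a and b
SplitEntropy <- function(a, b) {
  n <- a + b
  p <- c(a, b) / n
  # Convention: 0 * log2(0) = 0, so drop zero probabilities
  p <- p[p > 0]
  -sum(p * log2(p))
}

SplitEntropy(8, 8)  # Even split: 1 bit
SplitEntropy(2, 14) # Uneven split: ~0.544 bits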
If this sounds confusing, let's consider creating a code to transmit the
cluster labels of two randomly selected leaves. One straightforward
option would be to use

00 = 'Both leaves belong to partition A'
11 = 'Both leaves belong to partition B'
01 = 'First leaf in A, second in B'
10 = 'First leaf in B, second in A'
This code uses two bits to transmit the partition labels of two leaves.
If partitions A and B are equiprobable, this is the optimal code; our
entropy -- the average information content required per leaf -- is 1 bit.
Alternatively, we could use the (suboptimal) code

0   = 'Both leaves belong to partition A'
111 = 'Both leaves belong to partition B'
101 = 'First leaf in A, second in B'
110 = 'First leaf in B, second in A'
If A is much larger than B, then most pairs of leaves will require just
a single bit (code 0). The additional bits needed when one or more leaves
belong to B may be required sufficiently rarely that the average message
requires fewer than two bits for two leaves, so the entropy is less than
1 bit. (The optimal coding strategy will depend on the exact sizes
of A and B.)
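To make this concrete, suppose, purely for illustration, that 90% of leaves
belong to partition A. The following base-R sketch compares the expected
message length under the suboptimal code above with the entropy:

pA <- 0.9 # Assumed (illustrative) probability that a leaf is in A
pB <- 1 - pA

# Probabilities and code lengths for the four two-leaf outcomes:
# AA -> "0" (1 bit); BB -> "111", AB -> "101", BA -> "110" (3 bits each)
outcome.p <- c(AA = pA * pA, BB = pB * pB, AB = pA * pB, BA = pB * pA)
code.len  <- c(AA = 1, BB = 3, AB = 3, BA = 3)

sum(outcome.p * code.len) / 2    # ~0.69 bits per leaf with this code
-(pA * log2(pA) + pB * log2(pB)) # Entropy: ~0.47 bits per leaf

The code thus transmits two leaves' labels in well under two bits on
average, and the entropy bounds how far such savings can go.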
As entropy measures the bits required to transmit the cluster label of each
leaf (Vinh et al. 2010: p. 2840), the information content of a split is its
entropy multiplied by the number of leaves.
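As a worked example, reusing the illustrative SplitEntropy() helper sketched
above, a split dividing 16 leaves 2:14 carries:

a <- 2; b <- 14
n <- a + b
h <- SplitEntropy(a, b) # ~0.544 bits per leaf
h * n                   # ~8.7 bits: the split's information content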