Clustering entropy addresses the question "how much information is contained
in the splits within a tree?" Its approach is complementary to the
phylogenetic information content used in SplitwiseInfo().
In essence, it asks: given a split that subdivides the leaves of a tree into
two partitions, how easy is it to predict which partition a randomly drawn
leaf belongs to?
Formally, the entropy of a split S that divides n leaves into two
partitions of sizes a and b is given by

H(S) = -(a/n) log(a/n) - (b/n) log(b/n).

Base 2 logarithms are conventionally used, such that entropy is measured in
bits.
Entropy denotes the number of bits necessary to encode the outcome
of a random variable: here, the random variable is "which partition does a
randomly selected leaf belong to?"
An even split has an entropy of 1 bit: there is no better way of encoding
an outcome than using one bit to specify which of the two partitions the
randomly selected leaf belongs to.
An uneven split has a lower entropy: membership of the larger partition is
common, and thus less surprising; it can be signified using fewer bits in an
optimal compression system.
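As a concrete check, here is a minimal base-R sketch of the formula above;
the SplitEntropy() helper is illustrative only, not part of any package:

# Entropy, in bits, of a split dividing n = a + b leaves
# into partitions of sizes a and b
SplitEntropy <- function(a, b) {
  n <- a + b
  p <- c(a, b) / n
  # Convention: 0 * log2(0) = 0, so drop zero probabilities
  p <- p[p > 0]
  -sum(p * log2(p))
}

SplitEntropy(8, 8)  # Even split: 1 bit
SplitEntropy(2, 14) # Uneven split: ~0.544 bits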
If this sounds confusing, let's consider creating a code to transmit the
cluster labels of two randomly selected leaves. One straightforward
option would be to use

00 = 'Both leaves belong to partition A'
11 = 'Both leaves belong to partition B'
01 = 'First leaf in A, second in B'
10 = 'First leaf in B, second in A'
This code uses two bits to transmit the partition labels of two leaves.
If partitions A and B are equiprobable, this is the optimal code; our
entropy -- the average information content required per leaf -- is 1 bit.
Alternatively, we could use the (suboptimal) code

0   = 'Both leaves belong to partition A'
111 = 'Both leaves belong to partition B'
101 = 'First leaf in A, second in B'
110 = 'First leaf in B, second in A'
If A is much larger than B, then most pairs of leaves will require just
a single bit (code 0). The additional bits needed when one or more leaves
belong to B may be required sufficiently rarely that the average message
requires fewer than two bits for two leaves, so the entropy is less than
1 bit. (The optimal coding strategy will depend on the exact sizes
of A and B.)
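To make this concrete, suppose, purely for illustration, that 90% of leaves
belong to partition A. The following base-R sketch compares the expected
message length under the suboptimal code above with the entropy:

pA <- 0.9 # Assumed (illustrative) probability that a leaf is in A
pB <- 1 - pA

# Probabilities and code lengths for the four two-leaf outcomes:
# AA -> "0" (1 bit); BB -> "111", AB -> "101", BA -> "110" (3 bits each)
outcome.p <- c(AA = pA * pA, BB = pB * pB, AB = pA * pB, BA = pB * pA)
code.len  <- c(AA = 1, BB = 3, AB = 3, BA = 3)

sum(outcome.p * code.len) / 2    # ~0.69 bits per leaf with this code
-(pA * log2(pA) + pB * log2(pB)) # Entropy: ~0.47 bits per leaf

The code thus transmits two leaves' labels in well under two bits on
average, and the entropy bounds how far such savings can go.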
As entropy measures the bits required to transmit the cluster label of each
leaf (Vinh et al. 2010: p. 2840), the information content of a split is its
entropy multiplied by the number of leaves.
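As a worked example, reusing the illustrative SplitEntropy() helper sketched
above, a split dividing 16 leaves 2:14 carries:

a <- 2; b <- 14
n <- a + b
h <- SplitEntropy(a, b) # ~0.544 bits per leaf
h * n                   # ~8.7 bits: the split's information content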