salso (version 0.2.5)

partition.loss: Compute Partition Loss or the Expectation of Partition Loss

Description

Given two partitions \(\pi^*\) and \(\pi\), these functions compute the specified loss when using \(\pi^*\) to estimate \(\pi\). Smaller loss values indicate higher concordance between the partitions. These functions can also compute a Monte Carlo estimate of the expectation of the specified loss based on samples from a partition distribution, as well as approximations to the expectation of several losses based on a pairwise similarity matrix. Supported criteria are described below. Some criteria require only the pairwise similarity matrix (as computed, for example, by psm), whereas others require samples from a partition distribution. For criteria that need only the pairwise similarity matrix, posterior samples can still be provided in the x argument, and the pairwise similarity matrix will be computed automatically as needed.
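
For instance, a minimal sketch of the two input modes, using the iris.clusterings data from this package (as in the examples below):

draws <- iris.clusterings                           # posterior samples: one partition per row
est <- draws[1, ]                                   # a candidate estimate
partition.loss(est, draws, loss="binder")           # expectation estimated from the samples
partition.loss(est, psm(draws), loss="binder.psm")  # needs only the pairwise similarity matrix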

Usage

partition.loss(partitions, x, loss = VI())

binder(partitions, x, a = 1)

RI(partition1, partition2)

omARI(partitions, x)

omARI.approx(partitions, x)

ARI(partition1, partition2)

VI(partitions, x, a = 1)

VI.lb(partitions, x)

NVI(partitions, x)

ID(partitions, x)

NID(partitions, x)

Arguments

partitions

An integer matrix with \(n\) columns, where each row is a partition of \(n\) items given as cluster labels. Two items are in the same subset (i.e., cluster) if their labels are equal. Alternatively, a vector of length \(n\), which will be converted to a \(1\)-by-\(n\) matrix.

x

A \(B\)-by-\(n\) matrix, where each of the \(B\) rows represents a clustering of \(n\) items using cluster labels. For the \(b\)th clustering, items \(i\) and \(j\) are in the same cluster if x[b,i] == x[b,j].

loss

The loss function to use, as indicated by "binder", "omARI", "VI", "NVI", "ID", "NID", or the result of calling a function with these names. Also supported are "binder.psm", "VI.lb", "omARI.approx", or the result of calling a function with these names, in which case x above can optionally be a pairwise similarity matrix, i.e., \(n\)-by-\(n\) symmetric matrix whose \((i,j)\) element gives the (estimated) probability that items \(i\) and \(j\) are in the same subset (i.e., cluster) of a partition (i.e., clustering).

a

(Only used for Binder and VI loss.) Without loss of generality, the cost under Binder loss of placing two items in the same cluster when in truth they belong to separate clusters is fixed at one. The argument a is either:

  • a nonnegative scalar giving the cost of the complementary mistake, i.e., placing two items in separate clusters when in truth they belong to the same cluster, or

  • a list containing the desired number of clusters ("nClusters") and an upper bound ("upper") when searching for an a that yields the desired number of clusters.

In the second case, the desired number of clusters may not be achievable without modifying maxSize in the salso function. To increase the probability of obtaining exactly the desired number of clusters, nRuns in the salso function may need to be increased.
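
As a hypothetical illustration of the two forms (the values 0.5, 3, and 2 below are made up; nClusters and upper follow the description above):

salso(iris.clusterings, loss=VI(a=0.5))                         # scalar: cost of the first mistake
salso(iris.clusterings, loss=VI(a=list(nClusters=3, upper=2)))  # search for an 'a' giving 3 clusters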

partition1

An integer vector of cluster labels for \(n\) items. Two items are in the same subset (i.e., cluster) if their labels are equal.

partition2

An integer vector of cluster labels having the same length as partition1.

Value

A numeric vector, with one element for each partition evaluated.

Details

The partition estimation criterion can be specified using the loss argument, which is either a string or the result of calling the associated function. The supported losses are described below:

"binder"

Binder. Whereas high values of the Rand index \(R\) between \(\pi^*\) and \(\pi\) correspond to high concordance between the partitions, the N-invariant Binder loss \(L\) for a partition \(\pi^*\) in estimating \(\pi\) is \(L = (1-R)(n-1)/n\), meaning that low values correspond to high concordance between the partitions. This package reports the N-invariant Binder loss. The original Binder loss equals the N-invariant Binder loss multiplied by \(n^2/2\). Only the pairwise similarity matrix is required for this loss, but samples can be provided. As originally discussed by Binder (1978), two mistakes are possible: 1. Placing two items in separate clusters when in truth they belong to the same cluster, and 2. Placing two items in the same cluster when in truth they belong to separate clusters. Without loss of generality, the cost of the second mistake is fixed at one. The default cost of the first mistake is also one, but can be specified with the argument a. For a discussion of general weights, see Dahl, Johnson, and Müller (2020). For a discussion of the equal weights case, see also Dahl (2006), Lau and Green (2007), Dahl and Newton (2007), Fritsch and Ickstadt (2009), and Wade and Ghahramani (2018).
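
For instance, a minimal sketch of the cost asymmetry controlled by a (the label vectors below are made up for illustration):

truth <- c(1, 1, 1, 2, 2)
split <- c(1, 1, 3, 2, 2)   # separates item 3 from its true cluster (the first mistake)
lump  <- c(1, 1, 1, 1, 2)   # places item 4 with items from a separate cluster (the second mistake)
binder(split, truth, a=2)   # each mistaken pair costs a = 2
binder(lump, truth, a=2)    # each mistaken pair costs 1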

"omARI"

One Minus Adjusted Rand Index. Computes the expectation of one minus the adjusted Rand index (Hubert and Arabie, 1985). Whereas high values of the adjusted Rand index between \(\pi^*\) and \(\pi\) correspond to high concordance between the partitions, the loss associated with the adjusted Rand index for a partition \(\pi^*\) in estimating \(\pi\) is one minus the adjusted Rand index between the partitions, meaning that low values correspond to high concordance between the partitions. Samples from a partition distribution are required for this loss. See Fritsch and Ickstadt (2009).

"omARI.approx"

Approximation of One Minus Adjusted Rand Index. Computes the first-order approximation of the expectation of one minus the adjusted Rand index. The adjusted Rand index involves a ratio and the first-order approximation of the expectation is based on \(E(X/Y) \approx E(X)/E(Y)\). Only the pairwise similarity matrix is required for this criterion, but samples can be provided. See Fritsch and Ickstadt (2009).
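
For instance, a minimal sketch comparing the exact and approximate criteria (using the iris.clusterings data, as in the examples below):

est <- iris.clusterings[1, ]
omARI(est, iris.clusterings)              # Monte Carlo expectation from samples
omARI.approx(est, psm(iris.clusterings))  # first-order approximation from the pairwise similarity matrix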

"VI"

Variation of Information. Computes the expectation of the (generalized) variation of information loss. Samples from a partition distribution are required for this loss. See Meilă (2007), Wade and Ghahramani (2018), and Rastelli and Friel (2018). The original variation of information of Meilă (2007) has been extended to the generalized variation of information of Dahl, Johnson, and Müller (2020) to allow for unequal weighting of two possible mistakes: 1. Placing two items in separate clusters when in truth they belong to the same cluster, and 2. Placing two items in the same cluster when in truth they belong to separate clusters. Without loss of generality, the weight for the second mistake is fixed at one. The default weight of the first mistake is also one, but can be specified with the argument a. See Dahl, Johnson, and Müller (2020).

"VI.lb"

Lower Bound of the Variation of Information. Computes the lower bound of the expectation of the variation of information loss, where the lower bound is obtained by Jensen's inequality. Only the pairwise similarity matrix is required for this criterion, but samples can be provided. See Wade and Ghahramani (2018).
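
For instance, a minimal sketch (by Jensen's inequality, the lower bound cannot exceed the expectation):

est <- iris.clusterings[1, ]
VI(est, iris.clusterings)          # expectation of the VI loss from samples
VI.lb(est, psm(iris.clusterings))  # its lower bound from the pairwise similarity matrix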

"NVI"

Normalized Variation of Information. Computes the expectation of the normalized variation of information loss. Samples from a partition distribution are required for this loss. See Vinh, Epps, and Bailey (2010) and Rastelli and Friel (2018).

"ID"

Information Distance. Computes the expectation of the information distance (\(D_{max}\)) loss. Samples from a partition distribution are required for this loss. See Vinh, Epps, and Bailey (2010).

"NID"

Normalized Information Distance. Computes the expectation of the normalized information distance loss. Samples from a partition distribution are required for this loss. See Vinh, Epps, and Bailey (2010) and Rastelli and Friel (2018).
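
For instance, the three information-theoretic criteria above can be computed from the same posterior samples (a minimal sketch using iris.clusterings):

est <- iris.clusterings[1, ]
NVI(est, iris.clusterings)   # expected normalized variation of information
ID(est, iris.clusterings)    # expected information distance
NID(est, iris.clusterings)   # expected normalized information distance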

RI and ARI are convenience functions for comparing two partitions directly. Writing n for the number of items, note that:

  • binder(p1, p2) = ( 1 - RI(p1, p2) )*(n-1)/n

  • omARI(p1, p2) = 1 - ARI(p1, p2)

References

W. M. Rand (1971), Objective Criteria for the Evaluation of Clustering Methods, Journal of the American Statistical Association, 66, 846–850.

D. A. Binder (1978), Bayesian cluster analysis, Biometrika, 65, 31–38.

L. Hubert and P. Arabie (1985), Comparing Partitions, Journal of Classification, 2, 193–218.

D. B. Dahl (2006), Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model, in Bayesian Inference for Gene Expression and Proteomics, Kim-Anh Do, Peter Müller, Marina Vannucci (Eds.), Cambridge University Press.

J. W. Lau and P. J. Green (2007), Bayesian model based clustering procedures, Journal of Computational and Graphical Statistics, 16, 526–558.

M. Meilă (2007), Comparing Clusterings - an Information Based Distance, Journal of Multivariate Analysis, 98, 873–895.

D. B. Dahl and M. A. Newton (2007), Multiple Hypothesis Testing by Clustering Treatment Effects, Journal of the American Statistical Association, 102, 517–526.

A. Fritsch and K. Ickstadt (2009), An improved criterion for clustering based on the posterior similarity matrix, Bayesian Analysis, 4, 367–391.

N. X. Vinh, J. Epps, and J. Bailey (2010), Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance, Journal of Machine Learning Research, 11, 2837–2854.

S. Wade and Z. Ghahramani (2018), Bayesian cluster analysis: Point estimation and credible balls, Bayesian Analysis, 13:2, 559–626.

R. Rastelli and N. Friel (2018), Optimal Bayesian estimators for latent variable cluster models, Statistics and Computing, 28, 1169–1186.

D. B. Dahl, D. J. Johnson, and P. Müller (2020), Search Algorithms and Loss Functions for Bayesian Clustering, in preparation.

See Also

psm, salso, dlso

Examples

# For examples, use 'nCores=1' per CRAN rules, but in practice omit this.
probs <- psm(iris.clusterings, nCores=1)
partitions <- iris.clusterings[1:5,]

# Each convenience function agrees with partition.loss using the corresponding loss object.
all.equal(partition.loss(partitions, probs, loss=binder(a=2)), binder(partitions, probs, a=2))
all.equal(partition.loss(partitions, partitions, loss=omARI()), omARI(partitions, partitions))
all.equal(partition.loss(partitions, partitions, loss=VI(a=0.8)), VI(partitions, partitions, a=0.8))

p1 <- iris.clusterings[1,]
p2 <- iris.clusterings[2,]

VI(p1, p2, a=1.0)  # generalized variation of information between two partitions
# Verify the identities stated in the Details section.
all.equal(binder(p1, p2), ( 1 - RI(p1, p2) ) * (length(p1)-1) / length(p1))
all.equal(omARI(p1, p2), 1 - ARI(p1, p2))
