Let \(x\) and \(y\) represent two partitions of a set of \(n\) elements into \(K\) and \(L\) nonempty and pairwise disjoint subsets, respectively; e.g., two clusterings of a dataset with \(n\) observations represented as label vectors.
These functions quantify the similarity between \(x\) and \(y\). They can be used as external cluster validity measures, i.e., when a reference (ground-truth) partition is available.
adjusted_rand_score(x, y = NULL)
rand_score(x, y = NULL)
adjusted_fm_score(x, y = NULL)
fm_score(x, y = NULL)
mi_score(x, y = NULL)
normalized_mi_score(x, y = NULL)
adjusted_mi_score(x, y = NULL)
normalized_accuracy(x, y = NULL)
pair_sets_index(x, y = NULL)
x: an integer vector of length \(n\) (or an object coercible to one) representing a \(K\)-partition of an \(n\)-set, or a confusion matrix with \(K\) rows and \(L\) columns (see table(x, y)).

y: an integer vector of length \(n\) (or an object coercible to one) representing an \(L\)-partition of the same set, or NULL (if x is a \(K \times L\) confusion matrix).
Each function returns a single real value: the similarity score.
Every index except mi_score() (which computes the mutual information score) outputs 1 given two identical partitions. Note that partitions are always defined up to a bijection of the set of possible labels; e.g., (1, 1, 2, 1) and (4, 4, 2, 4) represent the same 2-partition.
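A quick check of this label-bijection invariance (assuming the genieclust package, which these signatures match, is attached; the confusion-matrix calling form is used here):

x <- c(1, 1, 2, 1)
y <- c(4, 4, 2, 4)  # the same grouping under a relabelling
adjusted_rand_score(table(x, y))  # 1
rand_score(table(x, y))           # 1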
rand_score() gives the Rand score (the "probability" of agreement between the two partitions), and adjusted_rand_score() is its version corrected for chance; see (Hubert, Arabie, 1985). The latter's expected value is 0.0 given two independent partitions. Due to the adjustment, the resulting index may be negative for some inputs.
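The correction follows the general (index - expected)/(maximum - expected) scheme. A minimal sketch of the adjusted Rand score computed directly from a confusion matrix, following Hubert and Arabie's Eqs. (2) and (4) (the helper name is ours, not part of the package API):

ari_from_confusion <- function(C) {
    n <- sum(C)
    pairs_joint <- sum(choose(C, 2))          # pairs co-grouped in both partitions
    pairs_x <- sum(choose(rowSums(C), 2))     # pairs co-grouped in x
    pairs_y <- sum(choose(colSums(C), 2))     # pairs co-grouped in y
    expected <- pairs_x * pairs_y / choose(n, 2)
    (pairs_joint - expected) / ((pairs_x + pairs_y) / 2 - expected)
}
ari_from_confusion(table(c(1, 1, 2, 2), c(1, 2, 2, 2)))  # 0 here: agreement at chance level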
Similarly, fm_score() gives the Fowlkes-Mallows (FM) score and adjusted_fm_score() is its adjusted-for-chance version; see (Hubert, Arabie, 1985). Note that both the (unadjusted) Rand and FM scores are bounded from below by \(1/(K+1)\) if \(K=L\); hence, their adjusted versions are preferred.
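For reference, the unadjusted FM score is the geometric mean of the two pair-based "precision" and "recall" ratios; a minimal sketch from a confusion matrix (again, the helper name is ours):

fm_from_confusion <- function(C) {
    pairs_joint <- sum(choose(C, 2))       # pairs co-grouped in both partitions
    pairs_x <- sum(choose(rowSums(C), 2))  # pairs co-grouped in x
    pairs_y <- sum(choose(colSums(C), 2))  # pairs co-grouped in y
    pairs_joint / sqrt(pairs_x * pairs_y)
}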
mi_score(), adjusted_mi_score(), and normalized_mi_score() are information-theoretic scores based on mutual information; see the definitions of \(AMI_{sum}\) and \(NMI_{sum}\) in (Vinh et al., 2010).
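A minimal sketch of the unadjusted and \(NMI_{sum}\)-normalised quantities computed from a confusion matrix (helper names are ours; the package's implementation may differ in details):

mi_from_confusion <- function(C) {
    P <- C / sum(C)  # empirical joint distribution of the two labelings
    nz <- P > 0      # 0 * log(0) is taken as 0
    sum(P[nz] * log((P / outer(rowSums(P), colSums(P)))[nz]))
}
entropy <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }
nmi_sum_from_confusion <- function(C) {
    P <- C / sum(C)
    # NMI_sum divides MI by the average of the two marginal entropies:
    mi_from_confusion(C) / ((entropy(rowSums(P)) + entropy(colSums(P))) / 2)
}

adjusted_mi_score() additionally subtracts the expected mutual information under random labelings (see Vinh et al., 2010; omitted here for brevity).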
normalized_accuracy() is defined as \((Accuracy(C_\sigma)-1/L)/(1-1/L)\), where \(C_\sigma\) is a version of the confusion matrix for the given \(x\) and \(y\), \(K \leq L\), with columns permuted based on the solution to the maximal Linear Sum Assignment Problem. \(Accuracy(C_\sigma)\) is sometimes referred to as Purity, e.g., in (Rendon et al., 2011).
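A sketch of this definition, with the assignment step delegated to clue::solve_LSAP() (the use of the clue package is an assumption on our part; the actual implementation may solve the assignment problem differently):

normalized_accuracy_sketch <- function(C) {
    C <- as.matrix(C)
    stopifnot(nrow(C) <= ncol(C))                 # K <= L
    sigma <- clue::solve_LSAP(C, maximum = TRUE)  # maximal linear sum assignment
    acc <- sum(C[cbind(seq_len(nrow(C)), as.integer(sigma))]) / sum(C)
    L <- ncol(C)
    (acc - 1 / L) / (1 - 1 / L)
}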
pair_sets_index() gives the Pair Sets Index (PSI) adjusted for chance (Rezaei, Franti, 2016), \(K \leq L\). Pairing is based on the solution to the Linear Sum Assignment Problem for a transformed version of the confusion matrix.
Hubert L., Arabie P., Comparing Partitions, Journal of Classification 2(1), 1985, 193-218, esp. Eqs. (2) and (4).
Rendon E., Abundez I., Arizmendi A., Quiroz E.M., Internal versus external cluster validation indexes, International Journal of Computers and Communications 5(1), 2011, 27-34.
Rezaei M., Franti P., Set matching measures for external cluster validity, IEEE Transactions on Knowledge and Data Engineering 28(8), 2016, 2173-2186.
Vinh N.X., Epps J., Bailey J., Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance, Journal of Machine Learning Research 11, 2010, 2837-2854.
# Assuming these functions come from the genieclust package:
library("genieclust")
y_true <- iris[[5]]  # reference (ground-truth) labels
y_pred <- kmeans(as.matrix(iris[1:4]), 3)$cluster  # labels discovered by k-means
adjusted_rand_score(y_true, y_pred)
rand_score(table(y_true, y_pred))  # equivalent: a confusion matrix may be passed instead of two label vectors
adjusted_fm_score(y_true, y_pred)
fm_score(y_true, y_pred)
mi_score(y_true, y_pred)
normalized_mi_score(y_true, y_pred)
adjusted_mi_score(y_true, y_pred)
normalized_accuracy(y_true, y_pred)
pair_sets_index(y_true, y_pred)