rboTopics: Pairwise RBO Similarities

Description

Calculates the similarity of all pairwise topic combinations using the rank-biased overlap (RBO) Similarity.

Usage

rboTopics(topics, k, p, progress = TRUE, pm.backend, ncpus)

Value

[named list] with entries

sims: [lower triangular named matrix] with all pairwise similarities of the given topics.
wordslimit: [integer] = vocabulary size. See jaccardTopics for original purpose.
wordsconsidered: [integer] = vocabulary size. See jaccardTopics for original purpose.
param: [named list] with parameter type [character(1)] = "RBO Similarity", k [integer(1)] and p [0,1]. See above for explanation.

Arguments

topics: [named matrix]
The counts of vocabularies/words (row wise) in topics (column wise).
k: [integer(1)]
Maximum depth for evaluation. Words down to this rank are considered for the calculation of similarities.
p: [0,1]
Weighting parameter. Lower values emphasizes top ranked words while values that go towards 1 correspond to equal weights for each evaluation depth.
progress: [logical(1)]
Should a nice progress bar be shown? Turning it off, could lead to significantly faster calculation. Default is TRUE. If pm.backend is set, parallelization is done and no progress bar will be shown.
pm.backend: [character(1)]
One of "multicore", "socket" or "mpi". If pm.backend is set, parallelStart is called before computation is started and parallelStop is called after.
ncpus: [integer(1)]
Number of (physical) CPUs to use. If pm.backend is passed, default is determined by availableCores.

Details

The RBO Similarity for two topics $\bm z_{i}$ and $\bm z_{j}$ is calculated by $$RBO(\bm z_{i}, \bm z_{j} \mid k, p) = 2p^k\frac{\left|Z_{i}^{(k)} \cap Z_{j}^{(k)}\right|}{\left|Z_{i}^{(k)}\right| + \left|Z_{j}^{(k)}\right|} + \frac{1-p}{p} \sum_{d=1}^k 2 p^d\frac{\left|Z_{i}^{(d)} \cap Z_{j}^{(d)}\right|}{\left|Z_{i}^{(d)}\right| + \left|Z_{j}^{(d)}\right|}$$ with $Z_{i}^{(d)}$ is the vocabulary set of topic $\bm z_{i}$ down to rank $d$. Ties in ranks are resolved by taking the minimum.

The value wordsconsidered describes the number of words per topic ranked at rank $k$ or above.

References

Webber, William, Alistair Moffat and Justin Zobel (2010). "A similarity measure for indefinite rankings". In: ACM Transations on Information Systems 28(4), p.20:1–-20:38, tools:::Rd_expr_doi("10.1145/1852102.1852106").

Examples

Run this code

res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
rbo = rboTopics(topics, k = 12, p = 0.9)
rbo

sim = getSimilarity(rbo)
dim(sim)

Run the code above in your browser using DataLab