# NOT RUN {
library(udpipe)
data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]
model <- BTM(x, k = 5, alpha = 1, beta = 0.01, iter = 10, trace = TRUE)
model
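## terms() lists the highest-probability tokens of each of the k topics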
terms(model)
scores <- predict(model, newdata = x)
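## Illustrative sketch (not part of the original example): scores contains
## P(topic | document) with one row per doc_id, so the most likely topic
## per document can be derived as follows
head(scores)
topics_per_doc <- apply(scores, 1, which.max)
table(topics_per_doc)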
## Another small run where the first topic is the background word distribution
set.seed(123456)
model <- BTM(x, k = 5, beta = 0.01, iter = 10, background = TRUE)
model
terms(model)
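## Illustrative sketch: with background = TRUE the first topic is smoothed
## towards the background word distribution, so its top terms can be
## inspected separately (top_n is an argument of terms(), 10 shown here)
terms(model, top_n = 10)[[1]]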
##
## You can also provide your own set of biterms to cluster upon
## Example: cluster nouns and adjectives in the neighbourhood of one another
##
library(data.table)
library(udpipe)
x <- subset(brussels_reviews_anno, language == "nl")
x <- head(x, 5500) # take a sample to speed things up on CRAN
biterms <- as.data.table(x)
biterms <- biterms[, cooccurrence(x = lemma,
                                  relevant = xpos %in% c("NN", "NNP", "NNS", "JJ"),
                                  skipgram = 2),
                   by = list(doc_id)]
head(biterms)
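## The biterms passed to BTM() should contain at least the columns doc_id,
## term1 and term2; the cooccurrence() output above also carries a cooc count
str(biterms)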
set.seed(123456)
x <- subset(x, xpos %in% c("NN", "NNP", "NNS", "JJ"))
x <- x[, c("doc_id", "lemma")]
model <- BTM(x, k = 5, beta = 0.01, iter = 10, background = TRUE,
             biterms = biterms, trace = 10, detailed = TRUE)
model
terms(model)
bitermset <- terms(model, "biterms")
head(bitermset$biterms, 100)
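## the number of biterms used by the model should match the total
## co-occurrence count of the biterms which were provided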
bitermset$n
sum(biterms$cooc)
# }
# NOT RUN {
##
## Visualisation either using the textplot or the LDAvis package
##
library(textplot)
library(ggraph)
library(concaveman)
plot(model, top_n = 4)
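## plot() on a BTM model (via the textplot package) draws the most relevant
## term-term links of each topic as a network; ggraph and concaveman are
## loaded for the graph layout and the hulls around each topic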
library(LDAvis)
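## LDAvis needs, for each document, its length and its topic distribution,
## both aligned on the same doc_id order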
docsize <- table(x$doc_id)
scores <- predict(model, x)
scores <- scores[names(docsize), ]
json <- createJSON(
  phi = t(model$phi),
  theta = scores,
  doc.length = as.integer(docsize),
  vocab = model$vocabulary$token,
  term.frequency = model$vocabulary$freq)
serVis(json)
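## serVis opens the interactive LDAvis view in the browser; to write the
## visualisation files to a folder instead, an out.dir can be passed
## (the folder name below is just an illustrative example)
## serVis(json, out.dir = "btm_ldavis", open.browser = FALSE)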
# }