
seededlda (version 0.8.4)

predict.textmodel_lda: Prediction method for textmodel_lda

Description

Predicts the topics of documents with a fitted LDA model. Prediction is performed by Gibbs sampling, with words allocated to the topics of the fitted LDA. The result can differ from topics() even for the same documents because predict() runs additional iterations.
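For instance, a minimal sketch (assuming a model lda fitted as in the Examples below; the two results are expected to be of the same form but need not agree exactly):

# topics() reports the allocation from the fitted model itself,
# while predict() resamples with additional Gibbs iterations
head(topics(lda))
head(predict(lda))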

Usage

# S3 method for textmodel_lda
predict(
  object,
  newdata = NULL,
  max_iter = 2000,
  verbose = quanteda_options("verbose"),
  ...
)

Value

textmodel_seededlda() and textmodel_lda() return a list of model parameters: theta is the distribution of topics over documents; phi is the distribution of words over topics. alpha and beta are the small constants added to the frequency of words to estimate theta and phi, respectively, in Gibbs sampling. Other elements in the list are subject to change.
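A quick way to inspect these elements (a sketch assuming a fitted model lda as in the Examples below):

# theta: documents x topics; phi: topics x words
dim(lda$theta)
dim(lda$phi)
lda$alpha
lda$beta
rowSums(lda$theta)[1:3]  # each row of theta should sum to 1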

Arguments

object

a fitted LDA textmodel

newdata

dfm on which prediction should be made

max_iter

the maximum number of iterations in Gibbs sampling; see the usage sketch below the argument descriptions.

verbose

logical; if TRUE, print diagnostic information during prediction.

...

not used
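A minimal usage sketch tying the arguments together (dfmt_new is a hypothetical dfm of new documents; lda is a fitted model as in the Examples below):

# predict topics of out-of-sample documents, capping Gibbs sampling at 500 iterations
pred <- predict(lda, newdata = dfmt_new, max_iter = 500, verbose = TRUE)
head(pred)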

Details

To predict topics of new documents (i.e. out-of-sample), first, create a new LDA model from an existing LDA model passed to model in textmodel_lda(); second, apply topics() to the new model. The model argument takes objects created either by textmodel_lda() or textmodel_seededlda().
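In outline, the out-of-sample workflow looks like this (a sketch assuming lda was fitted earlier and dfmt_new is a hypothetical dfm of unseen documents; the Examples below show the same pattern):

# step 1: create a new model seeded with the word-topic allocations of the fitted one
lda_new <- textmodel_lda(dfmt_new, model = lda)
# step 2: read off the predicted topics of the new documents
topics(lda_new)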

References

Lu, Bin et al. (2011). "Multi-aspect Sentiment Analysis with Topic Models". Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops. doi:10.5555/2117693.2119585.

Watanabe, Kohei & Zhou, Yuan (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". Social Science Computer Review. doi:10.1177/0894439320907027.

See Also

topicmodels

Examples

# \donttest{
require(seededlda)
require(quanteda)

corp <- head(data_corpus_moviereviews, 500)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks) %>%
    dfm_remove(stopwords('en'), min_nchar = 2) %>%
    dfm_trim(min_termfreq = 0.90, termfreq_type = "quantile",
             max_docfreq = 0.1, docfreq_type = "prop")

# unsupervised LDA
lda <- textmodel_lda(head(dfmt, 450), 6)
terms(lda)
topics(lda)
lda2 <- textmodel_lda(tail(dfmt, 50), model = lda) # new documents
topics(lda2)

# semisupervised LDA
dict <- dictionary(list(people = c("family", "couple", "kids"),
                        space = c("alien", "planet", "space"),
                        monster = c("monster*", "ghost*", "zombie*"),
                        war = c("war", "soldier*", "tanks"),
                        crime = c("crime*", "murder", "killer")))
slda <- textmodel_seededlda(dfmt, dict, residual = TRUE, min_termfreq = 10)
terms(slda)
topics(slda)
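
# predict() on held-out documents (added sketch for illustration; the result
# is expected to be comparable to topics() but based on additional iterations)
pred <- predict(lda, newdata = tail(dfmt, 50))
head(pred)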

# }
