seededlda (version 0.8.4)

textmodel_lda: Semisupervised Latent Dirichlet allocation

Description

textmodel_seededlda() implements semisupervised Latent Dirichlet allocation (seeded-LDA). The estimator's code is adapted from the GibbsLDA++ library (Xuan-Hieu Phan, 2007). textmodel_seededlda() allows users to specify topics using a seed-word dictionary.

Usage

textmodel_lda(
  x,
  k = 10,
  max_iter = 2000,
  alpha = NULL,
  beta = NULL,
  model = NULL,
  verbose = quanteda_options("verbose")
)

textmodel_seededlda(
  x,
  dictionary,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  residual = 0,
  weight = 0.01,
  max_iter = 2000,
  alpha = NULL,
  beta = NULL,
  ...,
  verbose = quanteda_options("verbose")
)

Value

textmodel_seededlda() and textmodel_lda() return a list of model parameters: theta is the distribution of topics over documents; phi is the distribution of words over topics; alpha and beta are the small constants added to the frequencies of words to estimate theta and phi, respectively, in Gibbs sampling. Other elements in the list are subject to change.
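
For example, theta and phi can be inspected directly on a fitted object. A minimal sketch, assuming lda is a model fitted as in the Examples below:

dim(lda$theta)   # documents x topics; each row sums to 1
dim(lda$phi)     # topics x features; each row sums to 1
head(lda$theta)  # topic proportions of the first documents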

Arguments

x

the dfm on which the model will be fit

k

the number of topics; determined automatically by the number of keys in dictionary in textmodel_seededlda().

max_iter

the maximum number of iterations in Gibbs sampling.

alpha

the value to smooth topic-document distribution; defaults to alpha = 50 / k.

beta

the value to smooth topic-word distribution; defaults to beta = 0.1.

model

a fitted LDA model; if provided, textmodel_lda() inherits parameters from an existing model. See details.

verbose

logical; if TRUE, prints diagnostic information during fitting.

dictionary

a quanteda::dictionary() with seed words that define topics.

valuetype

see quanteda::valuetype

case_insensitive

see quanteda::valuetype

residual

the number of undefined topics. They are named "other" by default, but this can be changed via base::options(slda_residual_name); see the sketch after this list.

weight

the pseudo count given to seed words, as a proportion of the total number of words in x.

...

passed to quanteda::dfm_trim() to restrict seed words based on their term or document frequency. This is useful when glob patterns in the dictionary match too many words.
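
For example, residual topics can be renamed before fitting, and dfm_trim() arguments can be passed through .... A minimal sketch, assuming dfmt and dict are prepared as in the Examples below:

options(slda_residual_name = "misc")  # rename residual topics from "other" to "misc"
slda <- textmodel_seededlda(dfmt, dict,
                            residual = 2,       # add two undefined topics
                            weight = 0.01,      # pseudo count as a proportion of total tokens
                            min_termfreq = 10)  # passed to quanteda::dfm_trim()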

Details

To predict topics of new documents (i.e. out-of-sample), first create a new LDA model from an existing LDA model passed to model in textmodel_lda(); second, apply topics() to the new model. The model argument takes objects created either by textmodel_lda() or textmodel_seededlda().
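
A minimal sketch of this two-step workflow, using the same objects as in the Examples below:

lda <- textmodel_lda(head(dfmt, 450), k = 6)           # fit on training documents
lda_new <- textmodel_lda(tail(dfmt, 50), model = lda)  # new documents inherit parameters
topics(lda_new)                                        # predicted topics of the new documents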

References

Lu, Bin et al. (2011). "Multi-aspect Sentiment Analysis with Topic Models". doi:10.5555/2117693.2119585. Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops.

Watanabe, Kohei & Zhou, Yuan (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". doi:10.1177/0894439320907027. Social Science Computer Review.

See Also

topicmodels

Examples

require(seededlda)
require(quanteda)

corp <- head(data_corpus_moviereviews, 500)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks) %>%
    dfm_remove(stopwords('en'), min_nchar = 2) %>%
    dfm_trim(min_termfreq = 0.90, termfreq_type = "quantile",
             max_docfreq = 0.1, docfreq_type = "prop")

# unsupervised LDA
lda <- textmodel_lda(head(dfmt, 450), 6)
terms(lda)
topics(lda)
lda2 <- textmodel_lda(tail(dfmt, 50), model = lda) # new documents
topics(lda2)

# semisupervised LDA
dict <- dictionary(list(people = c("family", "couple", "kids"),
                        space = c("alien", "planet", "space"),
                        monster = c("monster*", "ghost*", "zombie*"),
                        war = c("war", "soldier*", "tanks"),
                        crime = c("crime*", "murder", "killer")))
slda <- textmodel_seededlda(dfmt, dict, residual = TRUE, min_termfreq = 10)
terms(slda)
topics(slda)
