textmodel_seededlda: Semisupervised Latent Dirichlet allocation

Description

Implements semisupervised Latent Dirichlet allocation (Seeded LDA). textmodel_seededlda() allows users to specify topics using a seed word dictionary. Users can run Seeded Sequential LDA by setting gamma > 0.

Usage

textmodel_seededlda(
  x,
  dictionary,
  levels = 1,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  residual = 0,
  weight = 0.01,
  max_iter = 2000,
  auto_iter = FALSE,
  alpha = 0.5,
  beta = 0.1,
  gamma = 0,
  adjust_alpha = 0,
  batch_size = 1,
  ...,
  verbose = quanteda_options("verbose")
)

Value

The same as textmodel_lda() with extra elements for dictionary.

Arguments

x: the dfm on which the model will be fit.
dictionary: a quanteda::dictionary() with seed words that define topics.
levels: levels of entities in a hierarchical dictionary to be used as seed words. See also quanteda::flatten_dictionary.
valuetype: see quanteda::valuetype
case_insensitive: see quanteda::valuetype
residual: the number of undefined topics. They are named "other" by default, but it can be changed via base::options(seededlda_residual_name).
weight: determines the size of pseudo counts given to matched seed words.
max_iter: the maximum number of iteration in Gibbs sampling.
auto_iter: if TRUE, stops Gibbs sampling on convergence before reaching max_iter. See details.
alpha: the values to smooth topic-document distribution.
beta: the values to smooth topic-word distribution.
gamma: a parameter to determine change of topics between sentences or paragraphs. When gamma > 0, Gibbs sampling of topics for the current document is affected by the previous document's topics.
adjust_alpha: [experimental] if adjust_alpha > 0, automatically adjust alpha by the size of the topics. The smallest value of adjusted alpha will be alpha * (1 - adjust_alpha).
batch_size: split the corpus into the smaller batches (specified in proportion) for distributed computing; it is disabled when a batch include all the documents batch_size = 1.0. See details.
...: passed to quanteda::dfm_trim to restrict seed words based on their term or document frequency. This is useful when glob patterns in the dictionary match too many words.
verbose: logical; if TRUE print diagnostic information during fitting.

References

Lu, Bin et al. (2011). "Multi-aspect Sentiment Analysis with Topic Models". doi:10.5555/2117693.2119585. Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops.

Watanabe, Kohei & Zhou, Yuan. (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". doi:10.1177/0894439320907027. Social Science Computer Review.

Watanabe, Kohei & Baturo, Alexander. (2023). "Seeded Sequential LDA: A Semi-supervised Algorithm for Topic-specific Analysis of Sentences". doi:10.1177/08944393231178605. Social Science Computer Review.

Examples

Run this code

# \donttest{
require(seededlda)
require(quanteda)

corp <- head(data_corpus_moviereviews, 500)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_number = TRUE)
dfmt <- dfm(toks) %>%
    dfm_remove(stopwords("en"), min_nchar = 2) %>%
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")

dict <- dictionary(list(people = c("family", "couple", "kids"),
                        space = c("alien", "planet", "space"),
                        moster = c("monster*", "ghost*", "zombie*"),
                        war = c("war", "soldier*", "tanks"),
                        crime = c("crime*", "murder", "killer")))
lda_seed <- textmodel_seededlda(dfmt, dict, residual = TRUE, min_termfreq = 10,
                                max_iter = 500)
terms(lda_seed)
topics(lda_seed)
# }

Run the code above in your browser using DataLab