textmodel_lda: Semisupervised Latent Dirichlet allocation

Description

textmodel_seededlda() implements semisupervised Latent Dirichlet allocation (seeded-LDA), which identifies pre-defined topics by semisupervised learning with a seed word dictionary. textmodel_lda() fits an unsupervised LDA model. The estimator's code is adapted from the GibbsLDA++ library (Xuan-Hieu Phan, 2007).

Usage

textmodel_lda(
  x,
  k = 10,
  max_iter = 2000,
  alpha = NULL,
  beta = NULL,
  verbose = quanteda_options("verbose")
)
textmodel_seededlda(
  x,
  dictionary,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  residual = FALSE,
  weight = 0.01,
  max_iter = 2000,
  alpha = NULL,
  beta = NULL,
  verbose = quanteda_options("verbose")
)

Arguments

x

the dfm on which the model will be fit.

k

the number of topics.

max_iter

the maximum number of iterations in Gibbs sampling.

alpha

the hyperparameter for the topic-document distribution.

beta

the hyperparameter for the topic-word distribution.

verbose

logical; if TRUE print diagnostic information during fitting.

dictionary

a quanteda::dictionary() with seed words as examples of topics.

valuetype

see quanteda::valuetype

case_insensitive

see quanteda::valuetype
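
The effect of valuetype = "glob" and case_insensitive on seed words can be illustrated with base R alone. This is only a rough sketch: match_glob() is a hypothetical helper written for this illustration, the feature names are made up, and quanteda's own pattern matching (not utils::glob2rx()) is what the package actually uses.

```r
# Hypothetical feature names standing in for the columns of a dfm
features <- c("monster", "monsters", "Ghostly", "zombie",
              "demonstrate", "war", "warfare")

# Seed words written as glob patterns, as in a seed dictionary
patterns <- c("monster*", "ghost*", "zombie*", "war")

# match_glob() is an illustrative helper, not part of any package:
# it converts each glob to a regular expression and keeps the
# features matched by at least one pattern
match_glob <- function(patterns, features, case_insensitive = TRUE) {
  hits <- vapply(
    patterns,
    function(p) grepl(utils::glob2rx(p), features,
                      ignore.case = case_insensitive),
    logical(length(features))
  )
  features[rowSums(hits) > 0]
}

matched_ci <- match_glob(patterns, features, case_insensitive = TRUE)
matched_cs <- match_glob(patterns, features, case_insensitive = FALSE)
print(matched_ci)
print(matched_cs)
```

In this sketch "war" without a wildcard matches only the exact feature, while "monster*" also picks up inflected forms; "Ghostly" is matched only when case is ignored.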

residual

if TRUE, a residual topic (or "garbage topic") will be added to the user-defined topics.

weight

pseudo count given to seed words as a proportion of total number of words in x.
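
As a rough worked example of the description above, the pseudo count implied by a given weight can be computed by hand. The count matrix below is made up for illustration, and the arithmetic only follows the wording "a proportion of total number of words in x"; how the estimator applies the value internally is not shown here.

```r
# Hypothetical toy counts standing in for a dfm (documents x features)
x <- matrix(c(2, 0, 1,
              0, 3, 1,
              1, 1, 4),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"),
                            c("war", "planet", "family")))

weight <- 0.01                         # default in textmodel_seededlda()
total_words <- sum(x)                  # total number of words in x
pseudo_count <- weight * total_words   # pseudo count implied by the weight

print(total_words)
print(pseudo_count)
```

Because the pseudo count scales with the size of x, the same weight gives seed words more absolute influence in a larger corpus.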

References

Lu, Bin et al. (2011). Multi-aspect Sentiment Analysis with Topic Models. Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops.

Watanabe, Kohei & Zhou, Yuan (2020). Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches. Social Science Computer Review.

Examples

# NOT RUN {
require(quanteda)
require(seededlda)

data("data_corpus_moviereviews", package = "quanteda.textmodels")
corp <- head(data_corpus_moviereviews, 500)
dfmt <- dfm(corp, remove_numbers = TRUE) %>%
    dfm_remove(stopwords('en'), min_nchar = 2) %>%
    dfm_trim(min_termfreq = 0.90, termfreq_type = "quantile",
             max_docfreq = 0.1, docfreq_type = "prop")

# unsupervised LDA
lda <- textmodel_lda(dfmt, 6)
terms(lda)

# semisupervised LDA
dict <- dictionary(list(people = c("family", "couple", "kids"),
                        space = c("aliens", "planet", "space"),
                        monster = c("monster*", "ghost*", "zombie*"),
                        war = c("war", "soldier*", "tanks"),
                        crime = c("crime*", "murder", "killer")))
slda <- textmodel_seededlda(dfmt, dict, residual = TRUE)
terms(slda)
# }
