seededlda (version 0.5)

textmodel_lda: Semisupervised Latent Dirichlet allocation

Description

textmodel_seededlda() implements semisupervised Latent Dirichlet allocation (seeded-LDA). The estimator's code is adapted from the GibbsLDA++ library (Xuan-Hieu Phan, 2007). textmodel_seededlda() identifies pre-defined topics through semisupervised learning with a seed word dictionary.

Usage

textmodel_lda(
  x,
  k = 10,
  max_iter = 2000,
  alpha = NULL,
  beta = NULL,
  verbose = quanteda_options("verbose")
)

textmodel_seededlda(
  x,
  dictionary,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  residual = FALSE,
  weight = 0.01,
  max_iter = 2000,
  alpha = NULL,
  beta = NULL,
  verbose = quanteda_options("verbose")
)

Arguments

x

the dfm on which the model will be fit

k

the number of topics

max_iter

the maximum number of iterations in Gibbs sampling.

alpha

the hyperparameter for the topic-document distribution

beta

the hyperparameter for the topic-word distribution

verbose

logical; if TRUE print diagnostic information during fitting.

dictionary

a quanteda::dictionary() with seed words as examples of topics.

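valuetype

the type of pattern matching used for seed words: "glob", "regex" or "fixed".

case_insensitive

logical; if TRUE, ignore case when matching seed words.

residual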

if TRUE, a residual topic (or "garbage topic") is added to the user-defined topics.

weight

pseudo count given to seed words, as a proportion of the total number of words in x.
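
To make the weight argument concrete, its arithmetic can be sketched in base R. This is only an illustration of the description above, not the package's internal code. The toy count matrix x and the variable names total_words and pseudo_count are hypothetical.

```r
# Toy document-feature counts (hypothetical data): 3 documents, 4 features
x <- matrix(c(2, 0, 1, 3,
              0, 4, 1, 0,
              1, 1, 0, 2),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:3),
                            c("war", "soldier", "planet", "space")))

weight <- 0.01                        # default value shown under Usage
total_words <- sum(x)                 # total number of words in x
pseudo_count <- weight * total_words  # assumed pseudo count per seed word

print(total_words)
print(pseudo_count)
```

Under this reading, the pseudo count grows with the size of the corpus, so the same weight gives seed words a comparable relative influence on small and large inputs.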

References

Lu, Bin et al. (2011). Multi-aspect Sentiment Analysis with Topic Models. Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops.

Watanabe, Kohei & Zhou, Yuan (2020). Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches. Social Science Computer Review.

See Also

topicmodels

Examples

# NOT RUN {
require(quanteda)
require(seededlda)

data("data_corpus_moviereviews", package = "quanteda.textmodels")
corp <- head(data_corpus_moviereviews, 500)
dfmt <- dfm(corp, remove_numbers = TRUE) %>%
    dfm_remove(stopwords('en'), min_nchar = 2) %>%
    dfm_trim(min_termfreq = 0.90, termfreq_type = "quantile",
             max_docfreq = 0.1, docfreq_type = "prop")

# unsupervised LDA
lda <- textmodel_lda(dfmt, 6)
terms(lda)

# semisupervised LDA
dict <- dictionary(list(people = c("family", "couple", "kids"),
                        space = c("aliens", "planet", "space"),
                        monster = c("monster*", "ghost*", "zombie*"),
                        war = c("war", "soldier*", "tanks"),
                        crime = c("crime*", "murder", "killer")))
slda <- textmodel_seededlda(dfmt, dict, residual = TRUE)
terms(slda)
# }
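
Seed words such as "monster*" in the dictionary above are glob patterns, which is the default valuetype. The sketch below approximates that matching in base R with utils::glob2rx(), so it runs without quanteda. It is not quanteda's own matcher, and the helper match_seed() and the feature vector feats are hypothetical names used only for illustration.

```r
# Hypothetical feature names, standing in for the columns of a dfm
feats <- c("monster", "Monsters", "ghostly", "zombie", "demonstrate", "war")

# Match one glob pattern against whole feature names (illustrative helper)
match_seed <- function(pattern, features, case_insensitive = TRUE) {
  rx <- utils::glob2rx(pattern)  # e.g. "monster*" becomes "^monster"
  features[grepl(rx, features, ignore.case = case_insensitive)]
}

monster_hits <- match_seed("monster*", feats)
war_hits <- match_seed("war", feats)
strict_hits <- match_seed("monster*", feats, case_insensitive = FALSE)

print(monster_hits)
print(war_hits)
print(strict_hits)
```

A pattern without a wildcard, such as "war", matches only that exact feature, while a trailing * also picks up inflected forms.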