sumup (version 1.0.1)

topic_modeling: Topic Modeling with Latent Dirichlet Allocation (LDA)

Description

This function performs topic modeling on word count data using Latent Dirichlet Allocation (LDA). It supports both standard LDA and seeded LDA, where predefined topics can guide the topic modeling process.

Usage

topic_modeling(
  word_counts,
  seeded_topics,
  seed_weight,
  nr_topics,
  set_seed,
  lda_seed,
  lda_alpha,
  lda_best,
  lda_burnin,
  lda_verbose,
  lda_iter,
  lda_thin
)

Value

A topicmodels LDA object containing the result of the topic modeling. This object includes the topic distribution for each document and the terms associated with each topic.

Arguments

word_counts

A data frame or data.table containing word counts, with columns for the document ID (mID), word (word), and count (n).
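
As an illustration of the expected long format, a minimal word count table might look like the sketch below. The values are invented; only the column names mID, word, and n come from this page.

```r
# Toy word counts in long format: one row per (document, word) pair.
# All values are made up for illustration.
word_counts <- data.frame(
  mID  = c(1, 1, 2, 2, 3),
  word = c("price", "quality", "delivery", "price", "service"),
  n    = c(3L, 1L, 2L, 1L, 4L),
  stringsAsFactors = FALSE
)

# Basic sanity checks: required columns exist and no (mID, word) pair repeats.
stopifnot(all(c("mID", "word", "n") %in% names(word_counts)))
stopifnot(!anyDuplicated(word_counts[, c("mID", "word")]))

str(word_counts)
```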

seeded_topics

A list of character vectors representing predefined terms for each seed topic. If provided, seeded LDA will be performed.

seed_weight

A numeric value indicating the weight assigned to the seeded terms in the LDA model. This parameter influences how strongly the predefined seed topics affect the topic modeling.

nr_topics

An integer specifying the number of topics to be modeled by the LDA algorithm.

set_seed

A numeric value used as the random seed for the topic modeling algorithm. Defaults to 1234.

lda_seed

A numeric seed for Gibbs sampling. Defaults to 1000.

lda_alpha

A numeric value that sets the initial value for alpha.

lda_best

If TRUE, only the model with the maximum (posterior) likelihood is returned. Defaults to TRUE.

lda_burnin

Number of Gibbs iterations omitted at the beginning (burn-in). Defaults to 0.

lda_verbose

A numeric value. If a positive integer, progress is reported every lda_verbose iterations. If 0 (default), no output is generated during model fitting.

lda_iter

Number of Gibbs iterations (after omitting the burn-in iterations). Defaults to 2000.

lda_thin

Number of Gibbs iterations omitted between retained iterations. Defaults to lda_iter.
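
The lda_* arguments mirror the Gibbs control options of topicmodels::LDA (seed, alpha, best, burnin, verbose, iter, thin). The sketch below shows one way such arguments could be gathered into a control list. The helper make_gibbs_control is hypothetical and not part of sumup, and the alpha value in the call is only illustrative, since this page does not state a default for lda_alpha.

```r
# Hypothetical helper (NOT part of sumup): collects lda_* style arguments
# into a control list with the names used by topicmodels' Gibbs sampler.
make_gibbs_control <- function(lda_alpha,
                               lda_seed = 1000,
                               lda_best = TRUE,
                               lda_burnin = 0,
                               lda_verbose = 0,
                               lda_iter = 2000,
                               lda_thin = lda_iter) {
  list(
    seed    = as.integer(lda_seed),
    alpha   = lda_alpha,
    best    = lda_best,
    burnin  = as.integer(lda_burnin),
    verbose = as.integer(lda_verbose),
    iter    = as.integer(lda_iter),
    thin    = as.integer(lda_thin)   # falls back to lda_iter when not supplied
  )
}

# alpha = 0.1 is an illustrative value, not a documented default.
ctrl <- make_gibbs_control(lda_alpha = 0.1, lda_iter = 500)
str(ctrl)
```

Because lda_thin falls back to lda_iter here, changing only lda_iter keeps the two values in step, which matches the default described for lda_thin.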

Details

The topic_modeling function fits a Latent Dirichlet Allocation (LDA) model to a document-term matrix (DTM) derived from word_counts. If seeded_topics is provided, a seeded LDA approach is used in which the predefined terms guide the topics the model generates. The function supports:

  • Standard LDA: Uses the traditional Gibbs sampling approach to estimate topics from word counts.

  • Seeded LDA: Incorporates predefined seed terms into the LDA model by assigning a weight (seed_weight) to these terms.
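
One common way to express seed terms for a Gibbs-sampled LDA is a topics-by-terms weight matrix, which topicmodels::LDA accepts through its seedwords argument. Whether topic_modeling builds its seeds in exactly this way is an assumption; the vocabulary, seed terms, and weight below are invented for illustration.

```r
# Hypothetical inputs: the DTM vocabulary and two seeded topics.
vocab         <- c("delivery", "price", "quality", "service", "refund")
seeded_topics <- list(c("price", "refund"), c("delivery", "service"))
seed_weight   <- 100  # illustrative value
nr_topics     <- 3    # the third topic is left unseeded

# One row per topic, one column per vocabulary term, zero everywhere
# except where a seed term is pinned to a topic.
seed_matrix <- matrix(
  0,
  nrow = nr_topics,
  ncol = length(vocab),
  dimnames = list(NULL, vocab)
)

for (k in seq_along(seeded_topics)) {
  # Ignore seed terms that do not occur in the vocabulary.
  terms_k <- intersect(seeded_topics[[k]], vocab)
  seed_matrix[k, terms_k] <- seed_weight
}

seed_matrix
```

A larger seed_weight pulls the seeded terms more strongly toward their assigned topic, while rows left at zero behave like ordinary unseeded topics.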

The function uses the topicmodels package for LDA and the slam package to manipulate sparse matrices. The results are captured in an LDA model object, which contains topic-word distributions and document-topic assignments.
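
The sketch below shows how long-format word counts could be turned into a sparse matrix with slam and fitted with topicmodels using Gibbs sampling. It illustrates the general technique under stated assumptions and is not the internal code of topic_modeling; the data, the number of topics, and the iteration counts are invented and kept small so the example runs quickly.

```r
# Toy word counts (illustrative values only).
word_counts <- data.frame(
  mID  = c("d1", "d1", "d2", "d2", "d3", "d3"),
  word = c("price", "refund", "delivery", "service", "price", "service"),
  n    = c(3L, 2L, 4L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)

docs  <- unique(word_counts$mID)
vocab <- unique(word_counts$word)

# Run only if both packages are installed.
if (requireNamespace("slam", quietly = TRUE) &&
    requireNamespace("topicmodels", quietly = TRUE)) {

  # Sparse documents-by-terms matrix with integer counts.
  dtm <- slam::simple_triplet_matrix(
    i = match(word_counts$mID, docs),
    j = match(word_counts$word, vocab),
    v = as.integer(word_counts$n),
    nrow = length(docs),
    ncol = length(vocab),
    dimnames = list(docs, vocab)
  )

  # Standard (unseeded) LDA with Gibbs sampling.
  fit <- topicmodels::LDA(
    dtm,
    k = 2,
    method = "Gibbs",
    control = list(seed = 1000L, burnin = 0L, iter = 200L,
                   thin = 200L, best = TRUE)
  )

  # Top terms per topic and the per-document topic distribution.
  print(topicmodels::terms(fit, 3))
  print(round(topicmodels::posterior(fit)$topics, 2))
}
```

The object returned by topicmodels::LDA is of the same kind described under Value, so terms() and posterior() should apply to the result of topic_modeling as well.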