lda_tidiers: Tidiers for LDA and CTM objects from the topicmodels package

Description

Tidy the results of a Latent Dirichlet Allocation or Correlated Topic Model.

Usage

# S3 method for LDA
tidy(x, matrix = c("beta", "gamma"), log = FALSE, ...)
# S3 method for CTM
tidy(x, matrix = c("beta", "gamma"), log = FALSE, ...)
# S3 method for LDA
augment(x, data, ...)
# S3 method for CTM
augment(x, data, ...)
# S3 method for LDA
glance(x, ...)
# S3 method for CTM
glance(x, ...)

Arguments

x

An LDA or CTM (or LDA_VEM/CTM_VEM) object from the topicmodels package

matrix

Whether to tidy the beta (per-term-per-topic, default) or gamma (per-document-per-topic) matrix

log

Whether beta/gamma should be on a log scale, default FALSE

...

Extra arguments, not used

data

For augment, the data given to the LDA or CTM function, either as a DocumentTermMatrix or as a tidied table with "document" and "term" columns

Value

tidy returns a tidied version of either the beta or gamma matrix.

If matrix == "beta" (default), returns a table with one row per topic and term, with columns

topic: Topic, as an integer
term: Term
beta: Probability of the term being generated from the topic, according to the multinomial model

If matrix == "gamma", returns a table with one row per topic and document, with columns

topic: Topic, as an integer
document: Document name or ID
gamma: Probability of topic given document
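
As a rough check on how the two matrices and the log argument behave, the sketch below fits a small model on the same AssociatedPress subset used in the Examples section and confirms that beta sums to 1 within each topic, gamma sums to 1 within each document, and log = TRUE returns the logarithm of the default values. It assumes the tidy methods become available through library(tidytext); the object names (beta_totals, gamma_totals, log_beta) are illustrative only.

```r
library(dplyr)
library(tidytext)     # assumed to supply the tidy() methods documented here
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
set.seed(2016)
ap <- AssociatedPress[1:100, ]
lda <- LDA(ap, k = 4, control = list(alpha = 0.1))

# beta is a distribution over terms, so it should sum to 1 within each topic
beta_totals <- tidy(lda, matrix = "beta") %>%
  group_by(topic) %>%
  summarize(total = sum(beta))

# gamma is a distribution over topics, so it should sum to 1 within each document
gamma_totals <- tidy(lda, matrix = "gamma") %>%
  group_by(document) %>%
  summarize(total = sum(gamma))

# log = TRUE should return the same values on a log scale
log_beta <- tidy(lda, matrix = "beta", log = TRUE)
same_scale <- isTRUE(all.equal(exp(log_beta$beta), tidy(lda)$beta))
```

Working on the log scale can be useful when comparing very small term probabilities, since differences between them are easier to see there.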

augment returns a table with one row per original document-term pair, such as is returned by tdm_tidiers:

document: Name of document (if present), or index
term: Term
.topic: Topic assignment

If the data argument is provided, any columns in the original data are included, combined based on the document and term columns.
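
A minimal sketch of augment, assuming the method is available through library(tidytext) and passing the same DocumentTermMatrix that was given to LDA. The object names (assignments, topic_sizes) are illustrative; the exact set of extra columns depends on what the data argument contains.

```r
library(dplyr)
library(tidytext)     # assumed to supply the augment() method documented here
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
set.seed(2016)
ap <- AssociatedPress[1:100, ]
lda <- LDA(ap, k = 4, control = list(alpha = 0.1))

# one row per document-term pair, with the assigned topic in .topic
assignments <- augment(lda, data = ap)
assignments

# number of document-term pairs assigned to each topic
topic_sizes <- assignments %>%
  count(.topic)
topic_sizes
```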

glance always returns a one-row table, with columns

iter: Number of iterations used
terms: Number of terms in the model
alpha: If an LDA_VEM, the parameter of the Dirichlet distribution for topics over documents
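
A minimal sketch of glance on a model fit with the default VEM method, assuming the method is available through library(tidytext). The name model_summary is illustrative.

```r
library(tidytext)     # assumed to supply the glance() method documented here
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
set.seed(2016)
ap <- AssociatedPress[1:100, ]
lda <- LDA(ap, k = 4, control = list(alpha = 0.1))

# one-row summary of the fitted model
model_summary <- glance(lda)
model_summary
```

Because the result is always a single row, summaries from several models (for example, fits with different values of k) can be stacked into one table for comparison.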

Examples

# NOT RUN {
if (requireNamespace("topicmodels", quietly = TRUE)) {
  set.seed(2016)
  library(dplyr)
  library(tidytext)  # provides the tidy() methods for LDA and CTM objects
  library(topicmodels)

  data("AssociatedPress", package = "topicmodels")
  ap <- AssociatedPress[1:100, ]
  lda <- LDA(ap, control = list(alpha = 0.1), k = 4)

  # get term distribution within each topic
  td_lda <- tidy(lda)
  td_lda

  library(ggplot2)

  # visualize the top terms within each topic
  td_lda_filtered <- td_lda %>%
    filter(beta > .004) %>%
    mutate(term = reorder(term, beta))

  ggplot(td_lda_filtered, aes(term, beta)) +
    geom_bar(stat = "identity") +
    facet_wrap(~ topic, scales = "free") +
    theme(axis.text.x = element_text(angle = 90, size = 15))

  # get classification of each document
  td_lda_docs <- tidy(lda, matrix = "gamma")
  td_lda_docs

  # keep the most probable topic for each document
  doc_classes <- td_lda_docs %>%
    group_by(document) %>%
    top_n(1, gamma) %>%
    ungroup()

  doc_classes

  # which were we most uncertain about?
  doc_classes %>%
    arrange(gamma)
}

# }
