lda_tidiers: Tidiers for LDA objects from the topicmodels package

Description

Tidy the results of a Latent Dirichlet Allocation.

Usage

tidy(x, matrix = c("beta", "gamma"), log = FALSE, ...)
augment(x, data, ...)
glance(x, ...)

Arguments

x

An LDA (or LDA_VEM) object from the topicmodels package

matrix

Whether to tidy the beta (per-term-per-topic, default) or gamma (per-document-per-topic) matrix

log

Whether beta/gamma should be on a log scale, default FALSE

...

Extra arguments, not used

data

For augment, the data given to the LDA function, either as a DocumentTermMatrix or as a tidied table with "document" and "term" columns

Value

tidy returns a tidied version of either the beta or gamma matrix.

If matrix == "beta" (default), returns a table with one row per topic and term, with columns

topic: Topic, as an integer
term: Term
beta: Probability of a term generated from a topic according to the multinomial model

If matrix == "gamma", returns a table with one row per topic and document, with columns

topic: Topic, as an integer
document: Document name or ID
gamma: Probability of topic given document
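
The column definitions above imply two sanity checks: beta should sum to 1 within each topic, and gamma should sum to 1 within each document. The following is a minimal sketch of those checks, not part of the package documentation. The names ap_small, lda_small, beta_totals, gamma_totals and beta_log are illustrative, and the model is deliberately tiny (20 documents, 2 topics) so that it fits quickly.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Illustrative small model; the subset size and k are arbitrary choices
data("AssociatedPress", package = "topicmodels")
ap_small <- AssociatedPress[1:20, ]
lda_small <- LDA(ap_small, k = 2, control = list(seed = 2016))

# beta is a distribution over terms, so it should sum to 1 within each topic
beta_totals <- tidy(lda_small, matrix = "beta") %>%
  group_by(topic) %>%
  summarise(total = sum(beta))
beta_totals

# gamma is a distribution over topics, so it should sum to 1 within each document
gamma_totals <- tidy(lda_small, matrix = "gamma") %>%
  group_by(document) %>%
  summarise(total = sum(gamma))
gamma_totals

# with log = TRUE the same column holds log-probabilities (all <= 0)
beta_log <- tidy(lda_small, matrix = "beta", log = TRUE)
range(beta_log$beta)
```

The log scale can be convenient when comparing very small probabilities between topics, since differences of logs correspond to ratios of probabilities.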

augment returns a table with one row per original document-term pair, such as is returned by tdm_tidiers:

document: Name of document (if present), or index
term: Term
.topic: Topic assignment

If the data argument is provided, any columns in the original data are included, combined based on the document and term columns.

glance always returns a one-row table, with columns

iter: Number of iterations used
terms: Number of terms in the model
alpha: If an LDA_VEM, the parameter of the Dirichlet distribution for topics over documents
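
The Examples section exercises only tidy, so here is a hedged sketch of how augment and glance might be called. The toy table, its counts and the names toy, toy_dtm, toy_lda, assignments and model_summary are made up for illustration. cast_dtm from tidytext is used here to build a DocumentTermMatrix with named documents.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Made-up toy corpus: one row per document-term pair with a count
toy <- tibble(
  document = c("d1", "d1", "d1", "d2", "d2", "d2", "d3", "d3", "d4", "d4"),
  term     = c("cat", "dog", "pet", "cat", "pet", "vet",
               "stock", "bond", "stock", "fund"),
  count    = c(3, 2, 1, 2, 2, 1, 4, 2, 3, 1)
)

# Cast to a DocumentTermMatrix and fit a two-topic model
toy_dtm <- cast_dtm(toy, document, term, count)
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 2016))

# One row per document-term pair, with the assigned topic in .topic
assignments <- augment(toy_lda, data = toy_dtm)
assignments

# One-row summary of the fitted model
model_summary <- glance(toy_lda)
model_summary
```

Passing the original data to augment keeps its columns (such as the count) next to the .topic assignment, which makes it easy to see which words drove each document's classification.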

Examples

set.seed(2016)
library(dplyr)
library(tidytext)  # provides the tidy(), augment() and glance() methods for LDA objects
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
ap <- AssociatedPress[1:100, ]
lda <- LDA(ap, control = list(alpha = 0.1), k = 4)

# get term distribution within each topic
td_lda <- tidy(lda)
td_lda

library(ggplot2)

# visualize the top terms within each topic
td_lda_filtered <- td_lda %>%
  filter(beta > .004) %>%
  mutate(term = reorder(term, beta))

ggplot(td_lda_filtered, aes(term, beta)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ topic, scales = "free") +
  theme(axis.text.x = element_text(angle = 90, size = 15))

# get classification of each document
td_lda_docs <- tidy(lda, matrix = "gamma")
td_lda_docs

doc_classes <- td_lda_docs %>%
  group_by(document) %>%
  top_n(1, gamma) %>%  # keep the highest-gamma topic for each document
  ungroup()

doc_classes

# which documents were we most uncertain about?
doc_classes %>%
  arrange(gamma)
