
tidytext (version 0.1.1)

lda_tidiers: Tidiers for LDA objects from the topicmodels package

Description

Tidy the results of a Latent Dirichlet Allocation.

Usage

tidy(x, matrix = c("beta", "gamma"), log = FALSE, ...)
augment(x, data, ...)
glance(x, ...)

Arguments

x
An LDA (or LDA_VEM) object from the topicmodels package
matrix
Whether to tidy the beta (per-term-per-topic, default) or gamma (per-document-per-topic) matrix
log
Whether beta/gamma should be on a log scale, default FALSE
...
Extra arguments, not used
data
For augment, the data given to the LDA function, either as a DocumentTermMatrix or as a tidied table with "document" and "term" columns
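
As a rough sketch of how the matrix and log arguments interact, the snippet below fits a small model on a subset of AssociatedPress (the same data used in the Examples section) and compares the default output with the log-scale output. The object names (fit, beta_raw, beta_log, gamma_raw) and the choice of k = 2 are illustrative assumptions, not part of the package.

```r
library(topicmodels)
library(tidytext)

set.seed(2016)
data("AssociatedPress", package = "topicmodels")
fit <- LDA(AssociatedPress[1:50, ], k = 2)   # k = 2 is an arbitrary choice

beta_raw <- tidy(fit, matrix = "beta")               # probabilities
beta_log <- tidy(fit, matrix = "beta", log = TRUE)   # log-probabilities

# Both calls return rows in the same order, so the columns line up
beta_match <- isTRUE(all.equal(exp(beta_log$beta), beta_raw$beta))

# gamma holds per-document topic proportions; they should sum to 1 per document
gamma_raw <- tidy(fit, matrix = "gamma")
gamma_sums <- tapply(gamma_raw$gamma, gamma_raw$document, sum)
```

If the tidiers behave as described above, beta_match should be TRUE and every entry of gamma_sums should be close to 1.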

Value

tidy returns a tidied version of either the beta or gamma matrix.
If matrix == "beta" (default), returns a table with one row per topic and term, with columns
topic
Topic, as an integer
term
Term
beta
Probability of the term being generated from the topic, according to the multinomial model
If matrix == "gamma", returns a table with one row per topic and document, with columns
topic
Topic, as an integer
document
Document name or ID
gamma
Probability of topic given document
augment returns a table with one row per original document-term pair, such as is returned by tdm_tidiers:
document
Name of document (if present), or index
term
Term
.topic
Topic assignment
If the data argument is provided, any columns in the original data are included, combined based on the document and term columns.
glance always returns a one-row table, with columns
iter
Number of iterations used
terms
Number of terms in the model
alpha
If an LDA_VEM, the parameter of the Dirichlet distribution for topics over documents
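
The Examples section only exercises tidy, so here is a minimal sketch of the other two verbs. It assumes that attaching tidytext makes the augment and glance generics available (if it does not in your installation, attaching broom or generics should); the names ap, fit, and assignments and the value k = 2 are illustrative only.

```r
library(topicmodels)
library(tidytext)

set.seed(2016)
data("AssociatedPress", package = "topicmodels")
ap <- AssociatedPress[1:50, ]
fit <- LDA(ap, k = 2)   # k = 2 is an arbitrary choice

# One row per document-term pair, with the assigned topic in .topic
assignments <- augment(fit, data = ap)
head(assignments)

# One-row summary of the fitted model
model_summary <- glance(fit)
model_summary
```

Passing the original DocumentTermMatrix as data keeps its columns alongside .topic, which makes it straightforward to count how many terms in each document were assigned to each topic.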

Examples


set.seed(2016)
library(dplyr)
library(topicmodels)
library(tidytext)  # provides the tidy, augment, and glance methods for LDA objects

data("AssociatedPress", package = "topicmodels")
ap <- AssociatedPress[1:100, ]
lda <- LDA(ap, control = list(alpha = 0.1), k = 4)

# get term distribution within each topic
td_lda <- tidy(lda)
td_lda

library(ggplot2)

# visualize the top terms within each topic
td_lda_filtered <- td_lda %>%
  filter(beta > .004) %>%
  mutate(term = reorder(term, beta))

ggplot(td_lda_filtered, aes(term, beta)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ topic, scales = "free") +
  theme(axis.text.x = element_text(angle = 90, size = 15))

# get classification of each document
td_lda_docs <- tidy(lda, matrix = "gamma")
td_lda_docs

doc_classes <- td_lda_docs %>%
  group_by(document) %>%
  top_n(1, gamma) %>%  # keep the most probable topic for each document
  ungroup()

doc_classes

# which were we most uncertain about?
doc_classes %>%
  arrange(gamma)
