```r
library(knitr)
opts_chunk$set(message = FALSE, warning = FALSE)
options(width = 100, dplyr.width = 150)
library(ggplot2)
theme_set(theme_bw())
```
Topic modeling is a method for unsupervised classification of documents: it models each document as a mixture of topics and each topic as a mixture of words. Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model.
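In the notation we'll use for the tidied model outputs below (a brief aside, not part of the original text): LDA assumes the probability of word $w$ appearing in document $d$ marginalizes over the topics,

$$p(w \mid d) = \sum_{k=1}^{K} \gamma_{d,k} \, \beta_{k,w}$$

where $\gamma_{d,k}$ is the proportion of document $d$ devoted to topic $k$ and $\beta_{k,w}$ is the probability of word $w$ within topic $k$.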
We can use tidy text principles, as described in the main vignette, to approach topic modeling using consistent and effective tools. In particular, we'll be using tidying functions for LDA objects from the topicmodels package.
Can we tell the difference between Dickens, Wells, Verne, and Austen?
Suppose a vandal has broken into your study and torn apart four of your books:
- Great Expectations by Charles Dickens
- The War of the Worlds by H.G. Wells
- Twenty Thousand Leagues Under the Sea by Jules Verne
- Pride and Prejudice by Jane Austen
This vandal has torn the books into individual chapters, and left them in one large pile. How can we restore these disorganized chapters to their original books?
```r
library(dplyr)
library(gutenbergr)

titles <- c("Twenty Thousand Leagues under the Sea",
            "The War of the Worlds",
            "Pride and Prejudice",
            "Great Expectations")

books <- gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title")
```
```r
# Downloading from Project Gutenberg can sometimes fail on automated servers
# such as Travis-CI; see
# https://github.com/ropenscilabs/gutenbergr/issues/6#issuecomment-231596903
# This loads a pre-downloaded copy as a workaround.
load(system.file("extdata", "books.rda", package = "tidytext"))
```
As pre-processing, we divide these into chapters, use tidytext's `unnest_tokens()` to separate them into words, then remove `stop_words`. We're treating every chapter as a separate "document", each with a name like `Great Expectations_1` or `Pride and Prejudice_11`.
```r
library(tidytext)
library(stringr)
library(tidyr)

by_chapter <- books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0)

by_chapter_word <- by_chapter %>%
  unite(title_chapter, title, chapter) %>%
  unnest_tokens(word, text)

word_counts <- by_chapter_word %>%
  anti_join(stop_words) %>%
  count(title_chapter, word, sort = TRUE) %>%
  ungroup()

word_counts
```
Latent Dirichlet Allocation with the topicmodels package
Right now this data frame is in a tidy form, with one term per document per row. However, the topicmodels package requires a `DocumentTermMatrix` (from the tm package). As described in this vignette, we can cast a one-token-per-row table into a `DocumentTermMatrix` with tidytext's `cast_dtm()`:
```r
chapters_dtm <- word_counts %>%
  cast_dtm(title_chapter, word, n)

chapters_dtm
```
Now we are ready to use the topicmodels package to create a four-topic LDA model.
```r
library(topicmodels)

chapters_lda <- LDA(chapters_dtm, k = 4, control = list(seed = 1234))
chapters_lda
```
(In this case we know there are four topics because there are four books; in practice we may need to try a few different values of `k`.)
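One common heuristic for choosing `k`, sketched here as an aside rather than as part of this analysis, is to fit models at several values and compare them with `topicmodels::perplexity()`, where lower is better:

```r
# A sketch: compare a few choices of k by perplexity.
# Caveat: perplexity on the training data tends to keep decreasing as k
# grows, so a held-out test set would give a fairer comparison.
ks <- c(2, 4, 8)
perplexities <- sapply(ks, function(k) {
  perplexity(LDA(chapters_dtm, k = k, control = list(seed = 1234)))
})
data.frame(k = ks, perplexity = perplexities)
```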
Now tidytext gives us the option of returning to a tidy analysis, using the `tidy()` and `augment()` verbs borrowed from the broom package. In particular, we start with `tidy()`:
```r
chapters_lda_td <- tidy(chapters_lda)
chapters_lda_td
```
Notice that this has turned the model into a one-topic-per-term-per-row format. For each combination, the model computes $\beta$, the probability of that term being generated from that topic.
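Since each topic is a distribution over terms, its $\beta$ values should sum to 1; a quick sanity check (a sketch, not in the original analysis):

```r
# Each topic's term probabilities form a distribution summing to 1
chapters_lda_td %>%
  group_by(topic) %>%
  summarize(total_beta = sum(beta))
```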
We could use dplyr's `top_n()` to find the top 5 terms within each topic:
```r
top_terms <- chapters_lda_td %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms
```
This model lends itself to a visualization:
```r
library(ggplot2)
theme_set(theme_bw())

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ topic, scales = "free") +
  theme(axis.text.x = element_text(size = 15, angle = 90, hjust = 1))
```
These topics are pretty clearly associated with the four books! There's no question that the topic of "nemo", "sea", and "nautilus" belongs to Twenty Thousand Leagues Under the Sea, and that "jane", "darcy", and "elizabeth" belongs to Pride and Prejudice. We see "pip" and "joe" from Great Expectations and "martians", "black", and "night" from The War of the Worlds.
Each chapter was a "document" in this analysis. Thus, we may want to know which topics are associated with each document. Can we put the chapters back together in the correct books?
```r
chapters_lda_gamma <- tidy(chapters_lda, matrix = "gamma")
chapters_lda_gamma
```
Setting `matrix = "gamma"` returns a tidied version with one document per topic per row; each value is $\gamma$, the estimated proportion of that document's words generated from that topic. Now that we have these document classifications, we can see how well our unsupervised learning did at distinguishing the four books. First we re-separate the document name into title and chapter:
```r
chapters_lda_gamma <- chapters_lda_gamma %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE)

chapters_lda_gamma
```
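As with $\beta$, the $\gamma$ values for a given chapter are proportions that should sum to 1; a quick check (a sketch):

```r
# Each chapter's topic proportions sum to 1
chapters_lda_gamma %>%
  group_by(title, chapter) %>%
  summarize(total_gamma = sum(gamma))
```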
Then we examine what fraction of chapters we got right for each:
```r
ggplot(chapters_lda_gamma, aes(gamma, fill = factor(topic))) +
  geom_histogram() +
  facet_wrap(~ title, nrow = 2)
```
We notice that almost all of the chapters from Pride and Prejudice, The War of the Worlds, and Twenty Thousand Leagues Under the Sea were uniquely identified as a single topic each.
Are there any cases where the topic most associated with a chapter belonged to another book? First we find the topic most associated with each chapter using `top_n()`, which is effectively that chapter's "classification":

```r
chapter_classifications <- chapters_lda_gamma %>%
  group_by(title, chapter) %>%
  top_n(1, gamma) %>%
  ungroup() %>%
  arrange(gamma)

chapter_classifications
```
We can determine this by finding the consensus book for each topic, which we note is correct based on our earlier visualization:
```r
book_topics <- chapter_classifications %>%
  count(title, topic) %>%
  group_by(title) %>%   # find the most common topic within each book
  top_n(1, n) %>%
  ungroup() %>%
  transmute(consensus = title, topic)

book_topics
```
Then we see which chapters were misidentified:
```r
chapter_classifications %>%
  inner_join(book_topics, by = "topic") %>%
  count(title, consensus)
```
We see that only a few chapters from Great Expectations were misclassified. Not bad for unsupervised clustering!
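If we wanted to see exactly which chapters those were, a short sketch along the same lines would filter for disagreements between title and consensus:

```r
# A sketch: chapters whose strongest topic belongs to another book
chapter_classifications %>%
  inner_join(book_topics, by = "topic") %>%
  filter(title != consensus)
```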
By word assignments: `augment`
One important step in the topic modeling expectation-maximization algorithm is assigning each word in each document to a topic. The more words in a document are assigned to that topic, generally, the more weight (`gamma`) will go on that document-topic classification.
We may want to take the original document-word pairs and find which words in each document were assigned to which topic. This is the job of the `augment()` verb:
```r
assignments <- augment(chapters_lda, data = chapters_dtm)
```
We can combine this with the consensus book titles to find which words were incorrectly classified.
```r
assignments <- assignments %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
  inner_join(book_topics, by = c(".topic" = "topic"))

assignments
```
We can, for example, create a "confusion matrix" using dplyr's `count()` and tidyr's `spread()`:
```r
assignments %>%
  count(title, consensus, wt = count) %>%
  spread(consensus, n, fill = 0)
```
We notice that almost all the words for Pride and Prejudice, Twenty Thousand Leagues Under the Sea, and The War of the Worlds were correctly assigned, while Great Expectations had a fair amount of misassignment.
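To put a single number on this, we could compute the overall fraction of words assigned to the correct book (a sketch, weighting each term by its count):

```r
# A sketch: word-level accuracy, weighted by how often each word occurs
assignments %>%
  summarize(accuracy = sum(count[title == consensus]) / sum(count))
```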
What were the most commonly mistaken words?
```r
wrong_words <- assignments %>%
  filter(title != consensus)

wrong_words

wrong_words %>%
  count(title, consensus, term, wt = count) %>%
  ungroup() %>%
  arrange(desc(n))
```
Notice the word "flopson" here; these wrong words do not necessarily appear in the novels they were misassigned to. Indeed, we can confirm "flopson" appears only in Great Expectations:
```r
word_counts %>%
  filter(word == "flopson")
```
The algorithm is stochastic and iterative, and it can accidentally land on a topic that spans multiple books.
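To see that stochasticity in action, one could refit the model with a different (arbitrary) seed and compare the top terms; the topics may come back permuted, or occasionally blended across books (a sketch):

```r
# A sketch: a different random seed can yield different topics
chapters_lda2 <- LDA(chapters_dtm, k = 4, control = list(seed = 42))
terms(chapters_lda2, 5)  # top 5 terms per topic; compare with the earlier model
```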