Converting to and from Document-Term Matrix and Corpus objects

knitr::opts_chunk$set(fig.width = 7, fig.height = 7, message = FALSE, warning = FALSE, eval = requireNamespace("tm", quietly = TRUE) && requireNamespace("quanteda", quietly = TRUE) && requireNamespace("topicmodels", quietly = TRUE)) library(ggplot2) theme_set(theme_bw())

Tidying document-term matrices

Many existing text mining datasets are in the form of a DocumentTermMatrix class (from the tm package). For example, consider the corpus of 2246 Associated Press articles from the topicmodels package:

library(tm) data("AssociatedPress", package = "topicmodels") AssociatedPress

If we want to analyze this with tidy tools, we need to turn it into a one-term-per-document-per-row data frame first. The tidy function does this. (For more on the tidy verb, see the broom package).

library(dplyr) library(tidytext) ap_td <- tidy(AssociatedPress)

Just as shown in this vignette, having the text in this format is convenient for analysis with the tidytext package. For example, you can perform sentiment analysis on these newspaper articles.

ap_sentiments <- ap_td %>% inner_join(get_sentiments("bing"), by = c(term = "word")) ap_sentiments

We can find the most negative documents:

library(tidyr) ap_sentiments %>% count(document, sentiment, wt = count) %>% spread(sentiment, n, fill = 0) %>% mutate(sentiment = positive - negative) %>% arrange(sentiment)

Or visualize which words contributed to positive and negative sentiment:

library(ggplot2) ap_sentiments %>% count(sentiment, term, wt = count) %>% filter(n >= 150) %>% mutate(n = ifelse(sentiment == "negative", -n, n)) %>% mutate(term = reorder(term, n)) %>% ggplot(aes(term, n, fill = sentiment)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ylab("Contribution to sentiment")

Note that a tidier is also available for the dfm class from the quanteda package:

library(methods) data("data_corpus_inaugural", package = "quanteda") d <- quanteda::dfm(data_corpus_inaugural, verbose = FALSE) d tidy(d)

Casting tidy text data into a DocumentTermMatrix

Some existing text mining tools or algorithms work only on sparse document-term matrices. Therefore, tidytext provides cast_ verbs for converting from a tidy form to these matrices.

ap_td # cast into a Document-Term Matrix ap_td %>% cast_dtm(document, term, count) # cast into a Term-Document Matrix ap_td %>% cast_tdm(term, document, count) # cast into quanteda's dfm ap_td %>% cast_dfm(term, document, count) # cast into a Matrix object m <- ap_td %>% cast_sparse(document, term, count) class(m) dim(m)

This allows for easy reading, filtering, and processing to be done using dplyr and other tidy tools, after which the data can be converted into a document-term matrix for machine learning applications.

Tidying corpus data

You can also tidy Corpus objects from the tm package. For example, consider a Corpus containing 20 documents, one for each

reut21578 <- system.file("texts", "crude", package = "tm") reuters <- VCorpus(DirSource(reut21578), readerControl = list(reader = readReut21578XMLasPlain)) reuters

The tidy verb creates a table with one row per document:

reuters_td <- tidy(reuters) reuters_td

Similarly, you can tidy a corpus object from the quanteda package:

library(quanteda) data("data_corpus_inaugural") data_corpus_inaugural inaug_td <- tidy(data_corpus_inaugural) inaug_td

This lets us work with tidy tools like unnest_tokens to analyze the text alongside the metadata.

inaug_words <- inaug_td %>% unnest_tokens(word, text) %>% anti_join(stop_words) inaug_words

We could then, for example, see how the appearance of a word changes over time:

inaug_freq <- inaug_words %>% count(Year, word) %>% complete(Year, word, fill = list(n = 0)) %>% group_by(Year) %>% mutate(year_total = sum(n), percent = n / year_total) %>% ungroup() inaug_freq

For example, we can use the broom package to perform logistic regression on each word.

library(broom) models <- inaug_freq %>% group_by(word) %>% filter(sum(n) > 50) %>% do(tidy(glm(cbind(n, year_total - n) ~ Year, ., family = "binomial"))) %>% ungroup() %>% filter(term == "Year") models models %>% filter(term == "Year") %>% arrange(desc(abs(estimate)))

You can show these models as a volcano plot, which compares the effect size with the significance:

library(ggplot2) models %>% mutate(adjusted.p.value = p.adjust(p.value)) %>% ggplot(aes(estimate, adjusted.p.value)) + geom_point() + scale_y_log10() + geom_text(aes(label = word), vjust = 1, hjust = 1, check_overlap = TRUE) + xlab("Estimated change over time") + ylab("Adjusted p-value")

We can also use the ggplot2 package to display the top 6 terms that have changed in frequency over time.

library(scales) models %>% top_n(6, abs(estimate)) %>% inner_join(inaug_freq) %>% ggplot(aes(Year, percent)) + geom_point() + geom_smooth() + facet_wrap(~ word) + scale_y_continuous(labels = percent_format()) + ylab("Frequency of word in speech")