Text mining with Spark & sparklyr

knitr::opts_chunk$set(echo = TRUE) #knitr::opts_chunk$set(cache = TRUE) library(sparklyr) library(dplyr)

This article focuses on a set of functions that can be used for text mining with Spark and sparklyr. The main goal is to illustrate how to perform most of the data preparation and analysis with commands that will run inside the Spark cluster, as opposed to locally in R. Because of that, the amount of data used will be small.

Data source

For this example, there are two files that will be analyzed. They are both the full works of Sir Arthur Conan Doyle and Mark Twain. The files were downloaded from the Gutenberg Project site via the gutenbergr package. Intentionally, no data cleanup was done to the files prior to this analysis. See the appendix below to see how the data was downloaded and prepared.

readLines("arthur_doyle.txt", 10)

Data Import

Connect to Spark

An additional goal of this article is to encourage the reader to try it out, so a simple Spark local mode session is used.

library(sparklyr) library(dplyr) sc <- spark_connect(master = "local", version = "2.1.0")


The spark_read_text() is a new function which works like readLines() but for sparklyr. It comes in handy when non-structured data, such as lines in a book, is what is available for analysis.

# Imports Mark Twain's file # Setting up the path to the file in a Windows OS laptop twain_path <- paste0("file:///", getwd(), "/mark_twain.txt") twain <- spark_read_text(sc, "twain", twain_path)
# Imports Sir Arthur Conan Doyle's file doyle_path <- paste0("file:///", getwd(), "/arthur_doyle.txt") doyle <- spark_read_text(sc, "doyle", doyle_path)

Data transformation

The objective is to end up with a tidy table inside Spark with one row per word used. The steps will be:

  1. The needed data transformations apply to the data from both authors. The data sets will be appended to one another
  2. Punctuation will be removed
  3. The words inside each line will be separated, or tokenized
  4. For a cleaner analysis, stop words will be removed
  5. To tidy the data, each word in a line will become its own row
  6. The results will be saved to Spark memory


  • sdf_bind_rows() appends the doyle Spark Dataframe to the twain Spark Dataframe. This function can be used in lieu of a dplyr::bind_rows() wrapper function. For this exercise, the column author is added to differentiate between the two bodies of work.
    all_words <- doyle %>% mutate(author = "doyle") %>% sdf_bind_rows({ twain %>% mutate(author = "twain")}) %>% filter(nchar(line) > 0)


  • The Hive UDF, regexp_replace, is used as a sort of gsub() that works inside Spark. In this case it is used to remove punctuation. The usual [:punct:] regular expression did not work well during development, so a custom list is provided. For more information, see the Hive Functions section in the dplyr page.
    all_words <- all_words %>% mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " "))


  • ft_tokenizer() uses the Spark API to separate each word. It creates a new list column with the results. ```{r} all_words <- all_words %>% ft_tokenizer(input.col = "line",
             output.col = "word_list")

head(all_words, 4)

### ft_stop_words_remover()

- `ft_stop_words_remover()` is a new function that, as its name suggests, takes care of removing stop words from the previous transformation. It expects a list column, so it is important to sequence it correctly after a `ft_tokenizer()` command. In the sample results, notice that the new `wo_stop_words` column contains less items than `word_list`.
all_words <- all_words %>%
  ft_stop_words_remover(input.col = "word_list",
                        output.col = "wo_stop_words")

head(all_words, 4)


  • The Hive UDF explode performs the job of unnesting the tokens into their own row. Some further filtering and field selection is done to reduce the size of the dataset. ```{r} all_words <- all_words %>% mutate(word = explode(wo_stop_words)) %>% select(word, author) %>% filter(nchar(word) > 2)

head(all_words, 4)

### compute()

- `compute()` will operate this transformation and cache the results in Spark memory.  It is a good idea to pass a name to `compute()` to make it easier to identify it inside the Spark environment.  In this case the name will be *all_words*
all_words <- all_words %>%

Full code

This is what the code would look like on an actual analysis:

all_words <- doyle %>% mutate(author = "doyle") %>% sdf_bind_rows({ twain %>% mutate(author = "twain")}) %>% filter(nchar(line) > 0) %>% mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " ")) %>% ft_tokenizer(input.col = "line", output.col = "word_list") %>% ft_stop_words_remover(input.col = "word_list", output.col = "wo_stop_words") %>% mutate(word = explode(wo_stop_words)) %>% select(word, author) %>% filter(nchar(word) > 2) %>% compute("all_words")

Data Analysis

Words used the most

word_count <- all_words %>% group_by(author, word) %>% tally() %>% arrange(desc(n)) word_count

Words used by Doyle and not Twain

doyle_unique <- filter(word_count, author == "doyle") %>% anti_join(filter(word_count, author == "twain"), by = "word") %>% arrange(desc(n)) %>% compute("doyle_unique") doyle_unique
doyle_unique %>% head(100) %>% collect() %>% with(wordcloud::wordcloud( word, n, colors = c("#999999", "#E69F00", "#56B4E9","#56B4E9")))

Twain and Sherlock

The word cloud highlighted something interesting. The word lestrade is listed as one of the words used by Doyle but not Twain. Lestrade is the last name of a major character in the Sherlock Holmes books. It makes sense that the word "sherlock" appears considerably more times than "lestrade" in Doyle's books, so why is Sherlock not in the word cloud? Did Mark Twain use the word "sherlock" in his writings?

all_words %>% filter(author == "twain", word == "sherlock") %>% tally()

The all_words table contains 16 instances of the word sherlock in the words used by Twain in his works. The instr Hive UDF is used to extract the lines that contain that word in the twain table. This Hive function works can be used instead of base::grep() or stringr::str_detect(). To account for any word capitalization, the lower command will be used in mutate() to make all words in the full text lower cap.

instr & lower

Most of these lines are in a short story by Mark Twain called A Double Barrelled Detective Story. As per the Wikipedia page about this story, this is a satire by Twain on the mystery novel genre, published in 1902.

twain %>% mutate(line = lower(line)) %>% filter(instr(line, "sherlock") > 0) %>% pull(line)


gutenbergr package

This is an example of how the data for this article was pulled from the Gutenberg site:

library(gutenbergr) gutenberg_works() %>% filter(author == "Twain, Mark") %>% pull(gutenberg_id) %>% gutenberg_download() %>% pull(text) %>% writeLines("mark_twain.txt")