textrecipes

Introduction

textrecipes contains extra steps for the recipes package for preprocessing text data.

Installation

You can install the released version of textrecipes from CRAN with:

install.packages("textrecipes")

Install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("tidymodels/textrecipes")

Example

The following example walks through the steps needed to convert a character variable to the TF-IDF of its tokenized words, after removing stopwords and keeping only the 100 most frequent words. The preprocessing is applied to the variables essay0 and essay1.

library(recipes)
library(textrecipes)
library(modeldata)

data(okc_text)

okc_rec <- recipe(~  essay0 + essay1, data = okc_text) %>%
  step_tokenize(essay0, essay1) %>% # Tokenizes to words by default
  step_stopwords(essay0, essay1) %>% # Uses the english snowball list by default
  step_tokenfilter(essay0, essay1, max_tokens = 100) %>%
  step_tfidf(essay0, essay1)
   
okc_obj <- okc_rec %>%
  prep()
   
str(bake(okc_obj, okc_text), list.len = 15)
#> tibble [750 × 200] (S3: tbl_df/tbl/data.frame)
#>  $ tfidf_essay0_also      : num [1:750] 0 0 0.0252 0.2232 0 ...
#>  $ tfidf_essay0_always    : num [1:750] 0 0 0 0 0 ...
#>  $ tfidf_essay0_amp       : num [1:750] 0.47 0.583 0 0 0 ...
#>  $ tfidf_essay0_anything  : num [1:750] 0 0 0.113 0 0 ...
#>  $ tfidf_essay0_area      : num [1:750] 0 0 0 0 0 ...
#>  $ tfidf_essay0_around    : num [1:750] 0 0 0.0348 0 0 ...
#>  $ tfidf_essay0_art       : num [1:750] 0 0 0 0 0 ...
#>  $ tfidf_essay0_back      : num [1:750] 0 0 0 0 0 ...
#>  $ tfidf_essay0_bay       : num [1:750] 0 0 0 0 0 ...
#>  $ tfidf_essay0_believe   : num [1:750] 0 0 0 0 0.314 ...
#>  $ tfidf_essay0_big       : num [1:750] 0.0781 0 0 0 0 ...
#>  $ tfidf_essay0_bit       : num [1:750] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_essay0_br        : num [1:750] 0.121 0.565 0.121 0 0 ...
#>  $ tfidf_essay0_can       : num [1:750] 0.0488 0 0.0244 0 0 ...
#>  $ tfidf_essay0_city      : num [1:750] 0 0 0 0 0 0 0 0 0 0 ...
#>   [list output truncated]
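
To make the four stages above more concrete, here is a minimal base R sketch of the same idea on a toy corpus. It is not the package's implementation: the documents, the stopword list, the tokenization rule and the textbook formula tf × log(N / df) are all assumptions chosen for illustration, and the values will not necessarily match what step_tfidf() produces with its defaults.

```r
# Toy corpus and stopword list, both made up for illustration.
docs <- c(
  "the cat sat on the mat",
  "the dog sat on the log",
  "a cat and a dog"
)
stop_list <- c("the", "on", "a", "and")

# 1. Tokenize: lowercase, then split on runs of non-letters.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# 2. Remove stopwords (and any empty strings left by the split).
tokens <- lapply(tokens, function(x) x[nzchar(x) & !(x %in% stop_list)])

# 3. Keep only the `max_tokens` most frequent tokens across the corpus.
max_tokens <- 3
counts <- sort(table(unlist(tokens)), decreasing = TRUE)
vocab <- sort(names(counts)[seq_len(min(max_tokens, length(counts)))])

# 4a. Term frequency: share of each vocabulary term among the kept
#     tokens of a document (one row per document after transposing).
tf <- t(vapply(tokens, function(x) {
  kept <- factor(x[x %in% vocab], levels = vocab)
  as.numeric(table(kept)) / max(length(kept), 1)
}, numeric(length(vocab))))

# 4b. Inverse document frequency, textbook form: log(N / df).
idf <- log(length(docs) / colSums(tf > 0))

# 4c. TF-IDF: scale each column of tf by its idf weight.
tfidf <- sweep(tf, 2, idf, `*`)
colnames(tfidf) <- paste0("tfidf_text_", vocab)

print(tfidf)
```

The result has one row per document and one numeric column per retained token, which is the same shape as the baked output above (100 columns per essay variable, 200 in total).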

Type chart

textrecipes departs slightly from the design of recipes in that it allows some inputs and outputs to be list columns. To avoid confusion, the table below lists each step with its expected input and output type. Note that a recipe must end with numeric output for downstream modelling to work.

| Step | Input | Output |
|---|---|---|
| step_tokenize() | character | tokenlist() |
| step_untokenize() | tokenlist() | character |
| step_lemma() | tokenlist() | tokenlist() |
| step_stem() | tokenlist() | tokenlist() |
| step_stopwords() | tokenlist() | tokenlist() |
| step_pos_filter() | tokenlist() | tokenlist() |
| step_ngram() | tokenlist() | tokenlist() |
| step_tokenfilter() | tokenlist() | tokenlist() |
| step_tokenmerge() | tokenlist() | tokenlist() |
| step_tfidf() | tokenlist() | numeric |
| step_tf() | tokenlist() | numeric |
| step_texthash() | tokenlist() | numeric |
| step_word_embeddings() | tokenlist() | numeric |
| step_textfeature() | character | numeric |
| step_sequence_onehot() | character | numeric |
| step_lda() | character | numeric |
| step_text_normalization() | character | character |
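
The type flow in the chart can be checked directly by baking a recipe at two different stages. The sketch below assumes recipes and textrecipes are installed; the data frame `toy` and its column `text` are hypothetical names used only for this illustration.

```r
library(recipes)
library(textrecipes)

# Hypothetical toy data; `text` is an illustrative column name.
toy <- data.frame(
  text = c("I like cats", "I like dogs and cats"),
  stringsAsFactors = FALSE
)

# character -> tokenlist(): after step_tokenize() the column is no longer
# a plain character vector but a list-like tokenlist column.
tok_rec <- recipe(~ text, data = toy) %>%
  step_tokenize(text)
tok_out <- bake(prep(tok_rec), new_data = toy)
class(tok_out$text)

# tokenlist() -> numeric: ending with step_tf() gives numeric columns,
# one per token, which is what a model can consume.
tf_rec <- tok_rec %>%
  step_tf(text)
tf_out <- bake(prep(tf_rec), new_data = toy)
sapply(tf_out, is.numeric)
```

Stopping after step_tokenize() leaves a column that most modelling functions cannot use, which is why a recipe should finish with one of the steps whose output is numeric.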

This means that valid sequences include:

recipe(~ ., data = data) %>%
  step_tokenize(text) %>%
  step_stem(text) %>%
  step_stopwords(text) %>%
  step_tokenfilter(text) %>%
  step_tf(text)

# or

recipe(~ ., data = data) %>%
  step_tokenize(text) %>%
  step_stem(text) %>%
  step_tfidf(text)

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Version: 0.3.0

License: MIT + file LICENSE

Maintainer: Emil Hvitfeldt

Last Published: July 8th, 2020

Monthly Downloads: 1,608

Functions in textrecipes (0.3.0)

step_lemma
step_stem
step_stopwords
step_sequence_onehot: Generate the basic set of text features
step_pos_filter
step_ngram: Generate ngrams from tokenlist
step_lda: Calculates lda dimension estimates
step_text_normalization
%>%: Pipe operator
rcpp_ngram: ngram generator
step_tf: Term frequency of tokens
step_tokenmerge: Generate the basic set of text features
textrecipes-package: textrecipes: Extra 'Recipes' for Text Processing
step_word_embeddings: Pretrained word embeddings of tokens
step_tfidf: Term frequency-inverse document frequency of tokens
step_textfeature: Generate the basic set of text features
step_texthash: Term frequency of tokens
step_tokenfilter: Filter the tokens based on term frequency
step_tokenize: Tokenization of character variables
tokenlist: Create tokenlist object
step_untokenize
tunable.step_ngram: tunable methods for step_ngram