textrecipes v0.0.2

Extra 'Recipes' for Text Processing

Converting text to numerical features requires specifically created procedures, which are implemented as steps according to the 'recipes' package. These steps allow for tokenization, filtering, counting (tf and tfidf) and feature hashing.

Readme

textrecipes

Lifecycle: maturing

Introduction

textrecipes contains extra steps for the recipes package for preprocessing text data.

Installation

You can install the released version of textrecipes from CRAN with:

install.packages("textrecipes")

Install the development version from GitHub with:

# install.packages("devtools")  # run first if devtools is not installed
devtools::install_github("tidymodels/textrecipes")

Example

The following example goes through the steps needed to convert a character variable to the TF-IDF of its tokenized words, after removing stopwords and keeping only the 100 most used words. The preprocessing is applied to the variables essay0 and essay1.

library(recipes)
library(textrecipes)

data(okc_text)

okc_rec <- recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0, essay1) %>% # Tokenizes to words by default
  step_stopwords(essay0, essay1) %>% # Uses the english snowball list by default
  step_tokenfilter(essay0, essay1, max_tokens = 100) %>%
  step_tfidf(essay0, essay1)

okc_obj <- okc_rec %>%
  prep(training = okc_text)

str(bake(okc_obj, okc_text), list.len = 15)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    750 obs. of  208 variables:
#>  $ essay2                 : Factor w/ 749 levels "- being myself. i'm comfortable in my own skin.<br />\n- cooking, eating and washing dishes<br />\n- sleeping &"| __truncated__,..: 743 574 595 385 109 367 719 721 225 449 ...
#>  $ essay3                 : Factor w/ 737 levels "... is how batman i am.<br />\n<br />\ni'm a huge geek.<br />\n<br />\nrecently i've heard \"you're like a stra"| __truncated__,..: 655 192 523 403 675 698 51 46 417 309 ...
#>  $ essay4                 : Factor w/ 750 levels "- wealth of nations, the social contract, the prince.<br />\n<br />\n- coming to america, willy wonka and the c"| __truncated__,..: 611 634 695 638 104 113 378 86 293 323 ...
#>  $ essay5                 : Factor w/ 750 levels "- a tent<br />\n- a good pillow<br />\n- a funny hat in cold weather<br />\n- genuinely good and trustworthy fr"| __truncated__,..: 344 237 536 271 7 383 128 52 688 750 ...
#>  $ essay6                 : Factor w/ 749 levels "- being happy with simple things.<br />\n- whether lightness is unbearable.<br />\n- how to get to know someone"| __truncated__,..: 466 105 332 215 568 35 506 480 317 326 ...
#>  $ essay7                 : Factor w/ 750 levels "-out to dinner.<br />\n-at the movies.<br />\n-having drinks at a spot where i like the atmosphere.<br />\n-coo"| __truncated__,..: 658 419 50 292 552 248 530 116 144 461 ...
#>  $ essay8                 : Factor w/ 747 levels "-bad news everybody i received a message from the people of 2135,\nthey said the aliens attacked and devastated"| __truncated__,..: 254 704 622 548 709 497 347 298 76 42 ...
#>  $ essay9                 : Factor w/ 743 levels "- <em>you think i'm the bee's knees</em> (although obviously that\nwon't slim down the pool at all)<br />\n- <e"| __truncated__,..: 698 643 540 638 530 137 378 320 17 283 ...
#>  $ tfidf_essay0_also      : num  0 0 0.0213 0.1888 0 ...
#>  $ tfidf_essay0_always    : num  0 0 0 0 0 ...
#>  $ tfidf_essay0_amp       : num  0.457 0.567 0 0 0 ...
#>  $ tfidf_essay0_anything  : num  0 0 0.108 0 0 ...
#>  $ tfidf_essay0_area      : num  0 0 0 0 0 ...
#>  $ tfidf_essay0_around    : num  0 0 0.0327 0 0 ...
#>  $ tfidf_essay0_art       : num  0 0 0 0 0 ...
#>   [list output truncated]
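
As a rough illustration of what the pipeline above computes, the following base R sketch walks through the same four stages on toy data. The documents, the short stopword list, and max_tokens = 2 are made up for the example, and the weighting is the textbook tf * log(N / df) form. step_tfidf() may use different smoothing or normalization, so treat this as an approximation of the idea, not the package's implementation.

```r
# Toy documents standing in for a text column (hypothetical data).
docs <- c(
  "the cat sat on the mat",
  "the dog chased the cat",
  "a dog and a cat"
)

# 1. Tokenize: lower-case and split on whitespace.
tokens <- strsplit(tolower(docs), "\\s+")

# 2. Remove stopwords (a tiny hand-picked list, not the snowball list).
stopwords <- c("the", "on", "a", "and")
tokens <- lapply(tokens, function(x) x[!x %in% stopwords])

# 3. Keep only the most frequent tokens across all documents.
max_tokens <- 2
counts <- sort(table(unlist(tokens)), decreasing = TRUE)
vocab <- names(counts)[seq_len(min(max_tokens, length(counts)))]
tokens <- lapply(tokens, function(x) x[x %in% vocab])

# 4. Term frequency: one row per document, one column per vocabulary term.
tf <- do.call(rbind, lapply(tokens, function(x) {
  as.numeric(table(factor(x, levels = vocab)))
}))
colnames(tf) <- vocab

# 5. Inverse document frequency, then tf-idf as their product.
idf <- log(length(docs) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, `*`)
print(tfidf)
```

A term that occurs in every document ("cat" here) gets an idf of log(1) = 0, which is why common words contribute little to the final features.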

Type chart

textrecipes departs slightly from the design of recipes in that it allows some input and output to be in the form of list-columns. To avoid confusion, here is a table of steps with their expected input and output types. Note that a recipe must end with numeric output for downstream analysis to work.

Step                    Input        Output
step_tokenize()         character    list-column
step_untokenize()       list-column  character
step_stem()             list-column  list-column
step_stopwords()        list-column  list-column
step_tokenfilter()      list-column  list-column
step_tokenmerge()       list-column  list-column
step_tfidf()            list-column  numeric
step_tf()               list-column  numeric
step_texthash()         list-column  numeric
step_textfeature()      character    numeric
step_sequence_onehot()  character    numeric
step_text2vec()         character    numeric
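
To make the three types in the chart concrete, here is a small base R sketch of a character column turned into a list-column and then into numeric and character output. It uses no textrecipes functions. The data frame and column names are hypothetical, and the token filter (dropping words of three characters or fewer) is only a stand-in for the list-column to list-column steps.

```r
# Hypothetical toy data frame with one character column.
df <- data.frame(
  text = c("I like cats", "dogs are fine too"),
  stringsAsFactors = FALSE
)

# character -> list-column: each row now holds a character vector of tokens.
df$tokens <- strsplit(tolower(df$text), " ", fixed = TRUE)

# list-column -> list-column: filtering tokens keeps the list structure.
df$tokens <- lapply(df$tokens, function(x) x[nchar(x) > 3])

# list-column -> numeric: one number per row, usable by a model.
df$n_tokens <- lengths(df$tokens)

# list-column -> character: paste the tokens back into a single string.
df$rejoined <- vapply(df$tokens, paste, character(1), collapse = " ")

str(df)
```

Only the numeric column could be passed directly to a typical modelling function, which is why the chart insists on ending with a numeric output.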

This means that valid sequences include:

recipe(~ ., data = data) %>%
  step_tokenize(text) %>%
  step_stem(text) %>%
  step_stopwords(text) %>%
  step_tokenfilter(text) %>%
  step_tf(text)

# or

recipe(~ ., data = data) %>%
  step_tokenize(text) %>%
  step_stem(text) %>%
  step_tfidf(text)
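
The rule behind these sequences can be sketched as a simple type check: each step's input must match the previous step's output, and the last output must be numeric. The helper below, check_sequence(), is hypothetical and not part of textrecipes. It encodes a subset of the type chart above and follows a single variable through the steps.

```r
# Input and output types for a subset of steps, copied from the type chart.
type_chart <- data.frame(
  step = c("step_tokenize", "step_untokenize", "step_stem", "step_stopwords",
           "step_tokenfilter", "step_tfidf", "step_tf", "step_texthash"),
  input = c("character", "list-column", "list-column", "list-column",
            "list-column", "list-column", "list-column", "list-column"),
  output = c("list-column", "character", "list-column", "list-column",
             "list-column", "numeric", "numeric", "numeric"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if the steps chain correctly and end in numeric.
check_sequence <- function(steps, start = "character") {
  current <- start
  for (s in steps) {
    row <- type_chart[type_chart$step == s, ]
    if (nrow(row) != 1) stop("unknown step: ", s)
    # The step cannot consume what the previous step produced.
    if (row$input != current) return(FALSE)
    current <- row$output
  }
  current == "numeric"
}

check_sequence(c("step_tokenize", "step_stem", "step_tfidf"))
check_sequence(c("step_stem", "step_tfidf"))
check_sequence(c("step_tokenize", "step_stem"))
```

The first call describes a valid sequence. The second fails because step_stem() expects a list-column but receives a character variable, and the third fails because it stops at a list-column instead of numeric output.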

Functions in textrecipes

Name                  Description
step_sequence_onehot  Generate the basic set of text features
count_functions       Counting functions from textfeatures
step_tf               Term frequency of tokens
%>%                   Pipe operator
step_tokenmerge       Generate the basic set of text features
okc_text              OkCupid Text Data
step_tokenfilter      Filter the tokens based on term frequency
step_textfeature      Generate the basic set of text features
step_texthash         Term frequency of tokens
step_tfidf            Term frequency-inverse document frequency of tokens
step_tokenize         Tokenization of character variables
step_untokenize       Untokenization of list-column variables
step_stem             Stemming of list-column variables
step_word2vec         Calculates word2vec dimension estimates
step_stopwords        Filtering of stopwords from a list-column variable

Vignettes of textrecipes

Name
cookbook---using-more-complex-recipes-involving-text.Rmd

Details

License             MIT + file LICENSE
URL                 https://github.com/tidymodels/textrecipes
BugReports          https://github.com/tidymodels/textrecipes/issues
VignetteBuilder     knitr
RdMacros            lifecycle
Encoding            UTF-8
LazyData            true
RoxygenNote         6.1.1
SystemRequirements  GNU make, C++11
NeedsCompilation    no
Packaged            2019-09-06 21:42:42 UTC; emilhvitfeldthansen
Repository          CRAN
Date/Publication    2019-09-07 11:20:02 UTC
