textrecipes

Introduction

textrecipes contains extra steps for the recipes package for preprocessing text data.

Installation

You can install the released version of textrecipes from CRAN with:

install.packages("textrecipes")

Install the development version from GitHub with:

# install.packages("pak")
pak::pak("tidymodels/textrecipes")

Example

In the following example we will go through the steps needed to convert a character variable to the TF-IDF of its tokenized words, after removing stopwords and limiting ourselves to only the 10 most used words. The preprocessing will be conducted on the variables medium and artist.

library(recipes)
library(textrecipes)
library(modeldata)

data("tate_text")

tate_rec <- recipe(~ medium + artist, data = tate_text) %>%
  step_tokenize(medium, artist) %>%
  step_stopwords(medium, artist) %>%
  step_tokenfilter(medium, artist, max_tokens = 10) %>%
  step_tfidf(medium, artist)

tate_obj <- tate_rec %>%
  prep()

str(bake(tate_obj, tate_text))
#> tibble [4,284 × 20] (S3: tbl_df/tbl/data.frame)
#>  $ tfidf_medium_colour     : num [1:4284] 2.31 0 0 0 0 ...
#>  $ tfidf_medium_etching    : num [1:4284] 0 0.86 0.86 0.86 0 ...
#>  $ tfidf_medium_gelatin    : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_medium_lithograph : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_medium_paint      : num [1:4284] 0 0 0 0 2.35 ...
#>  $ tfidf_medium_paper      : num [1:4284] 0 0.422 0.422 0.422 0 ...
#>  $ tfidf_medium_photograph : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_medium_print      : num [1:4284] 0 0 0 0 0 ...
#>  $ tfidf_medium_screenprint: num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_medium_silver     : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_akram      : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_beuys      : num [1:4284] 0 0 0 0 0 ...
#>  $ tfidf_artist_ferrari    : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_john       : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_joseph     : num [1:4284] 0 0 0 0 0 ...
#>  $ tfidf_artist_león       : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_richard    : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_schütte    : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_thomas     : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ tfidf_artist_zaatari    : num [1:4284] 0 0 0 0 0 0 0 0 0 0 ...
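Once prepped, a recipe can be applied to new observations as well. Below is a minimal sketch of this pattern, assuming the recipes, textrecipes, and modeldata packages are installed; it trains a reduced recipe on tate_text and then bakes a handful of rows as if they were new data:

```r
library(recipes)
library(textrecipes)
library(modeldata)

data("tate_text")

# A reduced recipe on the medium variable only
rec <- recipe(~medium, data = tate_text) %>%
  step_tokenize(medium) %>%
  step_tokenfilter(medium, max_tokens = 10) %>%
  step_tfidf(medium)

# prep() estimates the required statistics from the training data
trained <- prep(rec)

# bake() applies the trained preprocessing to any data with the same columns
new_rows <- tate_text[1:5, ]
bake(trained, new_data = new_rows)
```

Because the token filter and IDF weights were learned during prep(), baking new data reuses those estimates rather than recomputing them.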

Breaking changes

As of version 0.4.0, step_lda() no longer accepts character variables and instead takes tokenlist variables.

The following recipe

recipe(~text_var, data = data) %>%
  step_lda(text_var)

can be replaced with the following recipe to achieve the same results:

lda_tokenizer <- function(x) text2vec::word_tokenizer(tolower(x))
recipe(~text_var, data = data) %>%
  step_tokenize(text_var,
    custom_token = lda_tokenizer
  ) %>%
  step_lda(text_var)
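To check what tokens step_lda() will receive after this migration, the package's show_tokens() helper can be used on the unprepped recipe. A small sketch follows; the data frame and its text_var column are made up for illustration, and text2vec is assumed to be installed:

```r
library(recipes)
library(textrecipes)

# Toy data; text_var is a hypothetical column name
df <- tibble::tibble(text_var = c("Hello world", "Tokenize me please"))

# Same tokenizer as in the migration example above
lda_tokenizer <- function(x) text2vec::word_tokenizer(tolower(x))

# show_tokens() displays the tokens each row produces
recipe(~text_var, data = df) %>%
  step_tokenize(text_var, custom_token = lda_tokenizer) %>%
  show_tokens(text_var)
```

This makes it easy to confirm that the custom tokenizer lowercases and splits the text as expected before LDA estimates are computed.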

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.


Version

1.1.0

License

MIT + file LICENSE


Maintainer

Emil Hvitfeldt

Last Published

March 18th, 2025

Functions in textrecipes (1.1.0)

step_stem

Stemming of Token Variables
step_dummy_hash

Indicator Variables via Feature Hashing
step_ngram

Generate n-grams From Token Variables
step_lemma

Lemmatization of Token Variables
step_sequence_onehot

Positional One-Hot encoding of Tokens
step_pos_filter

Part of Speech Filtering of Token Variables
step_clean_names

Clean Variable Names
step_lda

Calculate LDA Dimension Estimates of Tokens
step_text_normalization

Normalization of Character Variables
step_textfeature

Calculate Set of Text Features
step_tokenmerge

Combine Multiple Token Variables Into One
step_tfidf

Term Frequency-Inverse Document Frequency of Tokens
step_tokenfilter

Filter Tokens Based on Term Frequency
step_untokenize

Untokenization of Token Variables
step_texthash

Feature Hashing of Tokens
step_tf

Term Frequency of Tokens
step_stopwords

Filtering of Stop Words for Token Variables
textrecipes-package

textrecipes: Extra 'Recipes' for Text Processing
step_word_embeddings

Pretrained Word Embeddings of Tokens
step_tokenize_sentencepiece

Sentencepiece Tokenization of Character Variables
tokenlist

Create Token Object
step_tokenize_bpe

BPE Tokenization of Character Variables
step_tokenize

Tokenization of Character Variables
step_tokenize_wordpiece

Wordpiece Tokenization of Character Variables
tunable.step_dummy_hash

tunable methods for textrecipes
%>%

Pipe operator
show_tokens

Show token output of recipe
step_clean_levels

Clean Categorical Levels
all_tokenized

Role Selection
ngram

N-gram generator
required_pkgs.step_clean_levels

S3 methods for tracking which additional packages are needed for steps.
emoji_samples

Sample sentences with emojis
reexports

Objects exported from other packages
count_functions

List of all feature counting functions