textrecipes (version 0.0.2)

step_tokenize: Tokenization of character variables

Description

`step_tokenize` creates a *specification* of a recipe step that will convert a character predictor into a list of tokens.

Usage

step_tokenize(recipe, ..., role = NA, trained = FALSE,
  columns = NULL, options = list(), token = "words",
  custom_token = NULL, skip = FALSE, id = rand_id("tokenize"))

# S3 method for step_tokenize
tidy(x, ...)

Arguments

recipe

A recipe object. The step will be added to the sequence of operations for this recipe.

...

One or more selector functions to choose variables. For `step_tokenize`, this indicates the variables to be encoded into a list column. See [recipes::selections()] for more details. For the `tidy` method, these are not currently used.
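
For instance, columns can be selected by name or with a selector (a hedged sketch; `starts_with()` is one of the tidyselect helpers available in recipes selections, and the second call assumes the other essay columns in `okc_text` share the "essay" prefix):

# select a single column by name
recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0)

# select several columns with a selector
recipe(~ ., data = okc_text) %>%
  step_tokenize(starts_with("essay"))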

role

Not used by this step since no new variables are created.

trained

A logical to indicate if the step has been trained by [recipes::prep.recipe()].

columns

A list of tibble results that define the encoding. This is `NULL` until the step is trained by [recipes::prep.recipe()].

options

A list of options passed to the tokenizer.
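
The valid options depend on which tokenizer `token` selects, since they are passed on to the corresponding function in the `tokenizers` package. A hedged sketch, assuming the default `token = "words"` and a version of [tokenizers::tokenize_words()] that has `lowercase` and `strip_punct` arguments:

# keep the original case and punctuation of the tokens
recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0,
                options = list(lowercase = FALSE,
                               strip_punct = FALSE))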

token

Unit for tokenizing. Built-in options from the [tokenizers] package are "words" (default), "characters", "character_shingles", "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", "regex", "tweets" (tokenization by word that preserves usernames, hashtags, and URLs), "ptb" (Penn Treebank), and "word_stems".
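
For example, bigrams can be requested by combining `token` and `options` (a minimal sketch; `n` is an argument of [tokenizers::tokenize_ngrams()]):

# tokenize into 2-grams instead of single words
recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0,
                token = "ngrams",
                options = list(n = 2))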

custom_token

User-supplied tokenizer. Supplying this argument overrides the `token` argument. The function must take a character vector as input and return a list of character vectors.
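
A minimal sketch of a user-supplied tokenizer; the helper below is hypothetical and simply splits on whitespace, but it fulfills the required contract (character vector in, list of character vectors out):

# a hypothetical whitespace tokenizer; strsplit() already
# returns a list of character vectors
split_on_space <- function(x) strsplit(x, " +")

recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0, custom_token = split_on_space)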

skip

A logical. Should the step be skipped when the recipe is baked by [recipes::bake.recipe()]? While all operations are baked when [recipes::prep.recipe()] is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using `skip = TRUE` as it may affect the computations for subsequent operations.

id

A character string that is unique to this step to identify it.

x

A `step_tokenize` object.

Value

An updated version of `recipe` with the new step added to the sequence of existing steps (if any).

Details

Tokenization is the act of splitting a character string into smaller parts to be further analysed. This step uses the `tokenizers` package, which includes heuristics for splitting text into paragraph tokens, word tokens, and others. `textrecipes` keeps the tokens in a list-column, and other steps will do their tasks on those list-columns before transforming them back to numeric variables.
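
To see what such a list-column holds, it can help to call the underlying tokenizer directly (a sketch using [tokenizers::tokenize_words()]; the input strings are made up):

library(tokenizers)

# each input string becomes one list element containing its tokens
tokenize_words(c("i like cats", "dogs are great too"))
#> [[1]]
#> [1] "i"    "like" "cats"
#>
#> [[2]]
#> [1] "dogs"  "are"   "great" "too"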

Working with `textrecipes` will always start by calling `step_tokenize`, followed by modifying and filtering steps.
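
A hedged sketch of such a pipeline, assuming the companion steps `step_tokenfilter()` and `step_tf()` from this package:

# tokenize, keep only the most frequent tokens, then compute
# term-frequency variables
recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0) %>%
  step_tokenfilter(essay0, max_tokens = 100) %>%
  step_tf(essay0)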

See Also

[step_untokenize]

Examples

library(recipes)

data(okc_text)

okc_rec <- recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0) 
  
okc_obj <- okc_rec %>%
  prep(training = okc_text, retain = TRUE)

juice(okc_obj, essay0) %>%
  slice(1:2)

juice(okc_obj) %>%
  slice(2) %>%
  pull(essay0)
  
tidy(okc_rec, number = 1)
tidy(okc_obj, number = 1)

okc_obj_chars <- recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0, token = "characters") %>%
  prep(training = okc_text, retain = TRUE)

juice(okc_obj_chars) %>%
  slice(2) %>%
  pull(essay0)
