textrecipes (version 0.2.1)

step_ngram: Generate ngrams from tokenlist

Description

step_ngram creates a specification of a recipe step that will convert a tokenlist into a list of n-grams of tokens.

Usage

step_ngram(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  columns = NULL,
  num_tokens = 3L,
  delim = "_",
  skip = FALSE,
  id = rand_id("ngram")
)

# S3 method for step_ngram
tidy(x, ...)

Arguments

recipe

A recipe object. The step will be added to the sequence of operations for this recipe.

...

One or more selector functions to choose variables. For step_ngram, this indicates the tokenlist variables whose tokens will be combined into n-grams. See recipes::selections() for more details. For the tidy method, these are not currently used.

role

Not used by this step since no new variables are created.

trained

A logical to indicate if the quantities for preprocessing have been estimated, i.e. whether the step has been trained by recipes::prep.recipe().

columns

A list of tibble results that define the encoding. This is NULL until the step is trained by recipes::prep.recipe().

num_tokens

The number of tokens in the n-gram. This must be an integer greater than or equal to 1. Defaults to 3.

delim

The separator between words in an n-gram. Defaults to "_".

skip

A logical. Should the step be skipped when the recipe is baked by recipes::bake.recipe()? While all operations are baked when recipes::prep.recipe() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.

id

A character string that is unique to this step to identify it.

x

A step_ngram object.
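
To make the num_tokens and delim arguments concrete, here is a minimal base R sketch of how sliding n-grams can be built from a single vector of tokens. The helper make_ngrams() is hypothetical and is not part of textrecipes; it only illustrates the idea and does not reproduce the package's implementation. Returning an empty vector when there are fewer tokens than num_tokens is an assumption of this sketch.

```r
# Hypothetical helper, not part of textrecipes: builds sliding n-grams
# from one character vector of tokens using base R only.
make_ngrams <- function(tokens, num_tokens = 3L, delim = "_") {
  stopifnot(num_tokens >= 1L)
  n_out <- length(tokens) - num_tokens + 1L
  # Assumption of this sketch: too few tokens yields an empty vector.
  if (n_out < 1L) {
    return(character(0))
  }
  # Join each window of num_tokens consecutive tokens with delim.
  vapply(
    seq_len(n_out),
    function(i) paste(tokens[i:(i + num_tokens - 1L)], collapse = delim),
    character(1)
  )
}

tokens <- c("i", "like", "long", "walks")

# Trigrams joined by "_", mirroring the defaults of step_ngram()
trigrams <- make_ngrams(tokens)

# Bigrams joined by a space
bigrams <- make_ngrams(tokens, num_tokens = 2L, delim = " ")
```

With these four tokens, trigrams holds "i_like_long" and "like_long_walks", while bigrams holds "i like", "like long" and "long walks".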

Value

An updated version of recipe with the new step added to the sequence of existing steps (if any).

See Also

step_tokenize() to turn character into tokenlist.

Other tokenlist to tokenlist steps: step_lemma(), step_pos_filter(), step_stem(), step_stopwords(), step_tokenfilter(), step_tokenmerge()

Examples

# NOT RUN {
library(recipes)
library(textrecipes)  # provides step_tokenize() and step_ngram()
library(modeldata)
data(okc_text)

okc_rec <- recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0) %>%
  step_ngram(essay0)
  
okc_obj <- okc_rec %>%
  prep()

juice(okc_obj, essay0) %>% 
  slice(1:2)

juice(okc_obj) %>% 
  slice(2) %>% 
  pull(essay0) 
  
tidy(okc_rec, number = 2)
tidy(okc_obj, number = 2)
# }