step_text_normalization: Normalization of character variables

Description

step_text_normalization creates a specification of a recipe step that will perform Unicode normalization on the selected variables.

Usage

step_text_normalization(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  columns = NULL,
  normalization_form = "nfc",
  skip = FALSE,
  id = rand_id("text_normalization")
)
# S3 method for step_text_normalization
tidy(x, ...)

Arguments

recipe

A recipe object. The step will be added to the sequence of operations for this recipe.

...

One or more selector functions to choose which variables will be transformed. See recipes::selections() for more details. For the tidy method, these are not currently used.

role

Not used by this step since no new variables are created.

trained

A logical to indicate if the step has been trained, that is, whether recipes::prep.recipe() has been run on it.

columns

A character vector of the names of the variables selected by .... This is NULL until the step is trained by recipes::prep.recipe().

normalization_form

A single character string giving the Unicode normalization form to apply. Must be one of "nfc", "nfd", "nfkd", "nfkc", or "nfkc_casefold". Defaults to "nfc". See stringi::stri_trans_nfc() for more details.
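The difference between the forms is easiest to see by calling stringi directly. The sketch below is illustrative only: it assumes each normalization_form value corresponds to the stringi function of the same name (for example "nfc" to stringi::stri_trans_nfc()), and the variable names are made up for this example.

```r
library(stringi)

composed   <- "sch\u00f6n"   # "ö" as one precomposed code point (U+00F6)
decomposed <- "scho\u0308n"  # "o" followed by a combining diaeresis (U+0308)

# The two strings render identically but contain different code points.
stri_length(composed)    # 5 code points
stri_length(decomposed)  # 6 code points

# NFC composes, NFD decomposes (canonical equivalence).
nfc <- stri_trans_nfc(decomposed)
nfd <- stri_trans_nfd(composed)

# NFKC additionally replaces compatibility characters,
# such as the "fi" ligature (U+FB01), with their plain equivalents.
ligature_nfc  <- stri_trans_nfc("\ufb01ne")
ligature_nfkc <- stri_trans_nfkc("\ufb01ne")

# NFKC_Casefold also folds case, which helps caseless matching.
folded <- stri_trans_nfkc_casefold("SCH\u00d6N")
```

After NFC the two spellings of "schön" compare equal, which is usually what is wanted before tokenizing or counting words. The compatibility forms ("nfkd", "nfkc", "nfkc_casefold") are lossy, so they are best used when visually similar variants should be treated as the same text.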

skip

A logical. Should the step be skipped when the recipe is baked by recipes::bake.recipe()? While all operations are baked when recipes::prep.recipe() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.

id

A character string that is unique to this step to identify it.

x

A step_text_normalization object.

Value

An updated version of recipe with the new step added to the sequence of existing steps (if any).

Examples

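if (requireNamespace("stringi", quietly = TRUE)) {
  library(recipes)
  library(textrecipes)  # provides step_text_normalization()

  # "schön" twice: precomposed "ö" (U+00F6), then "o" + combining diaeresis (U+0308)
  sample_data <- tibble(text = c("sch\U00f6n", "scho\U0308n"))

  # Define the recipe; normalization_form defaults to "nfc"
  okc_rec <- recipe(~ ., data = sample_data) %>%
    step_text_normalization(text)

  # Train the step
  okc_obj <- okc_rec %>%
    prep()

  # Inspect the normalized column
  juice(okc_obj, text) %>%
    slice(1:2)

  juice(okc_obj) %>%
    slice(2) %>%
    pull(text)

  # Summarize the step before and after training
  tidy(okc_rec, number = 1)
  tidy(okc_obj, number = 1)
}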
