recipe: Create a Recipe for Preprocessing Data

Description

A recipe is a description of the steps to be applied to a data set in order to prepare it for data analysis.

Usage

recipe(x, ...)
# S3 method for default
recipe(x, ...)
# S3 method for data.frame
recipe(x, formula = NULL, ..., vars = NULL, roles = NULL)
# S3 method for formula
recipe(formula, data, ...)
# S3 method for matrix
recipe(x, ...)

Arguments

x, data

A data frame or tibble of the template data set (see below).

...

Further arguments passed to or from other methods (not currently used).

formula

A model formula. No in-line functions should be used here (e.g. log(x), x:y, etc.) and minus signs are not allowed. These types of transformations should be enacted using step functions in this package. Dots are allowed as are simple multivariate outcome terms (i.e. no need for cbind; see Examples). A model formula may not be the best choice for high-dimensional data with many columns, because of problems with memory.
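The restriction on in-line functions can be illustrated with a minimal sketch. It assumes the recipes package is installed and uses the built-in mtcars data purely as a stand-in; the choice of columns is arbitrary and not taken from this page.

```r
library(recipes)

# Not supported in a recipe formula: recipe(mpg ~ log(disp), data = mtcars)
# Instead, name the raw column in the formula and add the transformation as a step.
rec <- recipe(mpg ~ disp + hp, data = mtcars) %>%
  step_log(disp)

# A dot on the right-hand side selects all remaining columns as predictors.
rec_all <- recipe(mpg ~ ., data = mtcars)
role_table <- summary(rec_all)
role_table
```

Here summary() is used only to list the variables and the roles that the formula assigned to them.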

vars

A character vector of column names corresponding to variables that will be used in any context (see below).

roles

A character vector (the same length as vars) that describes the single role that each variable will take. This value could be anything, but common roles are "outcome", "predictor", "case_weight", or "ID".
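As a rough sketch of how vars and roles pair up in the data frame method, the following uses the built-in mtcars data as an assumed example; the selected columns and their roles are illustrative only.

```r
library(recipes)

# vars and roles are matched by position, so they must have the same length.
rec <- recipe(
  mtcars,
  vars  = c("mpg", "disp", "hp"),
  roles = c("outcome", "predictor", "predictor")
)

# Only the listed columns are recorded in the recipe.
role_table <- summary(rec)
role_table
```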

Value

An object of class recipe with sub-objects:

var_info

A tibble containing information about the original data set columns.

term_info

A tibble that contains the current set of terms in the data set. This initially defaults to the same data contained in var_info.

steps

A list of step or check objects that define the sequence of preprocessing operations that will be applied to data. The default value is NULL.

template

A tibble of the data. This is initialized to be the same as the data given in the data argument but can be different after the recipe is trained.
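The sub-objects can be inspected directly on a recipe. The sketch below assumes the built-in mtcars data and accesses the elements with `$`, which relies on the recipe being a list-like object as described above.

```r
library(recipes)

rec <- recipe(mpg ~ disp + hp, data = mtcars)

rec$var_info    # one row per original column, with its role
rec$term_info   # initially the same information as var_info
rec$steps       # NULL until a step or check is added
rec$template    # tibble holding the template data

# Adding a step appends one element to the steps list.
rec_logged <- rec %>% step_log(disp)
length(rec_logged$steps)
```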

Details

Recipes are alternative methods for creating design matrices and for preprocessing data.

Variables in recipes can have any type of role in subsequent analyses such as: outcome, predictor, case weights, stratification variables, etc.

recipe objects can be created in several ways. If the analysis only contains outcomes and predictors, the simplest way to create one is to use a simple formula (e.g. y ~ x1 + x2) that does not contain inline functions such as log(x3). An example is given below.

Alternatively, a recipe object can be created by first specifying which variables in a data set should be used and then sequentially defining their roles (see the last example). This alternative is an excellent choice when the number of variables is very high, as the formula method is memory-inefficient with many variables.

There are two different types of operations that can be sequentially added to a recipe. Steps can include common operations like logging a variable, creating dummy variables or interactions and so on. More computationally complex actions such as dimension reduction or imputation can also be specified. Checks are operations that conduct specific tests of the data. When the test is satisfied, the data are returned without issue or modification. Otherwise, an error is thrown.
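The difference between a step and a check can be sketched as follows. This is an illustration under assumptions: it uses the built-in mtcars data, step_log() as the step, and check_missing() as the check, and it injects a missing value by hand to trigger the failure.

```r
library(recipes)

rec <- recipe(mpg ~ disp + hp, data = mtcars) %>%
  step_log(disp) %>%                  # a step: modifies the data
  check_missing(all_predictors())     # a check: tests the data, leaves it unchanged

trained <- prep(rec, training = mtcars)

# Clean data pass the check and come back with disp on the log scale.
ok <- bake(trained, new_data = mtcars)

# Data with a missing predictor value fail the check, so bake() signals an error.
bad <- mtcars
bad$hp[1] <- NA
result <- tryCatch(
  bake(trained, new_data = bad),
  error = function(e) "check failed"
)
result
```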

Once a recipe has been defined, the prep() function can be used to estimate quantities required for the operations using a data set (a.k.a. the training data). prep() returns another recipe.

To apply the recipe to a data set, the bake() function is used in the same manner as predict() would be for models. This applies the trained steps to any data set.

Note that the data passed to recipe() need not be the complete data that will be used to train the steps (by prep()). The recipe only needs to know the names and types of the columns that will be used. For large data sets, head() could be used to pass the recipe a smaller data set to save time and memory.
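A minimal sketch of this pattern, assuming the built-in mtcars data stands in for a much larger data set: the recipe is declared on head(mtcars), and the full data are supplied only when prep() is called.

```r
library(recipes)

# The template only needs the column names and types, so a few rows suffice.
rec <- recipe(mpg ~ disp + hp, data = head(mtcars)) %>%
  step_normalize(all_numeric_predictors())

# The statistics are estimated from the data given to prep(), not from the template.
trained <- prep(rec, training = mtcars)
baked <- bake(trained, new_data = mtcars)

# Centering and scaling used all rows of mtcars.
c(mean = mean(baked$disp), sd = sd(baked$disp))
```

Passing training explicitly matters here: without it, prep() would fall back to the small template stored in the recipe.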

Examples


# NOT RUN {
###############################################
# simple example:
library(recipes)
library(modeldata)
data(biomass)

# split data
biomass_tr <- biomass[biomass$dataset == "Training",]
biomass_te <- biomass[biomass$dataset == "Testing",]

# When there are only predictors and outcomes, a simple formula can be used.
rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
              data = biomass_tr)

# Now add preprocessing steps to the recipe.

sp_signed <- rec %>%
  step_normalize(all_numeric_predictors()) %>%
  step_spatialsign(all_numeric_predictors())
sp_signed

# now estimate required parameters
sp_signed_trained <- prep(sp_signed, training = biomass_tr)
sp_signed_trained

# apply the preprocessing to a data set
test_set_values <- bake(sp_signed_trained, new_data = biomass_te)

# or use pipes for the entire workflow:
rec <- biomass_tr %>%
  recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_spatialsign(all_numeric_predictors())

###############################################
# multivariate example

# no need for `cbind(carbon, hydrogen)` for left-hand side
multi_y <- recipe(carbon + hydrogen ~ oxygen + nitrogen + sulfur,
                  data = biomass_tr)
multi_y <- multi_y %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())

multi_y_trained <- prep(multi_y, training = biomass_tr)

results <- bake(multi_y_trained, new_data = biomass_te)

###############################################
# example with manually updating different roles

# best choice for high-dimensional data:

rec <- recipe(biomass_tr) %>%
  update_role(carbon, hydrogen, oxygen, nitrogen, sulfur,
              new_role = "predictor") %>%
  update_role(HHV, new_role = "outcome") %>%
  update_role(sample, new_role = "id variable") %>%
  update_role(dataset, new_role = "splitting indicator")
rec
# }
