When selecting variables or model terms in step functions, dplyr-like
tools are used. The selector functions can choose variables based on
their name, current role, data type, or any combination of these. The
selectors are passed as any other argument to the step. If the variables
are explicitly stated in the step function, this might be similar to:
recipe( ~ ., data = USArrests) %>% step_pca(Murder, Assault, UrbanPop, Rape, num = 3)
The first four arguments indicate which variables should be used in the
PCA while the last argument is a specific argument to step_pca.
Note that:
The selector arguments should not contain functions beyond those supported (see below).
These arguments are not evaluated until the prep function for the step is executed.
The dplyr-like syntax allows for negative signs to exclude variables (e.g. -Murder) and the set of selectors will be processed in order.
A leading exclusion in these arguments (e.g. -Murder) has the effect of adding all variables to the list except the excluded variable(s).
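The last two notes can be illustrated with a minimal sketch. It assumes the recipes package is installed and that prep() accepts training and retain arguments and juice() returns the processed training set; newer releases may prefer bake() for that last step. The object names rec, prepped, and centered are only illustrative.

```r
library(recipes)

# Leading exclusion: every column except Murder is selected for centering.
rec <- recipe(~ ., data = USArrests) %>%
  step_center(-Murder)

# The selector is only resolved here, when the recipe is prepared.
prepped <- prep(rec, training = USArrests, retain = TRUE)
centered <- juice(prepped)

# Column means after processing; Murder should keep its original mean.
round(colMeans(centered), 10)
```

Because the selector is stored unevaluated, the same recipe definition could be prepared on a different training set with the same column names.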
Select helpers from the dplyr package can also be used: starts_with,
ends_with, contains, matches, num_range, and everything.
For example:
recipe(Species ~ ., data = iris) %>% step_center(starts_with("Sepal"), -contains("Width"))
would only select Sepal.Length.
Inline functions that specify computations, such as log(x), should not
be used in selectors and will produce an error. A list of allowed
selector functions is below.
Columns of the design matrix that may not exist when the step is coded
can also be selected. For example, when using step_pca, the number of
columns created by feature extraction may not be known when subsequent
steps are defined. In this case, using matches("^PC") will select all of
the columns whose names start with "PC" once those columns are created.
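A minimal sketch of selecting columns that do not exist yet is given here. It assumes the recipes package is installed, relies on the default number of components for step_pca, and uses prep() with retain = TRUE followed by juice() to retrieve the processed training set; the object names are illustrative.

```r
library(recipes)

rec <- recipe(~ ., data = USArrests) %>%
  # Replace the four original columns with principal component scores.
  step_pca(Murder, Assault, UrbanPop, Rape) %>%
  # No "PC" columns exist yet; the selector is resolved during prep().
  step_center(matches("^PC"))

prepped <- prep(rec, training = USArrests, retain = TRUE)
pcs <- juice(prepped)

names(pcs)
```

The second step never names the component columns directly, so it keeps working if the number of components changes.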
There are sets of functions that can be used to select variables based on
their role or type: has_role and has_type. For convenience, there are
also functions that are more specific: all_numeric, all_nominal,
all_predictors, and all_outcomes. These can be used in conjunction with
the previous functions described for selecting variables using their
names:
data(biomass)
recipe(HHV ~ ., data = biomass) %>% step_center(all_numeric(), -all_outcomes())
This results in all the numeric predictors: carbon, hydrogen, oxygen, nitrogen, and sulfur.
If a role for a variable has not been defined, it will never be selected using role-specific selectors.
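As a rough illustration of role-based selection, the sketch below inspects the roles a formula assigns and then selects by role. It uses the built-in iris data rather than biomass, and assumes that summary() on a recipe returns one row per variable with variable and role columns, and that prep() and juice() behave as in the recipes package; the object names are illustrative.

```r
library(recipes)

rec <- recipe(Species ~ ., data = iris)

# One row per variable, with the role assigned by the formula.
var_info <- summary(rec)
var_info[, c("variable", "role")]

# Role-based selection: only columns whose role is "predictor" are centered.
centered <- rec %>%
  step_center(has_role("predictor")) %>%
  prep(training = iris, retain = TRUE) %>%
  juice()
```

Species has the outcome role here, so the selector leaves it untouched.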
All steps use these techniques to select variables except one:
step_interact requires traditional model formula representations of the
interactions and takes a single formula as the argument to select the
variables.
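A minimal sketch of the formula interface of step_interact follows. It assumes the recipes package is installed and that the one-sided formula can be passed as the first argument after the recipe; the name of the generated column is not assumed and is looked up from the result instead.

```r
library(recipes)

rec <- recipe(Sepal.Length ~ ., data = iris) %>%
  # A one-sided model formula replaces the usual selectors.
  step_interact(~ Sepal.Width:Petal.Length)

prepped <- prep(rec, training = iris, retain = TRUE)
with_interaction <- juice(prepped)

# Name(s) of the column(s) added by the interaction step.
new_cols <- setdiff(names(with_interaction), names(iris))
new_cols
```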
The complete list of allowable functions in steps:
By name: starts_with, ends_with, contains, matches, num_range, and everything
By role: has_role, all_predictors, and all_outcomes
By type: has_type, all_numeric, and all_nominal