- d
A data frame
- ...
Optional. Columns to be ignored in preparation and model training,
e.g. ID columns. Unquoted; any number of columns can be included here.
- outcome
Optional. Unquoted column name that indicates the target
variable. If provided, argument must be named. If this target is 0/1, it
will be coerced to Y/N if factor_outcome is TRUE; other manipulation steps
will not be applied to the outcome.
- recipe
Optional. Recipe for how to prep d. In model deployment, pass
the output from this function in training to this argument in deployment to
prepare the deployment data identically to how the training data was
prepared. If training data is big, pull the recipe from the "recipe"
attribute of the prepped training data frame and pass that to this
argument. If present, all following arguments will be ignored.
- remove_near_zero_variance
Logical or numeric. If TRUE (default),
columns with near-zero variance will be removed. These columns are either a
single value, or the most common value is much more frequent than the
second most common value. Example: In a column with 120 "Male" and 2
"Female", the frequency ratio is 0.0167. It would be excluded by default or
if `remove_near_zero_variance` > 0.0166. If numeric, the value must lie
between 0 and 1; larger values remove more columns.
- convert_dates
Logical or character. If TRUE (default), date and time
columns are transformed to circular representation for hour, day, month,
and year for machine learning optimization. If FALSE, date and time columns
are removed. If character, use "continuous" (same as TRUE), "categories",
or "none" (same as FALSE). "categories" makes hour, day, month, and year
readable for interpretation. If make_dummies is TRUE, each unique
value in these features will become a new dummy variable. This will create
wide data, which is more challenging for some machine learning models. All
features with the DTS suffix will be treated as a date.
- impute
Logical or list. If TRUE (default), columns will be imputed
using mean (numeric), and new category (nominal). If FALSE, data will not
be imputed. If this is a list, it must be named, with possible entries for
`numeric_method`, `nominal_method`, `numeric_params`, `nominal_params`,
which are passed to hcai_impute.
- collapse_rare_factors
Logical or numeric. If TRUE (default), factor
levels representing less than 3 percent of observations will be collapsed
into a new category, `other`. If numeric, must be between 0 and 1, and is the
proportion of observations below which levels will be grouped into `other`.
See `recipes::step_other`.
- PCA
Integer or logical. PCA reduces training time, particularly for
wide datasets, though it renders models less interpretable. If integer,
represents the number of principal components to convert the numeric data
into. If TRUE, will convert numeric data into 5 principal components. PCA
requires that data is centered and scaled and will set those params to
TRUE. Default is FALSE.
- center
Logical. If TRUE, numeric columns will be centered to have a
mean of 0. Default is FALSE, unless PCA is performed, in which case it is
TRUE.
- scale
Logical. If TRUE, numeric columns will be scaled to have a
standard deviation of 1. Default is FALSE, unless PCA is performed, in
which case it is TRUE.
- make_dummies
Logical or list. If TRUE (default), dummy columns will be
created for categorical variables. When dummy columns are created, columns
are not created for reference levels. By default, the levels are reassigned
so the mode value is the reference level. If a named list is provided,
those values will replace the reference levels. See the example for details.
- add_levels
Logical. If TRUE (default), "other" and "missing" will be
added to all nominal columns. This is protective in deployment: new levels
found in deployment will become "other" and missingness in deployment can
become "missing" if the nominal imputation method is "new_category". If
FALSE, "other" will still be added to all nominal variables if
collapse_rare_factors is used, and "missing" may be added
depending on details of imputation.
- logical_to_numeric
Logical. If TRUE (default), logical variables will
be converted to 0/1 integer variables.
- factor_outcome
Logical. If TRUE (default) and if all entries in
outcome are 0 or 1 they will be converted to factor with levels N and Y for
classification. Note that which level is the positive class is set in
training functions rather than here.
- no_prep
Logical. If TRUE, overrides all other arguments to FALSE so
that d is returned unmodified, except that character variables may be
converted to factors and a tibble will be returned even if the input was
a non-tibble data frame.
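
The training and deployment workflow described under `recipe` can be sketched as follows. This is a minimal sketch, not a definitive usage: it assumes these arguments belong to `prep_data` in the healthcareai package, and it uses the `pima_diabetes` dataset that ships with that package (with its `patient_id` and `diabetes` columns) purely for illustration. None of those names appear in the argument descriptions above.

```r
library(healthcareai)

# Assumption: pima_diabetes is the example dataset bundled with healthcareai.
d_train <- pima_diabetes[1:700, ]
d_deploy <- pima_diabetes[701:768, ]

# Training: ignore the ID column via ..., and name the outcome argument.
d_train_prepped <- prep_data(d_train, patient_id, outcome = diabetes)

# Deployment: pass the prepped training data as the recipe so the new data
# is prepared identically to the training data.
d_deploy_prepped <- prep_data(d_deploy, recipe = d_train_prepped)

# If the training data is large, pass only the recipe attribute instead.
rec <- attr(d_train_prepped, "recipe")
d_deploy_prepped2 <- prep_data(d_deploy, recipe = rec)
```

Because all later arguments are ignored once `recipe` is supplied, options such as `impute` or `center` only need to be set in the training call.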
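
The frequency-ratio arithmetic from the `remove_near_zero_variance` entry can be reproduced in base R. This is only a worked illustration of the 120 "Male" / 2 "Female" example; the `threshold` values are hypothetical, and the comparison follows the rule stated above (a column is removed when the threshold exceeds its ratio), not the package's internal code.

```r
# Counts from the example: 120 "Male" and 2 "Female".
counts <- c(Male = 120, Female = 2)

# Ratio of the second most common value to the most common value.
sorted <- sort(counts, decreasing = TRUE)
freq_ratio <- unname(sorted[2] / sorted[1])
round(freq_ratio, 4)  # 0.0167

# Hypothetical thresholds: the column is removed when threshold > ratio.
removed_at_0.02 <- 0.02 > freq_ratio  # TRUE: column would be removed
removed_at_0.01 <- 0.01 > freq_ratio  # FALSE: column would be kept
```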
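
To show why `convert_dates` describes a "circular representation" as useful for machine learning, here is a small base R sketch of one common circular encoding: a sine/cosine pair. Whether the function uses exactly this transformation internally is an assumption; the sketch only illustrates the general idea that hour 23 and hour 0 end up close together.

```r
# Hours of the day, including both ends of the cycle.
hours <- c(0, 6, 12, 18, 23)

# Map each hour onto the unit circle (period of 24 hours).
hour_sin <- sin(2 * pi * hours / 24)
hour_cos <- cos(2 * pi * hours / 24)

# Euclidean distance between two encoded hours, given their positions.
circ_dist <- function(i, j) {
  sqrt((hour_sin[i] - hour_sin[j])^2 + (hour_cos[i] - hour_cos[j])^2)
}

dist_23_vs_0 <- circ_dist(5, 1)  # adjacent on the clock, so small
dist_12_vs_0 <- circ_dist(3, 1)  # opposite sides of the clock, so largest
```

With a plain numeric hour column, 23 and 0 would be 23 units apart even though they are neighbors on the clock; the encoded distance reflects the true adjacency.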