This function is mainly used by standardWF and timeseriesWF to let
users of these two standard workflows specify data pre-processing
steps. These are steps to be applied to the different train and test
samples involved in an experimental comparison, before any model is
learned or any predictions are obtained. The function can also be used
outside of these standard workflows to obtain pre-processed versions
of train and test samples.
The function accepts as pre-processing steps both a set of already
implemented functions and any function defined by the user, provided
the latter follows a simple protocol. Namely, a user-defined
pre-processing function will be called with a formula, a training data
frame and a testing data frame as its first three arguments. Moreover,
any additional arguments used in the call to standardPRE will also be
forwarded to these user-defined functions. Finally, these functions
should return a list with two components, "train" and "test",
containing the pre-processed versions of the supplied train and test
data frames.
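As an illustration of this protocol, here is a minimal sketch of a
user-defined pre-processing function written in base R. The function
name logPredictors, its offset argument and the toy data frames are
hypothetical and only serve the example; the sketch assumes nothing
about the package beyond the calling convention described above
(formula, training data frame and testing data frame first, a list
with "train" and "test" returned).

```r
# Hypothetical user-defined pre-processing step (not part of the package).
# Protocol: the first three arguments are a formula, a training data frame
# and a testing data frame; the result is a list with "train" and "test".
logPredictors <- function(form, train, test, offset = 1, ...) {
  # Name of the target variable, taken from the left-hand side of the formula
  tgt <- all.vars(form)[1]
  # Numeric predictors only; the target column is left untouched
  numCols <- setdiff(names(train)[sapply(train, is.numeric)], tgt)
  for (col in numCols) {
    train[[col]] <- log(train[[col]] + offset)
    test[[col]]  <- log(test[[col]] + offset)
  }
  list(train = train, test = test)
}

# Toy data, made up for the example
tr <- data.frame(x = c(0, 1, 3), g = factor(c("a", "b", "a")), y = c(1, 2, 3))
ts <- data.frame(x = 7, g = factor("b", levels = c("a", "b")), y = 4)

# Direct call, mimicking how the function would be invoked under the protocol
res <- logPredictors(y ~ ., tr, ts)
str(res)
```

The trailing ... in the signature lets the function tolerate forwarded
arguments it does not use itself.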
The function already implements the following pre-processing steps,
which can be used in the steps parameter:
"scale" - that scales (subtracts the mean and divides by the standard
deviation) any numeric features on both the training and testing
sets. Note that the mean and standard deviation are calculated using
only the training sample.
"centralImp" - that fills in any NA
values in both sets using
the median value for numeric predictors and the mode for nominal
predictors. Once again these centrality statistics are calculated
using only the training set although they are applied to both train
and test sets.
"knnImp" - that fills in any NA
values in both sets using
the median value for numeric predictors and the mode for nominal
predictors, but using only the k-nearest neighbors to calculate these
statistics.
"na.omit" - that uses the R function na.omit
to remove any rows containing NAs from both the training and test sets.
"undersampl" - this undersamples the training data cases that do not
belong to the minority class (this pre-processing step is only
available for classification tasks!). It takes the parameter
perc.under, which controls the level of undersampling (it defaults to
1, which means that there will be as many cases from the minority
class as from the other class or classes).
"smote" - this operation uses the SMOTE (Chawla et al. 2002)
resampling algorithm to generate a new training sample with a more
"balanced" distribution of the target class (this pre-processing step
is only available for classification tasks!). It takes the parameters
perc.under, perc.over and k to control the algorithm. See the
documentation of function smote for more details.
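The "scale" and "centralImp" steps described above share one idea: the
statistics are calculated on the training sample only and then applied
to both sets. The following base R sketch illustrates that idea. It is
not the package's implementation; the function names scaleByTrain and
centralImputeByTrain and the toy data are hypothetical, and the sketch
assumes that factor columns in the test set carry the same levels as
in the training set.

```r
# Illustrative sketch only (hypothetical helper names, not the package code).

# Fill NAs with the training median (numeric) or training mode (nominal)
centralImputeByTrain <- function(train, test) {
  for (col in names(train)) {
    x <- train[[col]]
    if (is.numeric(x)) {
      fill <- median(x, na.rm = TRUE)
    } else {
      # Most frequent value; table() ignores NAs by default
      fill <- names(which.max(table(x)))
    }
    train[[col]][is.na(train[[col]])] <- fill
    test[[col]][is.na(test[[col]])]   <- fill
  }
  list(train = train, test = test)
}

# Standardize the given numeric columns using training mean and sd only
scaleByTrain <- function(train, test, cols) {
  mu  <- sapply(train[cols], mean, na.rm = TRUE)
  sdv <- sapply(train[cols], sd, na.rm = TRUE)
  for (col in cols) {
    train[[col]] <- (train[[col]] - mu[[col]]) / sdv[[col]]
    test[[col]]  <- (test[[col]]  - mu[[col]]) / sdv[[col]]
  }
  list(train = train, test = test)
}

# Toy data, made up for the example
tr <- data.frame(x = c(1, 2, 3, NA), g = factor(c("a", "a", "b", NA)))
ts <- data.frame(x = c(NA, 10), g = factor(c(NA, "b"), levels = c("a", "b")))

imp <- centralImputeByTrain(tr, ts)
sc  <- scaleByTrain(imp$train, imp$test, "x")
```

Because the test set never contributes to the statistics, the test
values are transformed exactly as unseen data would be.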