mlrCPO is a toolset that enriches mlr with a diverse set of preprocessing operators.
Composable Preprocessing Operators (“CPO”s) are first-class R objects that can be applied to
data.frames and mlr Tasks to modify data, they can be attached to mlr
Learners to add preprocessing to machine learning algorithms, and they can be composed to form
preprocessing pipelines.
mlrCPO focuses on preprocessing as part of automated machine learning pipelines. This means that
it is designed with the expectation that the same preprocessing options are applied to incoming training data,
and test data. A common mistake in machine learning is that a machine learning method is evaluated (e.g. using
resampling) on a dataset after that dataset has been cleaned up and preprocessed in one go. The proper evaluation
would need to consider that the preprocessing of training data may not be influenced by any information contained
in the test data set. mlrCPO takes this duality into account by providing CPO objects that
run on training data, and which then create CPOTrained objects that can be used on test data (or entirely new
prediction data).
This focus on preprocessing is the reason for a strict separation between “Feature Operation” CPOs,
“Target Operation” CPOs, and “Retrafoless” CPOs (see OperatingType).
The first class only changes
(predictor) features of a dataset, and does so in a way reproducible on test data. The second class only changes
(outcome) target data of a dataset, and is then able to invert the prediction, made by a learner
on new data, back to the space of the original target data. The “Retrafoless” CPO only operates
during training and may only add or subtract data rows (e.g. for SMOTE-ing or subsampling), without transforming
the space of either predictor or outcome variables.
CPO's design is supposed to help its user avoid bugs and errors. Therefore it often avoids doing things
implicitly and relies on explicit commands e.g. for removing data or converting between datatypes. It has certain
restrictions in place (e.g. CPOProperties, CPOTrainedCapability) that try to make it hard to do the
wrong thing while not being in the way of the right thing.
Other packages with similar, partially overlapping functionality are recipes, dplyr, and caret.