mlrCPO: Composable Preprocessing Operators for mlr
GSoC 2017 Project: Operator Based Machine Learning Pipeline Construction
What is CPO?
> task = iris.task
> task %<>>% cpoScale(scale = FALSE) %>>% cpoPca() %>>%  # pca
+   cpoFilterChiSquared(abs = 3) %>>%                    # filter
+   cpoModelMatrix(~ 0 + .^2)                            # interactions
> head(getTaskData(task))
PC1 PC2 PC3 PC1:PC2 PC1:PC3 PC2:PC3 Species
1 -2.684126 -0.3193972 0.02791483 0.8573023 -0.07492690 -0.008915919 setosa
2 -2.714142 0.1770012 0.21046427 -0.4804064 -0.57122986 0.037252434 setosa
3 -2.888991 0.1449494 -0.01790026 -0.4187575 0.05171367 -0.002594632 setosa
4 -2.745343 0.3182990 -0.03155937 -0.8738398 0.08664130 -0.010045316 setosa
5 -2.728717 -0.3267545 -0.09007924 0.8916204 0.24580071 0.029433798 setosa
6 -2.280860 -0.7413304 -0.16867766 1.6908707 0.38473006 0.125045884 setosa
"Composable Preprocessing Operators" are an extension for the mlr ("Machine Learning in R") project which represent preprocessing operations (e.g. imputation or PCA) in the form of R objects. These CPO objects can be composed to form more complex operations, they can be applied to data sets, and they can be attached to mlr Learner
objects to generate complex machine learning pipelines that perform both preprocessing and model fitting.
Short Overview
CPOs are created by calling a constructor.
> cpoScale()
scale(center = TRUE, scale = TRUE)
The created objects have hyperparameters that can be manipulated using getHyperPars, setHyperPars etc., just like in mlr.
> getHyperPars(cpoScale())
$scale.center
[1] TRUE
$scale.scale
[1] TRUE
> setHyperPars(cpoScale(), scale.center = FALSE)
scale(center = FALSE, scale = TRUE)
The %>>% operator can be used to create complex pipelines.
> cpoScale() %>>% cpoPca()
(scale >> pca)(scale.center = TRUE, scale.scale = TRUE)
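The composed object behaves like a single CPO: it exposes the hyperparameters of all its components, which can be queried and changed on the combined pipeline just as on the individual CPOs. A minimal sketch (the exact set of exposed parameters may vary with the installed version):
pipeline = cpoScale() %>>% cpoPca()
getHyperPars(pipeline)                                  # e.g. scale.center, scale.scale
pipeline = setHyperPars(pipeline, scale.scale = FALSE)  # returns the modified pipeline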
This operator can also be used to apply an operation to a data set:
> head(iris %>>% cpoPca())
Species PC1 PC2 PC3 PC4
1 setosa -5.912747 2.302033 0.007401536 0.003087706
2 setosa -5.572482 1.971826 0.244592251 0.097552888
3 setosa -5.446977 2.095206 0.015029262 0.018013331
4 setosa -5.436459 1.870382 0.020504880 -0.078491501
5 setosa -5.875645 2.328290 -0.110338269 -0.060719326
6 setosa -6.477598 2.324650 -0.237202487 -0.021419633
Or to attach an operation to an mlr Learner, which adds the CPO's hyperparameters to the Learner's own:
> cpoScale() %>>% makeLearner("classif.logreg")
Learner classif.logreg.scale from package stats
Type: classif
Name: ; Short name:
Class: CPOLearner
Properties: numerics,factors,prob,twoclass
Predict-Type: response
Hyperparameters: model=FALSE,scale.center=TRUE,scale.scale=TRUE
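The resulting CPOLearner is used like any other mlr Learner: its combined hyperparameters can be set, and it can be trained and used for prediction, with the preprocessing applied automatically in both steps. A minimal sketch, assuming mlr's built-in two-class task pid.task (classif.logreg needs a binary task); the parameter value shown is illustrative only:
library(mlr)
library(mlrCPO)

# wrap the preprocessing and the model into one learner
lrn = cpoScale() %>>% makeLearner("classif.logreg")

# the CPO's hyperparameters are set on the learner like any other parameter
lrn = setHyperPars(lrn, scale.center = FALSE)

# train and predict as with any mlr Learner; the scaling is applied automatically
model = train(lrn, pid.task)     # pid.task: a two-class task shipped with mlr
pred  = predict(model, pid.task)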
Get a list of all CPOs by calling listCPO().
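listCPO() returns a data.frame describing the available CPOs; a quick way to browse it (the column names below are an assumption and may differ between versions):
cpos = listCPO()                       # data.frame describing the built-in CPOs
head(cpos[, c("name", "category")])    # assumed columns; check names(cpos)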
Installation
Install mlrCPO from CRAN, or use the more recent GitHub version:
devtools::install_github("mlr-org/mlrCPO")
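The CRAN release itself is installed in the usual way:
install.packages("mlrCPO")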
Documentation
To effectively use mlrCPO, you should first familiarize yourself a little with mlr. There is an extensive tutorial online; for more resources on mlr, see the overview on mlr's GitHub page.
To get familiar with mlrCPO, it is recommended that you read the vignettes. For each vignette, there is also a compact version that has all the R output removed.
- First Steps: Introduction and short overview (compact version).
- mlrCPO Core: Description of general tools for CPO handling (compact version).
- Builtin CPOs: Listing and description of all builtin CPOs (compact version).
- Custom CPOs: How to create your own CPOs (compact version).
- CPO Internals: A small intro guide for developers into the code base.
See the info directory for pdf / html versions.
For more documentation of individual mlrCPO functions, use R's built-in help() functionality.
Project Status
The foundation of mlrCPO is built and reasonably stable; only small improvements and stability fixes are expected here. However, there are still many concrete preprocessing operators to be written.
Similar Projects
There are other projects that provide functionality similar to mlrCPO for other machine learning frameworks. The caret project provides some preprocessing functionality, though not as flexible as mlrCPO's. dplyr has similar syntax and some overlapping functionality, but is ultimately focused more on (manual) data manipulation than on (machine-learning-pipeline-integrated) preprocessing. Much closer to mlrCPO's functionality is the Recipes package. scikit-learn also has preprocessing functionality built in.
License
The BSD 2-Clause License