mlrCPO: Composable Preprocessing Operators for mlr

GSoC 2017 Project: Operator Based Machine Learning Pipeline Construction

What is CPO?

> task = iris.task
> task %<>>% cpoScale(scale = FALSE) %>>% cpoPca() %>>%  # pca
>   cpoFilterChiSquared(abs = 3) %>>%  # filter
>   cpoModelMatrix(~ 0 + .^2)  # interactions
> head(getTaskData(task))
        PC1        PC2         PC3    PC1:PC2     PC1:PC3      PC2:PC3 Species
1 -2.684126 -0.3193972  0.02791483  0.8573023 -0.07492690 -0.008915919  setosa
2 -2.714142  0.1770012  0.21046427 -0.4804064 -0.57122986  0.037252434  setosa
3 -2.888991  0.1449494 -0.01790026 -0.4187575  0.05171367 -0.002594632  setosa
4 -2.745343  0.3182990 -0.03155937 -0.8738398  0.08664130 -0.010045316  setosa
5 -2.728717 -0.3267545 -0.09007924  0.8916204  0.24580071  0.029433798  setosa
6 -2.280860 -0.7413304 -0.16867766  1.6908707  0.38473006  0.125045884  setosa

"Composable Preprocessing Operators" (CPOs) are an extension for the mlr ("Machine Learning in R") project that represents preprocessing operations (e.g. imputation or PCA) as R objects. These CPO objects can be composed to form more complex operations, applied to data sets, and attached to mlr Learner objects to build machine learning pipelines that perform both preprocessing and model fitting.

Short Overview

CPOs are created by calling a constructor.

> cpoScale()
scale(center = TRUE, scale = TRUE)

The created objects have hyperparameters that can be manipulated using getHyperPars, setHyperPars, etc., just like in mlr.

> getHyperPars(cpoScale())
$scale.center
[1] TRUE

$scale.scale
[1] TRUE

> setHyperPars(cpoScale(), scale.center = FALSE)
scale(center = FALSE, scale = TRUE)
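
As a small sketch of the hyperparameter interface described above (assuming mlr and mlrCPO are installed; the ID "sc" is an arbitrary label chosen for illustration), parameters can also be given to the constructor directly, and the CPO's ID determines the prefix of its hyperparameter names:

```r
library(mlr)
library(mlrCPO)

# Set a hyperparameter directly in the constructor call.
cpo = cpoScale(center = FALSE)
getHyperPars(cpo)$scale.center

# setHyperPars() returns a modified copy; the original object is unchanged.
cpo2 = setHyperPars(cpo, scale.scale = FALSE)
getHyperPars(cpo2)

# The ID is used as the prefix of the hyperparameter names.
# "sc" is an arbitrary example ID, not a name required by the package.
cpo3 = setCPOId(cpoScale(), "sc")
names(getHyperPars(cpo3))
```

Distinct IDs are useful when the same kind of CPO should appear more than once in a pipeline, since the prefixed parameter names would otherwise clash.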

The %>>% operator composes CPOs into more complex pipelines.

> cpoScale() %>>% cpoPca()
(scale >> pca)(scale.center = TRUE, scale.scale = TRUE)
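
The composition shown above can also be written with plain function calls. The following is a minimal sketch, assuming mlr and mlrCPO are installed; it uses composeCPO, as.list and pipeCPO from the function index of this package to build, split and rebuild a pipeline:

```r
library(mlr)
library(mlrCPO)

# Operator form of composition.
pipeline = cpoScale() %>>% cpoPca()

# Function form of the same composition of two CPOs.
same = composeCPO(cpoScale(), cpoPca())

# Split the pipeline into its constituent CPOs ...
parts = as.list(pipeline)
length(parts)

# ... and chain a list of CPOs back into a single pipeline.
rebuilt = pipeCPO(parts)
rebuilt
```

Building a pipeline from a list can be convenient when the sequence of steps is assembled programmatically.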

This operator can also be used to apply an operation to a data set:

> head(iris %>>% cpoPca())
  Species       PC1      PC2          PC3          PC4
1  setosa -5.912747 2.302033  0.007401536  0.003087706
2  setosa -5.572482 1.971826  0.244592251  0.097552888
3  setosa -5.446977 2.095206  0.015029262  0.018013331
4  setosa -5.436459 1.870382  0.020504880 -0.078491501
5  setosa -5.875645 2.328290 -0.110338269 -0.060719326
6  setosa -6.477598 2.324650 -0.237202487 -0.021419633
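
When preprocessing is applied to training data, the same transformation usually has to be re-applied to new data. The sketch below (assuming mlr and mlrCPO are installed; the 100/50 split and the seed are arbitrary choices for illustration) uses retrafo() to retrieve the trained transformation from the result and applies it to held-out rows:

```r
library(mlr)
library(mlrCPO)

# Arbitrary illustrative split of iris into 100 training and 50 test rows.
set.seed(1)
train.idx = sample(nrow(iris), 100)
train.data = iris[train.idx, ]
test.data = iris[-train.idx, ]

# Applying the CPO fits it to the training rows (here: the PCA rotation).
train.trafo = train.data %>>% cpoPca()

# The result carries a retrafo object that represents the fitted transformation.
rt = retrafo(train.trafo)

# Transform the held-out rows with the rotation learned on the training rows.
test.trafo = test.data %>>% rt
head(test.trafo)
```

Applying cpoPca() to the test rows directly would instead compute a new rotation from the test data, which is typically not what is wanted for model evaluation.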

The operator can also attach an operation to an mlr Learner, which extends the Learner's hyperparameters by the CPO's hyperparameters:

> cpoScale() %>>% makeLearner("classif.logreg")
Learner classif.logreg.scale from package stats
Type: classif
Name: ; Short name: 
Class: CPOLearner
Properties: numerics,factors,prob,twoclass
Predict-Type: response
Hyperparameters: model=FALSE,scale.center=TRUE,scale.scale=TRUE
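
A learner with attached CPOs can be used like any other mlr Learner. The following sketch assumes mlr, mlrCPO and rpart are installed; "classif.rpart" is chosen here only because it can handle the three-class iris task, it is not prescribed by mlrCPO:

```r
library(mlr)
library(mlrCPO)

# Attach a preprocessing pipeline to a learner.
lrn = cpoScale() %>>% cpoPca() %>>% makeLearner("classif.rpart")

# CPO hyperparameters are set through the learner like any other parameter.
lrn = setHyperPars(lrn, scale.center = FALSE)

# Training fits the preprocessing and the model together; prediction
# re-applies the fitted preprocessing to the new data.
model = train(lrn, iris.task)
pred = predict(model, newdata = iris[1:10, ])
performance(predict(model, iris.task), measures = acc)

# The attached CPO and the underlying learner can be recovered again.
getLearnerCPO(lrn)
getLearnerBare(lrn)
```

Because the CPO parameters are part of the learner's parameter set, they can in principle be tuned together with the model's own parameters using mlr's tuning tools.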

Get a list of all CPOs by calling listCPO().
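
As a short sketch of how the listing might be explored (assuming mlr and mlrCPO are installed, and assuming the returned data frame has the columns "name", "category" and "subcategory"; check names(listCPO()) for the installed version):

```r
library(mlr)
library(mlrCPO)

cpos = listCPO()

# Column names are an assumption; inspect names(cpos) if this fails.
head(cpos[, c("name", "category", "subcategory")])

# Number of available CPOs.
nrow(cpos)
```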

Installation

Install mlrCPO from CRAN, or use the more recent GitHub version:

devtools::install_github("mlr-org/mlrCPO")

Documentation

To use mlrCPO effectively, you should first familiarize yourself a little with mlr. There is an extensive tutorial online; for more resources on mlr, see the overview on mlr's GitHub page.

To get familiar with mlrCPO, it is recommended that you read the vignettes. For each vignette, there is also a compact version that has all the R output removed.

  1. First Steps: Introduction and short overview (compact version).
  2. mlrCPO Core: Description of general tools for CPO handling (compact version).
  3. Builtin CPOs: Listing and description of all builtin CPOs (compact version).
  4. Custom CPOs: How to create your own CPOs (compact version).
  5. CPO Internals: A small introductory guide to the code base for developers. See the info directory for pdf / html versions.

For more documentation of individual mlrCPO functions, use R's built-in help() functionality.
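
The vignettes and help pages can also be reached from within R. A minimal sketch, assuming mlrCPO is installed together with its vignettes:

```r
library(mlrCPO)

# List the vignettes shipped with the installed package version.
vigs = vignette(package = "mlrCPO")
print(vigs)

# Open the help page of a single constructor, or of the package itself.
help("cpoScale", package = "mlrCPO")
help("mlrCPO-package", package = "mlrCPO")
```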

Project Status

The foundation of mlrCPO is built and is reasonably stable; only small improvements and stability fixes are expected here. There are still many concrete implementations of preprocessing operators to be written.

Similar Projects

There are other projects that provide functionality similar to mlrCPO for other machine learning frameworks. The caret project provides some preprocessing functionality, though it is not as flexible as mlrCPO. dplyr has similar syntax and some overlapping functionality, but is ultimately focused more on (manual) data manipulation than on preprocessing integrated into a machine learning pipeline. Much closer to mlrCPO's functionality is the Recipes package. scikit-learn also has preprocessing functionality built in.

License

The BSD 2-Clause License

Install: install.packages('mlrCPO')
Monthly Downloads: 310
Version: 0.3.2
License: BSD_2_clause + file LICENSE
Maintainer: Martin Binder
Last Published: April 5th, 2018
Functions in mlrCPO (0.3.2)

cpoFilterFeatures - Filter Features by Thresholding Filter Values
cpoFilterCarscore - Filter Features: “carscore”
cpoFilterOneR - Filter Features: “oneR”
cpoFilterRfSRCMinDepth - Filter Features: “randomForestSRC.var.select”
cpoFilterMrmr - Filter Features: “mrmr”
cpoFilterLinearCorrelation - Filter Features: “linear.correlation”
cpoFilterKruskal - Filter Features: “kruskal.test”
cpoFilterRfSRCImportance - Filter Features: “randomForestSRC.rfsrc”
cpoFilterInformationGain - Filter Features: “information.gain”
cpoAsNumeric - Convert All Features to Numerics
cpoFilterRankCorrelation - Filter Features: “rank.correlation”
cpoFilterRfImportance - Filter Features: “randomForest.importance”
cpoFilterRfCImportance - Filter Features: “cforest.importance”
cpoFilterGainRatio - Filter Features: “gain.ratio”
cpoFilterRelief - Filter Features: “relief”
cpoFilterSymmetricalUncertainty - Filter Features: “symmetrical.uncertainty”
cpoFilterUnivariate - Filter Features: “univariate.model.score”
cpoFilterChiSquared - Filter Features: “chi.squared”
cpoFilterPermutationImportance - Filter Features: “permutation.importance”
cpoFilterVariance - Filter Features: “variance”
cpoFixFactors - Clean Up Factorial Features
cpoImpactEncodeRegr - Impact Encoding
cpoImputeLearner - Perform Imputation with an mlr Learner
cpoImpute - Impute and Re-Impute Data
cpoIca - Construct a CPO for ICA Preprocessing
cpoLogTrafoRegr - Log-Transform a Regression Target Variable
cpoImputeHist - Perform Imputation with Random Values
cpoImpactEncodeClassif - Impact Encoding
cpoImputeMedian - Perform Imputation with Median Value
cpoMakeCols - Create Columns from Expressions
cpoModelMatrix - Create a “Model Matrix” from the Data Given a Formula
cpoImputeMean - Perform Imputation with Mean Value
cpoImputeNormal - Perform Imputation with Normally Distributed Random Values
cpoMissingIndicators - Convert Data into Factors Indicating Missing Data
cpoSpatialSign - Scale Rows to Unit Length
cpoScaleMaxAbs - Max Abs Scaling CPO
cpoImputeMax - Perform Imputation with Multiple of Maximum
cpoScaleRange - Range Scaling CPO
cpoTemplate - Dummy Function for Documentation Purposes
cpoImputeConstant - Perform Imputation with Constant Value
cpoSelect - Drop All Columns Except Certain Selected Ones from Data
cpoImputeUniform - Perform Imputation with Uniformly Random Values
cpoProbEncode - Probability Encoding
getCPOAffect - Get the Selection Arguments for Affected CPOs
cpoImputeMin - Perform Imputation with Multiple of Minimum
getCPOClass - Get the CPO Class
cpoPca - Construct a CPO for PCA Preprocessing
cpoOversample - Over- or Undersample Binary Classification Tasks
cpoQuantileBinNumerics - Split Numeric Features into Quantile Bins
cpoSample - Sample Data from a Task
discrete - defined to avoid problems with the static type checker
cpoScale - Construct a CPO for Scaling / Centering
cpoImputeMode - Perform Imputation with Mode Value
cpoSmote - Perform SMOTE Oversampling for Binary Classification
getCPOName - Get the CPO Object's Name
getCPOPredictType - Get the CPO predict.type
cpoRegrResiduals - Train a Model on a Task and Return the Residual Task
cpoResponseFromSE - Use the “se” predict.type for “response” Prediction
getCPOProperties - Get the Properties of the Given CPO Object
cpoTransformParams - Transform CPO Hyperparameters
getCPOOperatingType - Determine the Operating Type of the CPO
cpoWrap - CPO Wrapper
getCPOTrainedCapability - Get the CPOTrained's Capabilities
identicalCPO - Check Whether Two CPO are Fundamentally the Same
getCPOTrainedCPO - Get CPO Used to Train a Retrafo / Inverter
funct - defined to avoid problems with the static type checker
getLearnerCPO - Get the CPO Associated with a Learner
getCPOConstructor - Get the CPOConstructor Used to Create a CPO Object
getCPOId - Get the ID of a CPO Object
invert - Invert Target Preprocessing
%>>% - CPO Composition / Attachment / Application Operator
internal%>>% - Internally Used %>>% Operators
listCPO - List all Built-in CPOs
nullToNullcpo - NULL to NULLCPO
makeCPO - Create a Custom CPO Constructor
is.nullcpo - Check for NULLCPO
print.CPOConstructor - Print CPO Objects
setCPOId - Set the ID of a CPO Object
is.inverter - Check CPOInverter
is.retrafo - Check CPORetrafo
nullcpoToNull - NULLCPO to NULL
getLearnerBare - Get the Learner with the CPOs Removed
getCPOTrainedState - Get the Internal State of a CPORetrafo Object
untyped - defined to avoid problems with the static type checker
makeCPOTrainedFromState - Create a CPOTrained with Given Internal State
mlrCPO-package - Composable Preprocessing Operators
makeCPOMultiplex - CPO Multiplexer
pSS - Turn the argument list into a ParamSet
pipeCPO - Turn a list of CPOs into a Single Chained One
makeCPOCase - Build Data-Dependent CPOs
attachCPO - Attach a CPO to a Learner
applyCPO - Apply a CPO to Data
composeCPO - CPO Composition
CPOConstructor - Constructor for CPO Objects
CPO - Composable Preprocessing Operators
as.list.CPO - Split a Pipeline into Its Constituents
CPOLearner - CPO Learner Object
clearRI - Clear Retrafo and Inverter Attributes
NULLCPO - CPO Composition Neutral Element
CPOTrained - Get the Retransformation or Inversion Function from a Resulting Object
covrTraceCPOs - Add 'covr' coverage to CPOs
cpoCollapseFact - Combine Rare Factors
cpoCache - Caches the Result of CPO Transformations
cpoApplyFunRegrTarget - Transform a Regression Target Variable
cpoCbind - “cbind” the Result of Multiple CPOs
cpoDummyEncode - CPO Dummy Encoder
cpoApplyFun - Apply a Function Element-Wise
cpoFilterAnova - Filter Features: “anova.test”
cpoDropConstants - Drop Constant or Near-Constant Features