mlrCPO v0.3.7-2

Composable Preprocessing Operators and Pipelines for Machine Learning

Toolset that enriches 'mlr' with a diverse set of preprocessing operators. Composable Preprocessing Operators ("CPO"s) are first-class R objects that can be applied to data.frames and 'mlr' "Task"s to modify data, can be attached to 'mlr' "Learner"s to add preprocessing to machine learning algorithms, and can be composed to form preprocessing pipelines.

Readme

mlrCPO: Composable Preprocessing Operators for mlr

GSoC 2017 Project: Operator Based Machine Learning Pipeline Construction

What is CPO?

> task = iris.task
> task %<>>% cpoScale(scale = FALSE) %>>% cpoPca() %>>%  # pca
>   cpoFilterChiSquared(abs = 3) %>>%  # filter
>   cpoModelMatrix(~ 0 + .^2)  # interactions
> head(getTaskData(task))
        PC1        PC2         PC3    PC1:PC2     PC1:PC3      PC2:PC3 Species
1 -2.684126 -0.3193972  0.02791483  0.8573023 -0.07492690 -0.008915919  setosa
2 -2.714142  0.1770012  0.21046427 -0.4804064 -0.57122986  0.037252434  setosa
3 -2.888991  0.1449494 -0.01790026 -0.4187575  0.05171367 -0.002594632  setosa
4 -2.745343  0.3182990 -0.03155937 -0.8738398  0.08664130 -0.010045316  setosa
5 -2.728717 -0.3267545 -0.09007924  0.8916204  0.24580071  0.029433798  setosa
6 -2.280860 -0.7413304 -0.16867766  1.6908707  0.38473006  0.125045884  setosa

"Composable Preprocessing Operators" are an extension for the mlr ("Machine Learning in R") project which represent preprocessing operations (e.g. imputation or PCA) in the form of R objects. These CPO objects can be composed to form more complex operations, they can be applied to data sets, and they can be attached to mlr Learner objects to generate complex machine learning pipelines that perform both preprocessing and model fitting.

Short Overview

CPOs are created by calling a constructor.

> cpoScale()
scale(center = TRUE, scale = TRUE)

The created objects have hyperparameters that can be inspected and changed using getHyperPars(), setHyperPars(), etc., just like in mlr.

> getHyperPars(cpoScale())
$scale.center
[1] TRUE

$scale.scale
[1] TRUE

> setHyperPars(cpoScale(), scale.center = FALSE)
scale(center = FALSE, scale = TRUE)

The %>>%-operator can be used to create complex pipelines.

> cpoScale() %>>% cpoPca()
(scale >> pca)(scale.center = TRUE, scale.scale = TRUE)
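
As a minimal sketch of what such a pipeline object contains, the following splits the pipeline above into its constituent CPOs with as.list() and chains them back together with pipeCPO(); both functions appear in the function listing of this package. The variable names are illustrative only.

```r
library(mlr)
library(mlrCPO)

pipeline = cpoScale() %>>% cpoPca()

# as.list() splits a compound CPO into its individual steps, in order
parts = as.list(pipeline)
step.names = vapply(parts, getCPOName, character(1))
print(step.names)

# pipeCPO() chains a list of CPOs back into a single pipeline
rebuilt = pipeCPO(parts)
print(rebuilt)
```

A pipeline built with %>>% is an ordinary R object, so it can be stored, inspected, and reused like any other value.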

This operator can also be used to apply an operation to a data set:

> head(iris %>>% cpoPca())
  Species       PC1      PC2          PC3          PC4
1  setosa -5.912747 2.302033  0.007401536  0.003087706
2  setosa -5.572482 1.971826  0.244592251  0.097552888
3  setosa -5.446977 2.095206  0.015029262  0.018013331
4  setosa -5.436459 1.870382  0.020504880 -0.078491501
5  setosa -5.875645 2.328290 -0.110338269 -0.060719326
6  setosa -6.477598 2.324650 -0.237202487 -0.021419633
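
Applying a CPO to data also trains it. As a sketch of how the same trained transformation can be reused on new data, the example below retrieves the trained object with retrafo() (documented under CPOTrained in the function listing) and applies it with %>>%. The split of iris into train.data and new.data is arbitrary and only for illustration.

```r
library(mlr)
library(mlrCPO)

# arbitrary split, for illustration only
train.data = iris[seq(1, 150, by = 2), ]
new.data = iris[seq(2, 150, by = 2), ]

# applying the CPO fits the PCA rotation on train.data
train.transformed = train.data %>>% cpoPca()

# retrafo() extracts the trained transformation (a CPOTrained object)
trained = retrafo(train.transformed)

# re-apply the rotation learned on train.data to new.data, without refitting
new.transformed = new.data %>>% trained
print(head(new.transformed))
```

The %>>% operator thus serves both to apply a CPO to a data set and to apply an already trained transformation to new data.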

The %>>% operator can also attach an operation to an mlr Learner, which extends the Learner's hyperparameters by the CPO's hyperparameters:

> cpoScale() %>>% makeLearner("classif.logreg")
Learner classif.logreg.scale from package stats
Type: classif
Name: ; Short name: 
Class: CPOLearner
Properties: numerics,factors,prob,twoclass
Predict-Type: response
Hyperparameters: model=FALSE,scale.center=TRUE,scale.scale=TRUE
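
A CPOLearner can be trained and used for prediction like any other mlr Learner. The sketch below uses classif.rpart instead of classif.logreg, because the built-in iris.task has three classes; this substitution is an assumption made for the example, not something the README prescribes. The CPO's hyperparameters are set through the Learner, and getLearnerCPO() / getLearnerBare() recover the two components.

```r
library(mlr)
library(mlrCPO)

lrn = cpoScale() %>>% makeLearner("classif.rpart")

# CPO hyperparameters are exposed on the Learner, prefixed with the CPO's id
lrn = setHyperPars(lrn, scale.center = FALSE)

# training fits the scaling and the model together
model = train(lrn, iris.task)

# prediction applies the stored scaling to the new data before predicting
pred = predict(model, newdata = iris[c(1, 51, 101), ])
print(pred)

# split the CPOLearner back into its preprocessing part and its bare Learner
print(getLearnerCPO(lrn))
print(getLearnerBare(lrn))
```

Because the preprocessing is part of the Learner, it is re-fitted on each training set during resampling rather than once on the full data.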

Get a list of all CPOs by calling listCPO().
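
The result of listCPO() is a data.frame, so it can be filtered like one. A brief sketch follows; the column names name, category, and subcategory are assumed from the package vignettes and may differ between versions.

```r
library(mlr)
library(mlrCPO)

cpos = listCPO()

# compact overview (column names are an assumption, see above)
print(head(cpos[, c("name", "category", "subcategory")]))

# look up a single constructor by its name
print(cpos[cpos$name == "cpoPca", ])
```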

Installation

Install mlrCPO from CRAN:

install.packages("mlrCPO")

Or install the more recent development version from GitHub:

devtools::install_github("mlr-org/mlrCPO")

Documentation

To effectively use mlrCPO, you should first familiarize yourself a little with mlr. There is an extensive tutorial online; for more resources on mlr, see the overview on mlr's GitHub page.

To get familiar with mlrCPO, it is recommended that you read the vignettes. For each vignette, there is also a compact version that has all the R output removed.

  1. First Steps: Introduction and short overview (compact version).
  2. mlrCPO Core: Description of general tools for CPO handling (compact version).
  3. Builtin CPOs: Listing and description of all builtin CPOs (compact version).
  4. Custom CPOs: How to create your own CPOs (compact version).
  5. CPO Internals: A short introduction to the code base for developers. See the info directory for PDF / HTML versions.

For more documentation of individual mlrCPO functions, use R's built-in help() functionality.

Project Status

The foundation of mlrCPO is built and reasonably stable; only small improvements and stability fixes are expected here. Many concrete implementations of preprocessing operators still remain to be written.

Contributing

Bugs, Questions, Feedback

mlrCPO is a free and open source software project that encourages participation and feedback. If you have any issues, questions, suggestions or feedback, please do not hesitate to open an "issue" about it on the GitHub page!

In case of problems / bugs, it is often helpful if you provide a "minimum working example" that showcases the behaviour (but don't worry about this if the bug is obvious).

Please understand that the resources of the project are limited: responses may sometimes be delayed by a few days, and some suggestions may not become features for a while.

Contributing Code, Pull Requests

Pull Requests that fix small issues are very welcome, especially if they contain tests that check for the given issue. For larger contributions, or Pull Requests that add features, please note:

  1. Adding new CPOs is always welcome. Please have a look at a few examples in the current codebase (the PCA CPO and the corresponding tests file are good for this, and show that adding a CPO does not require a lot of code) to familiarise yourself with the conventions. A CPO that comes with documentation, in particular also documenting the CPOTrained state, and with tests, is most likely to get merged quickly.

  2. Adding or changing features of the backend, or changing the functioning of the backend, is a more complicated story. If a Pull Request is incongruent with the "vision" behind mlrCPO, or if it appears to put a large burden on the mlrCPO developers in the long term relative to the problems it solves, it may have a slim chance of getting merged. Therefore, if you plan to make a contribution changing CPO core behaviour, it is best if you first open an "issue" about it for discussion.
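
To give a sense of how little code a new CPO needs (item 1 above), here is a minimal sketch using makeCPO() from the function listing. The CPO cpoCenterExample and its name "centerexample" are hypothetical and not part of mlrCPO. The argument names dataformat, cpo.train, and cpo.retrafo follow the Custom CPOs vignette as recalled here, so check them against help(makeCPO) before relying on them.

```r
library(mlr)
library(mlrCPO)

# Hypothetical example CPO (not part of mlrCPO): subtract the training-set
# column means from all numeric features.
cpoCenterExample = makeCPO("centerexample",
  dataformat = "numeric",  # the functions below only see numeric columns
  cpo.train = function(data, target) {
    # the returned value is the "control" object handed to cpo.retrafo
    colMeans(data)
  },
  cpo.retrafo = function(data, control) {
    # subtract the stored training means from whatever data arrives here
    as.data.frame(sweep(as.matrix(data), 2, control))
  })

centered = iris %>>% cpoCenterExample()

# the numeric columns of the training data now have mean (close to) zero
numeric.cols = centered[sapply(centered, is.numeric)]
print(colMeans(numeric.cols))
```

Keeping the training step (cpo.train) separate from the transformation step (cpo.retrafo) is what allows the stored means to be re-applied to new data later.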

When creating Pull Requests, please follow the Style Guide. Adherence to this is checked by the CI system (Travis). On Linux (and possibly Mac) you can check this locally on your computer using the quicklint tool in the tools directory. This is recommended to avoid frustrating failed builds caused by style violations.

Before merging a Pull Request, it is possible that an mlrCPO developer makes further changes to it, e.g. to harmonise it with conventions, or to incorporate other ideas.

When you make a Pull Request, it is assumed that you permit us (and are able to permit us) to incorporate the given code into the mlrCPO codebase as given, or with modifications, and distribute the result under the BSD 2-Clause License.

Similar Projects

There are other projects that provide functionality similar to mlrCPO for other machine learning frameworks. The caret project provides some preprocessing functionality, though it is not as flexible as mlrCPO. dplyr has similar syntax and some overlapping functionality, but is ultimately focused more on (manual) data manipulation than on preprocessing integrated into a machine learning pipeline. Much closer to mlrCPO's functionality is the recipes package. scikit-learn also has preprocessing functionality built in.

License

The BSD 2-Clause License

Functions in mlrCPO

Name Description
clearRI Clear Retrafo and Inverter Attributes
attachCPO Attach a CPO to a Learner
CPOConstructor Constructor for CPO Objects
applyCPO Apply a CPO to Data
CPO Composable Preprocessing Operators
NULLCPO CPO Composition Neutral Element
CPOLearner CPO Learner Object
as.list.CPO Split a Pipeline into Its Constituents
CPOTrained Get the Retransformation or Inversion Function from a Resulting Object
composeCPO CPO Composition
covrTraceCPOs Add 'covr' coverage to CPOs
cpoDropConstants Drop Constant or Near-Constant Features
cpoCollapseFact Combine Rare Factors
cpoCache Caches the Result of CPO Transformations
cpoCbind “cbind” the Result of Multiple CPOs
cpoApplyFun Apply a Function Element-Wise
cpoAsNumeric Convert All Features to Numerics
cpoApplyFunRegrTarget Transform a Regression Target Variable
cpoDropMostlyConstants Drop Constant or Near-Constant Features
cpoDummyEncode CPO Dummy Encoder
cpoFilterCarscore Filter Features: “carscore”
cpoFilterGainRatio Filter Features: “gain.ratio”
cpoFilterFeatures Filter Features by Thresholding Filter Values
cpoFilterOneR Filter Features: “oneR”
cpoFilterAnova Filter Features: “anova.test”
cpoFilterChiSquared Filter Features: “chi.squared”
cpoFilterMrmr Filter Features: “mrmr”
cpoFilterKruskal Filter Features: “kruskal.test”
cpoFilterLinearCorrelation Filter Features: “linear.correlation”
cpoFilterInformationGain Filter Features: “information.gain”
cpoFilterRfImportance Filter Features: “randomForest.importance”
cpoFilterSymmetricalUncertainty Filter Features: “symmetrical.uncertainty”
cpoFilterRfSRCMinDepth Filter Features: “randomForestSRC.var.select”
cpoFilterRelief Filter Features: “relief”
cpoFilterRfCImportance Filter Features: “cforest.importance”
cpoFilterVariance Filter Features: “variance”
cpoFilterUnivariate Filter Features: “univariate.model.score”
cpoFilterRfSRCImportance Filter Features: “randomForestSRC.rfsrc”
cpoFilterRankCorrelation Filter Features: “rank.correlation”
cpoFilterPermutationImportance Filter Features: “permutation.importance”
cpoImpactEncodeClassif Impact Encoding
cpoImpactEncodeRegr Impact Encoding
cpoIca Construct a CPO for ICA Preprocessing
cpoFixFactors Clean Up Factorial Features
cpoImputeHist Perform Imputation with Random Values
cpoImputeLearner Perform Imputation with an mlr Learner
cpoImputeConstant Perform Imputation with Constant Value
cpoImpute Impute and Re-Impute Data
cpoImputeMax Perform Imputation with Multiple of Minimum
cpoImputeMean Perform Imputation with Mean Value
cpoImputeNormal Perform Imputation with Normally Distributed Random Values
cpoLogTrafoRegr Log-Transform a Regression Target Variable.
cpoImputeUniform Perform Imputation with Uniformly Random Values
cpoImputeMode Perform Imputation with Mode Value
cpoMakeCols Create Columns from Expressions
cpoMissingIndicators Convert Data into Factors Indicating Missing Data
cpoImputeMin Perform Imputation with Multiple of Minimum
cpoImputeMedian Perform Imputation with Median Value
cpoModelMatrix Create a “Model Matrix” from the Data Given a Formula
cpoOversample Over- or Undersample Binary Classification Tasks
cpoProbEncode Probability Encoding
cpoResponseFromSE Use the “se” predict.type for “response” Prediction
cpoScaleRange Range Scaling CPO
cpoRegrResiduals Train a Model on a Task and Return the Residual Task
cpoPca Construct a CPO for PCA Preprocessing
cpoQuantileBinNumerics Split Numeric Features into Quantile Bins
cpoSample Sample Data from a Task
cpoScaleMaxAbs Max Abs Scaling CPO
cpoScale Construct a CPO for Scaling / Centering
cpoSelect Drop All Columns Except Certain Selected Ones from Data
cpoSpatialSign Scale Rows to Unit Length
cpoSmote Perform SMOTE Oversampling for Binary Classification
cpoWrap CPO Wrapper
discrete defined to avoid problems with the static type checker
getCPOPredictType Get the CPO predict.type
getCPOConstructor Get the CPOConstructor Used to Create a CPO Object
getLearnerCPO Get the CPO Associated with a Learner
getCPOClass Get the CPO Class
getLearnerBare Get the Learner with the CPOs Removed
getCPOOperatingType Determine the Operating Type of the CPO
funct defined to avoid problems with the static type checker
getCPOAffect Get the Selection Arguments for Affected CPOs
getCPOTrainedState Get the Internal State of a CPORetrafo Object
getCPOId Get the ID of a CPO Object
getCPOTrainedCapability Get the CPOTrained's Capabilities
getCPOName Get the CPO Object's Name
cpoTemplate Dummy Function for Documentation Purposes
cpoTransformParams Transform CPO Hyperparameters
getCPOProperties Get the Properties of the Given CPO Object
getCPOTrainedCPO Get CPO Used to Train a Retrafo / Inverter
makeCPOCase Build Data-Dependent CPOs
makeCPO Create a Custom CPO Constructor
is.nullcpo Check for NULLCPO
makeCPOTrainedFromState Create a CPOTrained with Given Internal State
is.inverter Check CPOInverter
makeCPOMultiplex CPO Multiplexer
pipeCPO Turn a list of CPOs into a Single Chained One
%>>% CPO Composition / Attachment / Application Operator
identicalCPO Check Whether Two CPO are Fundamentally the Same
print.CPOConstructor Print CPO Objects
mlrCPO-package Composable Preprocessing Operators
listCPO List all Built-in CPOs
invert Invert Target Preprocessing
setCPOId Set the ID of a CPO Object
internal%>>% Internally Used %>>% Operators
untyped defined to avoid problems with the static type checker
nullcpoToNull NULLCPO to NULL
pSS Turn the argument list into a ParamSet
is.retrafo Check CPORetrafo
nullToNullcpo NULL to NULLCPO

Vignettes of mlrCPO

Name
toc/vignettetoc.Rmd
a_1_getting_started.Rmd
a_2_mlrCPO_core.Rmd
a_3_all_CPOs.Rmd
a_4_custom_CPOs.Rmd
z_1_getting_started_terse.Rmd
z_2_mlrCPO_core_terse.Rmd
z_3_all_CPOs_terse.Rmd
z_4_custom_CPOs_terse.Rmd

Details

URL https://github.com/mlr-org/mlrCPO
BugReports https://github.com/mlr-org/mlrCPO/issues
License BSD_2_clause + file LICENSE
Encoding UTF-8
LazyData yes
Config/testthat/edition 3
Config/testthat/parallel true
ByteCompile yes
Collate 'CPOHelp.R' 'fauxCPOConstructor.R' 'auxiliary.R' 'ParamSetSugar.R' 'callInterface.R' 'FormatCheck.R' 'callCPO.R' 'properties.R' 'parameters.R' 'listCPO.R' 'makeCPO.R' 'CPO_applyFun.R' 'CPO_asNumeric.R' 'operators.R' 'NULLCPO.R' 'CPO_meta.R' 'CPO_cbind.R' 'CPO_collapseFact.R' 'CPO_dropConstants.R' 'CPO_dropMostlyConstants.R' 'CPO_encode.R' 'CPO_filterFeatures.R' 'CPO_fixFactors.R' 'CPO_ica.R' 'CPO_impute.R' 'CPO_makeCols.R' 'CPO_missingIndicators.R' 'CPO_modelMatrix.R' 'CPO_pca.R' 'CPO_quantileBinNumerics.R' 'CPO_regrResiduals.R' 'CPO_responseFromSE.R' 'CPO_scale.R' 'CPO_scaleMaxAbs.R' 'CPO_scaleRange.R' 'CPO_select.R' 'CPO_smote.R' 'CPO_spatialSign.R' 'CPO_subsample.R' 'CPO_wrap.R' 'RetrafoState.R' 'attributes.R' 'auxhelp.R' 'composeProperties.R' 'doublecaret.R' 'inverter.R' 'learner.R' 'makeCPOHelp.R' 'print.R' 'zzz.R'
RoxygenNote 7.1.1
VignetteBuilder knitr
NeedsCompilation no
Packaged 2021-02-24 21:42:45 UTC; user
Repository CRAN
Date/Publication 2021-02-24 22:40:06 UTC
