mlrCPO v0.3.7-2
Composable Preprocessing Operators and Pipelines for Machine Learning
Toolset that enriches 'mlr' with a diverse set of preprocessing
operators. Composable Preprocessing Operators ("CPO"s) are first-class
R objects that can be applied to data.frames and 'mlr' "Task"s to modify
data, can be attached to 'mlr' "Learner"s to add preprocessing to machine
learning algorithms, and can be composed to form preprocessing pipelines.
Readme
mlrCPO: Composable Preprocessing Operators for mlr
GSoC 2017 Project: Operator Based Machine Learning Pipeline Construction
What is CPO?
> task = iris.task
> task %<>>% cpoScale(scale = FALSE) %>>% cpoPca() %>>% # pca
+   cpoFilterChiSquared(abs = 3) %>>% # filter
+   cpoModelMatrix(~ 0 + .^2) # interactions
> head(getTaskData(task))
        PC1        PC2         PC3    PC1:PC2     PC1:PC3      PC2:PC3 Species
1 -2.684126 -0.3193972  0.02791483  0.8573023 -0.07492690 -0.008915919  setosa
2 -2.714142  0.1770012  0.21046427 -0.4804064 -0.57122986  0.037252434  setosa
3 -2.888991  0.1449494 -0.01790026 -0.4187575  0.05171367 -0.002594632  setosa
4 -2.745343  0.3182990 -0.03155937 -0.8738398  0.08664130 -0.010045316  setosa
5 -2.728717 -0.3267545 -0.09007924  0.8916204  0.24580071  0.029433798  setosa
6 -2.280860 -0.7413304 -0.16867766  1.6908707  0.38473006  0.125045884  setosa
"Composable Preprocessing Operators" are an extension for the mlr ("Machine Learning in R") project that represents preprocessing operations (e.g. imputation or PCA) in the form of R objects. These CPO objects can be composed to form more complex operations, they can be applied to data sets, and they can be attached to mlr Learner objects to generate complex machine learning pipelines that perform both preprocessing and model fitting.
Short Overview
CPOs are created by calling a constructor.
> cpoScale()
scale(center = TRUE, scale = TRUE)
The created objects have hyperparameters that can be manipulated using getHyperPars, setHyperPars etc., just like in mlr.
> getHyperPars(cpoScale())
$scale.center
[1] TRUE
$scale.scale
[1] TRUE
> setHyperPars(cpoScale(), scale.center = FALSE)
scale(center = FALSE, scale = TRUE)
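The "scale." prefix in these hyperparameter names comes from the ID of the CPO. The following is a minimal sketch of how that prefix could be changed, assuming that the constructor accepts an `id` argument and that setCPOId and getCPOId (both listed in the function index) behave as their titles suggest. The IDs "std" and "zscore" are made up for illustration.

```r
library(mlr)
library(mlrCPO)

# Default: hyperparameter names are prefixed with the CPO's ID, here "scale".
default_names = names(getHyperPars(cpoScale()))

# Assumption: passing `id` to the constructor changes that prefix.
custom = cpoScale(id = "std")
custom_names = names(getHyperPars(custom))

# setCPOId() sets the ID of an already constructed CPO.
renamed = setCPOId(cpoScale(), "zscore")
renamed_id = getCPOId(renamed)

print(default_names)
print(custom_names)
print(renamed_id)
```

Distinct IDs become relevant when the same kind of CPO appears more than once in a pipeline, since the hyperparameter names would otherwise clash.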
The %>>% operator can be used to create complex pipelines.
> cpoScale() %>>% cpoPca()
(scale >> pca)(scale.center = TRUE, scale.scale = TRUE)
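A composed pipeline can also be taken apart and rebuilt. The sketch below uses as.list, pipeCPO and NULLCPO from the function index; it is an illustration of how these pieces might fit together, not a description of how pipelines must be built.

```r
library(mlr)
library(mlrCPO)

pipeline = cpoScale() %>>% cpoPca()

# Split the pipeline into its constituent CPOs.
steps = as.list(pipeline)
print(length(steps))

# Chain a list of CPOs into a single pipeline again.
rebuilt = pipeCPO(list(cpoScale(), cpoPca()))
print(rebuilt)

# NULLCPO is the neutral element of composition:
# composing with it leaves the other CPO unchanged.
unchanged = cpoScale() %>>% NULLCPO
print(unchanged)
```

Building pipelines from a list in this way can be convenient when the sequence of steps is decided programmatically.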
The %>>% operator can also be used to apply an operation to a data set:
> head(iris %>>% cpoPca())
  Species       PC1      PC2          PC3          PC4
1  setosa -5.912747 2.302033  0.007401536  0.003087706
2  setosa -5.572482 1.971826  0.244592251  0.097552888
3  setosa -5.446977 2.095206  0.015029262  0.018013331
4  setosa -5.436459 1.870382  0.020504880 -0.078491501
5  setosa -5.875645 2.328290 -0.110338269 -0.060719326
6  setosa -6.477598 2.324650 -0.237202487 -0.021419633
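Applying a CPO to data also "trains" it on that data. A trained transformation can then be re-applied to new data, which is what the "Retrafo" entries in the function index refer to. The following is a hedged sketch using retrafo() from mlrCPO; the split of iris into odd and even rows is arbitrary and only serves as an illustration.

```r
library(mlr)
library(mlrCPO)

# Arbitrary split for illustration: odd rows for training, even rows as "new" data.
train_df = iris[seq(1, 150, by = 2), ]
new_df = iris[seq(2, 150, by = 2), ]

# Applying the CPO estimates centers and scales on train_df
# and returns the transformed training data.
train_scaled = train_df %>>% cpoScale()

# retrafo() retrieves the trained transformation from the result.
rt = retrafo(train_scaled)

# Re-apply the transformation, with the values learned on train_df, to new data.
new_scaled = new_df %>>% rt

print(head(new_scaled))
```

The new data is scaled with the statistics of the training data, so its column means are in general close to, but not exactly, zero.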
The operator can also attach an operation to an mlr Learner, which extends the Learner's hyperparameters by the CPO's hyperparameters:
> cpoScale() %>>% makeLearner("classif.logreg")
Learner classif.logreg.scale from package stats
Type: classif
Name: ; Short name:
Class: CPOLearner
Properties: numerics,factors,prob,twoclass
Predict-Type: response
Hyperparameters: model=FALSE,scale.center=TRUE,scale.scale=TRUE
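The resulting CPOLearner can be used like any other mlr Learner. The sketch below changes one of the CPO's hyperparameters on the Learner, then trains and predicts. It assumes the pid.task example task shipped with mlr (a two-class problem, as required by classif.logreg) and uses an arbitrary split into training and test rows.

```r
library(mlr)
library(mlrCPO)

lrn = cpoScale() %>>% makeLearner("classif.logreg")

# The CPO's hyperparameters are now hyperparameters of the Learner.
lrn = setHyperPars(lrn, scale.center = FALSE)
print(getHyperPars(lrn))

# Arbitrary split for illustration: first two thirds for training, the rest for testing.
n = getTaskSize(pid.task)
train_idx = seq_len(floor(2 * n / 3))
test_idx = setdiff(seq_len(n), train_idx)

# Training fits the scaling and the logistic regression on the training rows only.
model = train(lrn, pid.task, subset = train_idx)

# Prediction re-applies the trained scaling before calling the model.
pred = predict(model, pid.task, subset = test_idx)
print(performance(pred, measures = acc))
```

Because preprocessing is part of the Learner, it is fitted only on the rows used for training, which also holds inside mlr's resampling and tuning.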
Get a list of all CPOs by calling listCPO().
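A short sketch of inspecting that listing follows. It assumes that listCPO() returns a data.frame with one row per built-in CPO; the exact columns are not stated here, so the example only prints them rather than relying on particular names.

```r
library(mlr)
library(mlrCPO)

cpos = listCPO()

# Assumption: the result is a data.frame with one row per built-in CPO.
print(colnames(cpos))
print(nrow(cpos))
```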
Installation
Install mlrCPO from CRAN, or use the more recent GitHub version:
devtools::install_github("mlr-org/mlrCPO")
Documentation
To effectively use mlrCPO, you should first familiarize yourself a little with mlr. There is an extensive tutorial online; for more resources on mlr, see the overview on mlr's GitHub page.
To get familiar with mlrCPO, it is recommended that you read the vignettes. For each vignette, there is also a compact version that has all the R output removed.
- First Steps: Introduction and short overview (compact version).
- mlrCPO Core: Description of general tools for CPO handling (compact version).
- Builtin CPOs: Listing and description of all builtin CPOs (compact version).
- Custom CPOs: How to create your own CPOs (compact version).
- CPO Internals: A small intro guide for developers into the code base. See the info directory for pdf / html versions.
For more documentation of individual mlrCPO functions, use R's built-in help() functionality.
Project Status
The foundation of mlrCPO is built and is reasonably stable; only small improvements and stability fixes are expected here. There are still many concrete implementations of preprocessing operators to be written.
Contributing
Bugs, Questions, Feedback
mlrCPO is a free and open source software project that encourages participation and feedback. If you have any issues, questions, suggestions or feedback, please do not hesitate to open an "issue" about it on the GitHub page!
In case of problems / bugs, it is often helpful if you provide a "minimum working example" that showcases the behaviour (but don't worry about this if the bug is obvious).
Please understand that the resources of the project are limited: responses may sometimes be delayed by a few days, and some suggestions may not become features for a while.
Contributing Code, Pull Requests
Pull Requests that fix small issues are very welcome, especially if they contain tests that check for the given issue. For larger contributions, or Pull Requests that add features, please note:
- Adding new CPOs is always welcome. Please have a look at a few examples in the current codebase (the PCA CPO and the corresponding tests file are good for this, and show that adding a CPO does not require a lot of code) to familiarise yourself with the conventions. A CPO that comes with documentation, in particular also documenting the CPOTrained state, and with tests, is most likely to get merged quickly.
- Adding or changing features of the backend, or changing the functioning of the backend, is a more complicated story. If a Pull Request is incongruent with the "vision" behind mlrCPO, or if it appears to put a large burden on the mlrCPO developers in the long term relative to the problems it solves, it may have a slim chance of getting merged. Therefore, if you plan to make a contribution changing CPO core behaviour, it is best if you first open an "issue" about it for discussion.
When creating Pull Requests, please follow the Style Guide. Adherence to this is checked by the CI system (Travis). On Linux (and possibly Mac) you can check this locally on your computer using the quicklint tool in the tools directory. This is recommended to avoid frustrating failed builds caused by style violations.
Before merging a Pull Request, it is possible that an mlrCPO developer makes further changes to it, e.g. to harmonise it with conventions, or to incorporate other ideas.
When you make a Pull Request, it is assumed that you permit us (and are able to permit us) to incorporate the given code into the mlrCPO codebase as given, or with modifications, and distribute the result under the BSD 2-Clause License.
Similar Projects
There are other projects that provide functionality similar to mlrCPO for other machine learning frameworks. The caret project provides some preprocessing functionality, though it is not as flexible as mlrCPO. dplyr has similar syntax and some overlapping functionality, but is ultimately focused more on (manual) data manipulation than on preprocessing integrated into machine learning pipelines. Much closer to mlrCPO's functionality is the Recipes package. scikit-learn also has preprocessing functionality built in.
License
The BSD 2-Clause License
Functions in mlrCPO
Name | Description | |
clearRI | Clear Retrafo and Inverter Attributes | |
attachCPO | Attach a CPO to a Learner | |
CPOConstructor | Constructor for CPO Objects | |
applyCPO | Apply a CPO to Data | |
CPO | Composable Preprocessing Operators | |
NULLCPO | CPO Composition Neutral Element | |
CPOLearner | CPO Learner Object | |
as.list.CPO | Split a Pipeline into Its Constituents | |
CPOTrained | Get the Retransformation or Inversion Function from a Resulting Object | |
composeCPO | CPO Composition | |
covrTraceCPOs | Add 'covr' coverage to CPOs | |
cpoDropConstants | Drop Constant or Near-Constant Features | |
cpoCollapseFact | Combine Rare Factors | |
cpoCache | Caches the Result of CPO Transformations | |
cpoCbind | “cbind” the Result of Multiple CPOs | |
cpoApplyFun | Apply a Function Element-Wise | |
cpoAsNumeric | Convert All Features to Numerics | |
cpoApplyFunRegrTarget | Transform a Regression Target Variable | |
cpoDropMostlyConstants | Drop Constant or Near-Constant Features | |
cpoDummyEncode | CPO Dummy Encoder | |
cpoFilterCarscore | Filter Features: “carscore” | |
cpoFilterGainRatio | Filter Features: “gain.ratio” | |
cpoFilterFeatures | Filter Features by Thresholding Filter Values | |
cpoFilterOneR | Filter Features: “oneR” | |
cpoFilterAnova | Filter Features: “anova.test” | |
cpoFilterChiSquared | Filter Features: “chi.squared” | |
cpoFilterMrmr | Filter Features: “mrmr” | |
cpoFilterKruskal | Filter Features: “kruskal.test” | |
cpoFilterLinearCorrelation | Filter Features: “linear.correlation” | |
cpoFilterInformationGain | Filter Features: “information.gain” | |
cpoFilterRfImportance | Filter Features: “randomForest.importance” | |
cpoFilterSymmetricalUncertainty | Filter Features: “symmetrical.uncertainty” | |
cpoFilterRfSRCMinDepth | Filter Features: “randomForestSRC.var.select” | |
cpoFilterRelief | Filter Features: “relief” | |
cpoFilterRfCImportance | Filter Features: “cforest.importance” | |
cpoFilterVariance | Filter Features: “variance” | |
cpoFilterUnivariate | Filter Features: “univariate.model.score” | |
cpoFilterRfSRCImportance | Filter Features: “randomForestSRC.rfsrc” | |
cpoFilterRankCorrelation | Filter Features: “rank.correlation” | |
cpoFilterPermutationImportance | Filter Features: “permutation.importance” | |
cpoImpactEncodeClassif | Impact Encoding | |
cpoImpactEncodeRegr | Impact Encoding | |
cpoIca | Construct a CPO for ICA Preprocessing | |
cpoFixFactors | Clean Up Factorial Features | |
cpoImputeHist | Perform Imputation with Random Values | |
cpoImputeLearner | Perform Imputation with an mlr Learner | |
cpoImputeConstant | Perform Imputation with Constant Value | |
cpoImpute | Impute and Re-Impute Data | |
cpoImputeMax | Perform Imputation with Multiple of Minimum | |
cpoImputeMean | Perform Imputation with Mean Value | |
cpoImputeNormal | Perform Imputation with Normally Distributed Random Values | |
cpoLogTrafoRegr | Log-Transform a Regression Target Variable. | |
cpoImputeUniform | Perform Imputation with Uniformly Random Values | |
cpoImputeMode | Perform Imputation with Mode Value | |
cpoMakeCols | Create Columns from Expressions | |
cpoMissingIndicators | Convert Data into Factors Indicating Missing Data | |
cpoImputeMin | Perform Imputation with Multiple of Minimum | |
cpoImputeMedian | Perform Imputation with Median Value | |
cpoModelMatrix | Create a “Model Matrix” from the Data Given a Formula | |
cpoOversample | Over- or Undersample Binary Classification Tasks | |
cpoProbEncode | Probability Encoding | |
cpoResponseFromSE | Use the “se” predict.type for “response” Prediction | |
cpoScaleRange | Range Scaling CPO | |
cpoRegrResiduals | Train a Model on a Task and Return the Residual Task | |
cpoPca | Construct a CPO for PCA Preprocessing | |
cpoQuantileBinNumerics | Split Numeric Features into Quantile Bins | |
cpoSample | Sample Data from a Task | |
cpoScaleMaxAbs | Max Abs Scaling CPO | |
cpoScale | Construct a CPO for Scaling / Centering | |
cpoSelect | Drop All Columns Except Certain Selected Ones from Data | |
cpoSpatialSign | Scale Rows to Unit Length | |
cpoSmote | Perform SMOTE Oversampling for Binary Classification | |
cpoWrap | CPO Wrapper | |
discrete | defined to avoid problems with the static type checker | |
getCPOPredictType | Get the CPO predict.type | |
getCPOConstructor | Get the CPOConstructor Used to Create a CPO Object | |
getLearnerCPO | Get the CPO Associated with a Learner | |
getCPOClass | Get the CPO Class | |
getLearnerBare | Get the Learner with the CPOs Removed | |
getCPOOperatingType | Determine the Operating Type of the CPO | |
funct | defined to avoid problems with the static type checker | |
getCPOAffect | Get the Selection Arguments for Affected CPOs | |
getCPOTrainedState | Get the Internal State of a CPORetrafo Object | |
getCPOId | Get the ID of a CPO Object | |
getCPOTrainedCapability | Get the CPOTrained's Capabilities | |
getCPOName | Get the CPO Object's Name | |
cpoTemplate | Dummy Function for Documentation Purposes | |
cpoTransformParams | Transform CPO Hyperparameters | |
getCPOProperties | Get the Properties of the Given CPO Object | |
getCPOTrainedCPO | Get CPO Used to Train a Retrafo / Inverter | |
makeCPOCase | Build Data-Dependent CPOs | |
makeCPO | Create a Custom CPO Constructor | |
is.nullcpo | Check for NULLCPO | |
makeCPOTrainedFromState | Create a CPOTrained with Given Internal State | |
is.inverter | Check CPOInverter | |
makeCPOMultiplex | CPO Multiplexer | |
pipeCPO | Turn a list of CPOs into a Single Chained One | |
%>>% | CPO Composition / Attachment / Application Operator | |
identicalCPO | Check Whether Two CPO are Fundamentally the Same | |
print.CPOConstructor | Print CPO Objects | |
mlrCPO-package | Composable Preprocessing Operators | |
listCPO | List all Built-in CPOs | |
invert | Invert Target Preprocessing | |
setCPOId | Set the ID of a CPO Object | |
internal%>>% | Internally Used %>>% Operators | |
untyped | defined to avoid problems with the static type checker | |
nullcpoToNull | NULLCPO to NULL | |
pSS | Turn the argument list into a ParamSet | |
is.retrafo | Check CPORetrafo | |
nullToNullcpo | NULL to NULLCPO | |
Details
URL | https://github.com/mlr-org/mlrCPO |
BugReports | https://github.com/mlr-org/mlrCPO/issues |
License | BSD_2_clause + file LICENSE |
Encoding | UTF-8 |
LazyData | yes |
Config/testthat/edition | 3 |
Config/testthat/parallel | true |
ByteCompile | yes |
Collate | 'CPOHelp.R' 'fauxCPOConstructor.R' 'auxiliary.R' 'ParamSetSugar.R' 'callInterface.R' 'FormatCheck.R' 'callCPO.R' 'properties.R' 'parameters.R' 'listCPO.R' 'makeCPO.R' 'CPO_applyFun.R' 'CPO_asNumeric.R' 'operators.R' 'NULLCPO.R' 'CPO_meta.R' 'CPO_cbind.R' 'CPO_collapseFact.R' 'CPO_dropConstants.R' 'CPO_dropMostlyConstants.R' 'CPO_encode.R' 'CPO_filterFeatures.R' 'CPO_fixFactors.R' 'CPO_ica.R' 'CPO_impute.R' 'CPO_makeCols.R' 'CPO_missingIndicators.R' 'CPO_modelMatrix.R' 'CPO_pca.R' 'CPO_quantileBinNumerics.R' 'CPO_regrResiduals.R' 'CPO_responseFromSE.R' 'CPO_scale.R' 'CPO_scaleMaxAbs.R' 'CPO_scaleRange.R' 'CPO_select.R' 'CPO_smote.R' 'CPO_spatialSign.R' 'CPO_subsample.R' 'CPO_wrap.R' 'RetrafoState.R' 'attributes.R' 'auxhelp.R' 'composeProperties.R' 'doublecaret.R' 'inverter.R' 'learner.R' 'makeCPOHelp.R' 'print.R' 'zzz.R' |
RoxygenNote | 7.1.1 |
VignetteBuilder | knitr |
NeedsCompilation | no |
Packaged | 2021-02-24 21:42:45 UTC; user |
Repository | CRAN |
Date/Publication | 2021-02-24 22:40:06 UTC |
imports | backports (>= 1.1.0) , BBmisc (>= 1.11) , checkmate (>= 1.8.3) , methods , stats , stringi , utils |
suggests | care , digest , DiscriMiner , e1071 , fastICA , FSelector , FSelectorRcpp , Hmisc , knitr , lintr , mlbench , mRMRe , party , praznik , randomForest , randomForestSRC , ranger (>= 0.8.0) , rex , Rfast , rmarkdown , rpart , testthat |
depends | mlr (>= 2.12) , ParamHelpers (>= 1.10) , R (>= 3.0.2) |
Contributors | Michel Lang, Lars Kotthoff, Bernd Bischl |