makeCPO: Create a Custom CPO Constructor

Description

makeCPO creates a Feature Operation CPOConstructor, i.e. a constructor for a CPO that will operate on feature columns. makeCPOTargetOp creates a Target Operation CPOConstructor, which creates CPOs that operate on the target column. makeCPORetrafoless creates a Retrafoless CPOConstructor, which creates CPOs that may operate on both feature and target columns, but have no retrafo operation. See OperatingType for further details on the distinction of these. makeCPOExtendedTrafo creates a Feature Operation CPOConstructor that has slightly more flexibility in its data transformation behaviour than makeCPO (but is otherwise identical). makeCPOExtendedTargetOp creates a Target Operation CPOConstructor that has slightly more flexibility in its data transformation behaviour than makeCPOTargetOp but is otherwise identical.

See the Examples section for some simple custom CPOs.

Usage

makeCPO(cpo.name, par.set = makeParamSet(), par.vals = NULL,
  dataformat = c("df.features", "split", "df.all", "task", "factor",
  "ordered", "numeric"), dataformat.factor.with.ordered = TRUE,
  export.params = TRUE, fix.factors = FALSE,
  properties.data = c("numerics", "factors", "ordered", "missings"),
  properties.adding = character(0), properties.needed = character(0),
  properties.target = c("cluster", "classif", "multilabel", "regr", "surv",
  "oneclass", "twoclass", "multiclass"), packages = character(0), cpo.train,
  cpo.retrafo)
makeCPOExtendedTrafo(cpo.name, par.set = makeParamSet(), par.vals = NULL,
  dataformat = c("df.features", "split", "df.all", "task", "factor",
  "ordered", "numeric"), dataformat.factor.with.ordered = TRUE,
  export.params = TRUE, fix.factors = FALSE,
  properties.data = c("numerics", "factors", "ordered", "missings"),
  properties.adding = character(0), properties.needed = character(0),
  properties.target = c("cluster", "classif", "multilabel", "regr", "surv",
  "oneclass", "twoclass", "multiclass"), packages = character(0), cpo.trafo,
  cpo.retrafo)
makeCPORetrafoless(cpo.name, par.set = makeParamSet(), par.vals = NULL,
  dataformat = c("df.all", "task"), dataformat.factor.with.ordered = TRUE,
  export.params = TRUE, fix.factors = FALSE,
  properties.data = c("numerics", "factors", "ordered", "missings"),
  properties.adding = character(0), properties.needed = character(0),
  properties.target = c("cluster", "classif", "multilabel", "regr", "surv",
  "oneclass", "twoclass", "multiclass"), packages = character(0), cpo.trafo)
makeCPOTargetOp(cpo.name, par.set = makeParamSet(), par.vals = NULL,
  dataformat = c("df.features", "split", "df.all", "task", "factor",
  "ordered", "numeric"), dataformat.factor.with.ordered = TRUE,
  export.params = TRUE, fix.factors = FALSE,
  properties.data = c("numerics", "factors", "ordered", "missings"),
  properties.adding = character(0), properties.needed = character(0),
  properties.target = "cluster", task.type.out = NULL,
  predict.type.map = c(response = "response"), packages = character(0),
  constant.invert = FALSE, cpo.train, cpo.retrafo, cpo.train.invert,
  cpo.invert)
makeCPOExtendedTargetOp(cpo.name, par.set = makeParamSet(), par.vals = NULL,
  dataformat = c("df.features", "split", "df.all", "task", "factor",
  "ordered", "numeric"), dataformat.factor.with.ordered = TRUE,
  export.params = TRUE, fix.factors = FALSE,
  properties.data = c("numerics", "factors", "ordered", "missings"),
  properties.adding = character(0), properties.needed = character(0),
  properties.target = "cluster", task.type.out = NULL,
  predict.type.map = c(response = "response"), packages = character(0),
  constant.invert = FALSE, cpo.trafo, cpo.retrafo, cpo.invert)

Arguments

cpo.name

[character(1)] The name of the resulting CPOConstructor / CPO. This is used for identification in output, and as the default id.

par.set

[ParamSet] Optional parameter set, for configuration of CPOs during construction or by hyperparameters. Default is an empty ParamSet. It is recommended to use pSS to construct this, as it greatly reduces the verbosity of creating a ParamSet and makes it more readable.
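
As a minimal sketch of the difference in verbosity, the following builds the same parameter set twice: once with pSS and once with the ParamHelpers constructors. The parameter names center and fraction are hypothetical and chosen for illustration only; it is assumed that attaching mlrCPO also attaches ParamHelpers.

```r
library(mlrCPO)  # assumed to attach ParamHelpers and mlr as well

# Hypothetical parameters, written in the compact pSS notation:
# name = default: type[range]
ps.short = pSS(center = TRUE: logical, fraction = 0.5: numeric[0, 1])

# The same parameters spelled out with ParamHelpers constructors
ps.long = makeParamSet(
  makeLogicalParam("center", default = TRUE),
  makeNumericParam("fraction", lower = 0, upper = 1, default = 0.5))

# Both sets should expose the same parameter ids
getParamIds(ps.short)
getParamIds(ps.long)
```

Either object can be given as the par.set argument.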

par.vals

[list | NULL] Named list of default parameter values for the CPO. These are used instead of the parameter default values in par.set, if not NULL. It is preferred to use ParamSet default values, and not par.vals. Default is NULL.

dataformat

[character(1)] Indicate what format the data should be as seen by the cpo.train and cpo.retrafo function. The following table shows what values of dataformat lead to what is given to cpo.train and cpo.retrafo as data and target parameter value. (Note that for Feature Operating CPOs, cpo.retrafo has no target argument.) Possibilities are:

dataformat	data	target
“df.all”	`data.frame` with target cols	target colnames
“df.features”	`data.frame` without target	`data.frame` of target
“task”	full `Task`	target colnames
“split”	list of `data.frames` by type	`data.frame` of target
[type]	`data.frame` of [type] feats only	`data.frame` of target

[type] can be any one of “factor”, “numeric”, “ordered”; if these are given, only a subset of the total data present is seen by the CPO.

Note that makeCPORetrafoless accepts only “task” and “df.all”.

For dataformat == "split", cpo.train and cpo.retrafo get a list with entries “factor”, “numeric”, “other”, and, if dataformat.factor.with.ordered is FALSE, “ordered”.

If the CPO is a Feature Operation CPO, then the return value of the cpo.retrafo function must be in the same format as the one requested. E.g. if dataformat is “split”, the return value must be a named list with entries $numeric, $factor, and $other. The types of the returned data may be arbitrary: In the given example, the $factor slot of the returned list may contain numeric data. (Note however that if data is returned that has a type not already present in the data, properties.needed must specify this.)

For Feature Operating CPOs, if dataformat is either “df.all” or “task”, the target column(s) in the returned value of the retrafo function must be identical with the target column(s) given as input.

If dataformat is “split”, the $numeric slot of the value returned by the cpo.retrafo function may also be a matrix. If dataformat is “numeric”, the returned object may also be a matrix.

Default is “df.features” for all functions except makeCPORetrafoless, for which it is “df.all”.

dataformat.factor.with.ordered

[logical(1)] Whether to treat ordered typed features as factor typed features. This affects how dataformat is handled, for which it only has an effect if dataformat is “split” or “factor”. If dataformat is “ordered”, this must be FALSE. It also affects how strictly data fed to a CPORetrafo object is checked for adherence to the data format of data given to the generating CPO. Default is TRUE.

export.params

[logical(1) | character] Indicates which CPO parameters are exported by default. Exported parameters can be changed after construction using setHyperPars, but exporting too many parameters may lead to messy parameter sets if many CPOs are combined using composeCPO or %>>%. The exported parameters can be set during construction, but export.params determines the default exported parameters. If this is a logical(1), TRUE exports all parameters, FALSE exports no parameters. It may also be a character, indicating the names of parameters to be exported. Default is TRUE.

fix.factors

[logical(1)] Whether to constrain factor levels of new data to the levels of training data, for each factorial or ordered column. If new data contains factor levels that were not present in training data, the values are set to NA. Default is FALSE.

properties.data

[character] The kind of data that the CPO will be able to handle. This can be one or more of: “numerics”, “factors”, “ordered”, “missings”. There should be a bias towards including properties. If a property is absent, the CPO will reject data that has this property. If an operation e.g. only works on numeric columns that have no missings (like PCA), it is recommended to give all properties, ignore the columns that are not numeric (using dataformat = "numeric"), and give an error when there are missings in the numeric columns (since missings in factorial features are not a problem). Defaults to the maximal set.
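
The recommendation above can be sketched as follows. The CPO xmpCenter is a hypothetical example (not part of mlrCPO) that keeps the default, maximal properties.data, looks only at numeric columns, and raises its own error if those contain missing values.

```r
library(mlrCPO)

# Hypothetical CPO that centers numeric columns; all names are illustrative.
xmpCenter = makeCPO("xmpcenter",
  dataformat = "numeric",  # cpo.train and cpo.retrafo only see numeric features
  cpo.train = function(data, target) {
    if (anyNA(data)) {
      stop("xmpcenter cannot handle missing values in numeric columns")
    }
    # control object: named vector of column means
    vapply(data, mean, numeric(1))
  },
  cpo.retrafo = function(data, control) {
    for (col in names(control)) {
      data[[col]] = data[[col]] - control[[col]]
    }
    data
  })

# The missing value in the factor column is not seen by the CPO functions
df = data.frame(a = c(1, 2, 3), f = factor(c("x", NA, "x")))
centered = df %>>% xmpCenter()
```

Because “missings” stays in properties.data, the framework does not reject the data up front; the check inside cpo.train only fires for missings in the columns the operation actually uses.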

properties.adding

[character] Can be one or many of the same values as properties.data for Feature Operation CPOs, and one or many of the same values as properties.target for Target Operation CPOs. These properties get added to a Learner (or CPO) coming after / behind this CPO. When a CPO imputes missing values, for example, this should be “missings”. This must be a subset of “properties.data” or “properties.target”.

Note that this may not contain a Task-type property, even if the CPO is a Target Operation CPO that performs conversion.

Property names may be postfixed with “.sometimes”, to indicate that adherence should not be checked internally. This distinction is made by not putting them in the $adding.min slot of the getCPOProperties return value when get.internal = TRUE.

Default is character(0).

properties.needed

[character] Can be one or many of the same values as properties.data for Feature Operation CPOs, and one or many of the same values as properties.target for Target Operation CPOs. These properties are required from a Learner (or CPO) coming after / behind this CPO. E.g., when a CPO converts factors to numerics, this should be “numerics” (and properties.adding should be “factors”).

Note that this may not contain a Task-type property, even if the CPO is a Target Operation CPO that performs conversion.

Property names may be postfixed with “.sometimes”, to indicate that adherence should not be checked internally. This distinction is made by not putting them in the $needed slot of properties. They can still be found in the $needed.max slot of the getCPOProperties return value when get.internal = TRUE.

Default is character(0).
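
A minimal sketch of the factor-to-numeric case mentioned above, assuming a stateless Feature Operation CPO. The name xmpAsNum is hypothetical; “ordered” is added to properties.adding because ordered columns are converted as well.

```r
library(mlrCPO)

# Hypothetical CPO converting every feature column to numeric.
xmpAsNum = makeCPO("xmpasnum",
  properties.adding = c("factors", "ordered"),  # a following Learner need not handle these
  properties.needed = "numerics",               # ... but it must handle numerics
  cpo.train = NULL,                             # stateless: no control object
  cpo.retrafo = function(data) {
    for (col in names(data)) {
      data[[col]] = as.numeric(data[[col]])
    }
    data
  })

df = data.frame(a = c(1.5, 2.5), f = factor(c("u", "v")))
converted = df %>>% xmpAsNum()
```

Note that as.numeric on a factor returns the integer level codes, which is sufficient for this illustration but rarely the encoding one wants in practice.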

properties.target

[character] For Feature Operation CPOs, this can be one or more of “cluster”, “classif”, “multilabel”, “regr”, “surv”, “oneclass”, “twoclass”, “multiclass”. Just as properties.data, it indicates what kind of data a CPO can work with. To handle data given as data.frame, the “cluster” property is needed. Default is the maximal set.

For Target Operation CPOs, this must contain exactly one of “cluster”, “classif”, “multilabel”, “regr”, “surv”. This indicates the type of Task the CPO can work on. If the input is a data.frame, it is treated as a “cluster” type Task. If the properties.target contains “classif”, the value must then also contain one or more of “oneclass”, “twoclass”, or “multiclass”. Default is “cluster”.

packages

[character] Package(s) that should be loaded when the CPO is constructed. This gives the user an error if a package required for the CPO is not available on their system, or cannot be loaded. Default is character(0).

cpo.train

[function | NULL] This is a function which must have the parameters data and target, as well as the parameters specified in par.set. (Alternatively, the function may have only some of these arguments and a dotdotdot argument). It is called whenever a CPO is applied to a data set to prepare for transformation of the training and prediction data. Note that this function is only used in Feature Operating CPOs created with makeCPO, and in Target Operating CPOs created with makeCPOTargetOp.

The behaviour of this function differs slightly in Feature Operation and Target Operation CPOs.

For Feature Operation CPOs, if cpo.retrafo is NULL, this is a constructor function which must return a “retrafo” function which will then modify (possibly new unseen) data. This retrafo function must have exactly one argument--the (new) data--and return the modified data. The format of the argument, and of the return value of the retrafo function, depends on the value of the dataformat parameter, see documentation there.

If cpo.retrafo is not NULL, this is a function which must return a control object. This control object returned by cpo.train will then be given as the control argument of the cpo.retrafo function, along with (possibly new unseen) data to manipulate.

For Target Operation CPOs, if cpo.retrafo is NULL, cpo.train.invert (or cpo.invert if constant.invert is TRUE) must likewise be NULL. In that case cpo.train's return value is ignored and it must define, within its namespace, two functions cpo.retrafo and cpo.train.invert (or cpo.invert if constant.invert is TRUE) which will take the place of the respective functions. cpo.retrafo must take the parameters data and target, and return the modified target (or data, depending on dataformat). cpo.train.invert must take a data and control argument and return either a modified control object, or a cpo.invert function. cpo.invert must have a target and predict.type argument and return the modified target data.

If cpo.retrafo is not NULL, cpo.train.invert (or cpo.invert if constant.invert is TRUE) must likewise be non-NULL. In that case, cpo.train must return a control object. This control object will then be given as the control argument of both cpo.retrafo and cpo.train.invert (or the control.invert argument of cpo.invert if constant.invert is TRUE).

This parameter may be NULL, resulting in a so-called stateless CPO. For Target Operation CPOs created with makeCPOTargetOp, constant.invert must be TRUE in this case. A stateless CPO does the same transformation for initial CPO application and subsequent prediction data transformation (e.g. taking the logarithm of numerical columns). Note that cpo.retrafo and cpo.invert should not have a control argument in a stateless CPO.

cpo.retrafo

[function | NULL] This is a function which must have the parameters data, target (Target Operation CPOs only) and control, as well as the parameters specified in par.set. (Alternatively, the function may have only some of these arguments and a dotdotdot argument). In Feature Operation CPOs created with makeCPO, if cpo.train is NULL, the control argument must be absent.

This function gets called during the “retransformation” step where prediction data is given to the CPORetrafo object before it is given to a fitted machine learning model for prediction. In makeCPO Feature Operation CPOs and makeCPOTargetOp Target Operation CPOs, this is also called during the first trafo step, where the CPO object is applied to training data.

In Feature Operation CPOs, this function receives the data to be transformed and must return the transformed data in the same format as it received them. The format of data is the same as the format in cpo.train and cpo.trafo, with the exception that if dataformat is “task” or “df.all”, the behaviour here is as if “df.split” had been given.

In Target Operation CPOs created with makeCPOTargetOp, this function receives the data and target to be transformed and must return the transformed target. The input format of these parameters depends on dataformat. If dataformat is “task” or “df.all”, the returned value must be the modified Task / data.frame with the feature columns not modified. Otherwise, the target values to be modified are in the target parameter, and the return value must be a data.frame of the modified target values only.

In Target Operation CPOs created with makeCPOExtendedTargetOp, this function is called during the retrafo step, and it must create a control.invert object in its environment to be used in the inversion step, as well as return the modified target data. The format of the data given to cpo.retrafo in Target Operation CPOs created with makeCPOExtendedTargetOp is the same as in other functions, with the exception that, if dataformat is “df.all” or “task”, the full data.frame or Task will be given as the target parameter, while the data parameter will behave as if dataformat “df.split”. Depending on what object the CPORetrafo object was applied to, the target argument may be NULL; in that case NULL must also be returned by the function.

If cpo.invert is NULL, cpo.retrafo should create a cpo.invert function in its environment instead of creating the control object; this function should then take the target and predict.type arguments. If constant.invert is TRUE, this function does not need to define the control.invert or cpo.invert variables, they are instead taken from cpo.trafo.

cpo.trafo

[function] This is a function which must have the parameters data and target, as well as the parameters specified in par.set. (Alternatively, the function may have only some of these arguments and a dotdotdot argument). It is called whenever a CPO is applied to a data set to transform the training data, and (except for Retrafoless CPOs) to collect a control object used by other transformation functions. Note that this function is not used in makeCPO.

This function's primary task is to transform the given data when the CPO gets applied to training data. For Target Operating CPOs (created with makeCPOExtendedTargetOp(!)), it must return the complete transformed target column(s), unless dataformat is “df.all” (in which case the complete, modified, data.frame must be returned) or “task” (in which case the complete, modified, Task must be returned). It must furthermore create the control objects for cpo.retrafo and cpo.invert, or create these functions themselves, and save them in its function environment (see below). For Retrafoless CPOs (created with makeCPORetrafoless) and Feature Operation CPOs (created with makeCPOExtendedTrafo(!)), it must return the data in the same format as it received it in its data argument (depending on dataformat). If dataformat is “df.all” or “task”, this means the target column(s) contained in the data.frame or Task returned must not be modified.

For CPOs that are not Retrafoless, a unit of information to be carried over to the retrafo step needs to be created inside the cpo.trafo function. This unit of information is a variable that must be defined inside the environment of the cpo.trafo function and will be retrieved by the CPO framework.

If cpo.retrafo is not NULL the unit is an object named “control” that will be passed on as the control argument to the cpo.retrafo function. If cpo.retrafo is NULL, the unit is a function, called “cpo.retrafo”, that will be used instead of the cpo.retrafo function passed over to makeCPOExtendedTargetOp / makeCPOExtendedTrafo. It must behave the same as the function it replaces, but has only the data (and target, for Target Operation CPOs) argument.

For Target Operation CPOs created with makeCPOExtendedTargetOp, another unit of information to be used by cpo.invert must be created. The options here are similar to cpo.retrafo: Either a control object, named control.invert, is created, or the cpo.invert function itself is given (and cpo.invert in the makeCPOExtendedTargetOp call is set to NULL), with the target and predict.type arguments.

task.type.out

[character(1) | NULL] If Task conversion is to take place, this is the output task that the data should be converted to. Note that the CPO framework takes care of the conversion if dataformat is not “task”, but the target column needs to have the proper format for that.

If this is NULL, Tasks will not be converted. Default is NULL.

predict.type.map

[character | list] This becomes the CPO's predict.type, explained in detail in PredictType.

In short, the predict.type.map is a character vector, or a list of character(1), with names according to the predict types predict can request in its predict.type argument when the created CPO was used as part of a CPOLearner to create the model under consideration. The values of predict.type.map are the predict.type that will be requested from the underlying Learner for prediction.

predict.type.map thus determines the format that the target parameter of cpo.invert can take: It is the format according to predict.type.map[predict.type], where predict.type is the respective cpo.invert parameter.

constant.invert

[logical(1)] Whether the cpo.invert step should not have information from the previous cpo.retrafo or cpo.train.invert step in Target Operation CPOs (makeCPOTargetOp or makeCPOExtendedTargetOp).

For makeCPOTargetOp, if this is TRUE, the cpo.train.invert argument must be NULL. If cpo.retrafo and cpo.invert are given, the same control object is given to both of them. Otherwise, if cpo.retrafo and cpo.invert are NULL, the cpo.train function must return NULL and define a cpo.retrafo and cpo.invert function in its namespace (see cpo.train documentation for more details). If constant.invert is FALSE, cpo.train may either return a control object that will then be given to cpo.train.invert, or define a cpo.retrafo and cpo.train.invert function in its namespace.

For makeCPOExtendedTargetOp, if this is TRUE, cpo.retrafo does not need to generate a control.invert object. The control.invert object created in cpo.trafo will then always be given to cpo.invert for all data sets.

Default is FALSE.

cpo.train.invert

[function | NULL] This is a function which must have the parameters data and control, as well as the parameters specified in par.set. (Alternatively, the function may have only some of these arguments and a dotdotdot argument).

This function receives the feature columns given for prediction, and must return a control object that will be passed on to the cpo.invert function, or it must return a function that will be treated as the cpo.invert function if the cpo.invert argument is NULL. In the latter case, the returned function takes exactly two arguments (the prediction column to be inverted, and predict.type), and otherwise behaves identically to cpo.invert.

If constant.invert is TRUE, this must be NULL.

cpo.invert

[function | NULL] This is a function which must have the parameters target (a data.frame containing the columns of a prediction made), control.invert, and predict.type, as well as the parameters specified in par.set. (Alternatively, the function may have only some of these arguments and a dotdotdot argument).

The predict.type requested by the predict or invert call is given as a character(1) in the predict.type argument. Note that this is not necessarily the predict.type of the prediction made and given as target argument, depending on the value of predict.type.map (see there).

This function performs the inversion for a Target Operation CPO. It takes a control object, which summarizes information from the training and retrafo step, and the prediction as returned by a machine learning model, and undoes the operation done to the target column in the cpo.trafo function.

For example, if the trafo step consisted of taking the logarithm of a regression target, the cpo.invert function could return the exponentiated prediction values by taking the exp of the only column in the target data.frame and returning the result of that. This kind of operation does not need information from the cpo.retrafo step and should have constant.invert set to TRUE.

As a more elaborate example, a CPO could train a model on the training data and set the target values to the residuals of that trained model. The cpo.retrafo function would then make predictions with that model on the new prediction data and save the result to the control object. The cpo.invert function would then add these predictions to the predictions given to it in the target argument to “invert” the antecedent subtraction of model predictions from target values when taking the residuals.
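
The logarithm example can be sketched as a stateless Target Operation CPO with constant.invert set to TRUE. The name xmpLogRegr is hypothetical, and the sketch assumes a strictly positive regression target, as in the bh.task example task shipped with mlr.

```r
library(mlrCPO)

# Hypothetical log-transform of a regression target; no control object is needed.
xmpLogRegr = makeCPOTargetOp("xmplogregr",
  constant.invert = TRUE,
  properties.target = "regr",
  cpo.train = NULL,          # stateless
  cpo.train.invert = NULL,   # must be NULL when constant.invert is TRUE
  cpo.retrafo = function(data, target) {
    target[[1]] = log(target[[1]])
    target
  },
  cpo.invert = function(target, predict.type) {
    exp(target)
  })

# Apply to a regression task: the target column is replaced by its logarithm
log.task = bh.task %>>% xmpLogRegr()
head(getTaskTargets(log.task))
```

When such a CPO is attached to a Learner, the framework is expected to call cpo.invert on the predictions, so that they are reported on the original scale.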

Value

[CPOConstructor]. A Constructor for CPOs.

CPO Internals

The mlrCPO package offers a powerful framework for handling the tasks necessary for preprocessing, so that the user, when creating custom CPOs, can focus on the actual data transformations to perform. It is, however, useful to understand what it is that the framework does, and how the process can be influenced by the user during CPO definition or application. Aspects of preprocessing that the user needs to influence are:

Operating Type

The core of preprocessing is the actual transformation being performed. In the most general sense, there are three points in a machine learning pipeline that preprocessing can influence.

1. Transformation of training data before model fitting, done in mlr using train. In the CPO framework (when not using a CPOLearner which makes all of these steps transparent to the user), this is done by a CPO.
2. Transformation of new validation or prediction data that is given to the fitted model for prediction, done using predict. This is done by a CPORetrafo retrieved using retrafo from the result of step 1.
3. Transformation of the predictions made to invert the transformation of the target values done in step 1, which is done using the CPOInverter retrieved using inverter from the result of step 2.

The framework poses restrictions on primitive (i.e. not compound using composeCPO) CPOs to simplify internal operation: A CPO may be one of three OperatingTypes (see there). The Feature Operation CPO does not transform target columns and hence only needs to be involved in steps 1 and 2. The Target Operation CPO only transforms target columns, and therefore mostly concerns itself with steps 1 and 3. A Retrafoless CPO may change both feature and target columns, but may not perform a retrafo or inverter operation (and is therefore only concerned with step 1). Note that this is effectively a restriction on what kind of transformation a Retrafoless CPO may perform: it must not be a transformation of the data or target space, it may only add or subtract points within this space.

The Operating Type of a CPO is ultimately dependent on the function that was used to create the CPOConstructor: makeCPO / makeCPOExtendedTrafo, makeCPOTargetOp / makeCPOExtendedTargetOp, or makeCPORetrafoless.
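
As an illustration of a Retrafoless CPO that only removes points from the data space, the following sketch subsamples rows. The name xmpSample and its fraction parameter are hypothetical.

```r
library(mlrCPO)

# Hypothetical row-subsampling CPO; it has no retrafo or inverter operation.
xmpSample = makeCPORetrafoless("xmpsample",
  par.set = pSS(fraction: numeric[0, 1]),
  dataformat = "df.all",
  cpo.trafo = function(data, target, fraction) {
    keep = sample(nrow(data), round(nrow(data) * fraction))
    data[keep, , drop = FALSE]  # columns, including targets, are left untouched
  })

half = iris %>>% xmpSample(fraction = 0.5)
nrow(half)
```

Since the columns themselves are not transformed, prediction data can be passed to a subsequent Learner unchanged, which is why no retrafo step is required.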

Data Transformation

At the core of a CPO is the modification of data it performs. For Feature Operation CPOs, the transformation of each row, during training and prediction, should happen in the same way, and it may only depend on the entirety of the training data--i.e. the value of a data row in a prediction data set may not influence the transformation of a different prediction data row. Furthermore, if a data row occurs in both training and prediction data, its transformation result should ideally be the same.

This property is ensured by makeCPO by splitting the transformation into two functions: One function that collects all relevant information from the training data (called cpo.train), and one that transforms given data, using this collected information and (potentially new, unseen) data to be transformed (called cpo.retrafo). The cpo.retrafo function should handle all data as if it were prediction data and unrelated to the data given to cpo.train.

Internally, when a CPO gets applied to a data set using applyCPO, the cpo.train function is called, and the resulting control object is used for a subsequent cpo.retrafo call which transforms the data. Before the result is given back from the applyCPO call, the control object is used to create a CPORetrafo object, which is attached to the result as attribute. Target Operating CPOs additionally create and add a CPOInverter object.

When a CPORetrafo is then applied to new prediction data, the control object previously returned by cpo.train is given, combined with this new data, to another cpo.retrafo call that performs the new transformation.
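
The sequence described above can be sketched with a small object based CPO that removes constant feature columns. The data frames are made up for illustration.

```r
library(mlrCPO)

# Constant feature remover: cpo.train collects the names of columns to keep,
# cpo.retrafo applies that selection to whatever data it is given.
constFeatRem = makeCPO("constFeatRem",
  dataformat = "df.features",
  cpo.train = function(data, target) {
    names(Filter(function(x) length(unique(x)) > 1, data))
  },
  cpo.retrafo = function(data, control) {
    data[control]
  })

train.df = data.frame(a = c(1, 1, 1), b = c(1, 2, 3))
trafo.result = train.df %>>% constFeatRem()   # cpo.train, then cpo.retrafo

# The CPORetrafo attached to the result reuses the stored control object
new.df = data.frame(a = c(5, 6), b = c(7, 8))
new.result = new.df %>>% retrafo(trafo.result)
```

Column a is dropped from new.df even though it is not constant there, because the decision was made on the training data only.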

makeCPOExtendedTrafo gives more flexibility by calling only cpo.trafo in the training step, which both creates a control object and modifies the data. This can increase performance if the underlying operation creates a control object and the transformed data in one step, as for example PCA does. Note that the requirement that the same row in training and prediction data should result in the same transformation result still stands. The cpo.trafo function returns the transformed data and creates a local variable with the control information, which the CPO framework will access.
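
A minimal sketch of the PCA case, assuming no centering or scaling so that the rotation matrix alone is sufficient as control information. The name xmpPca is hypothetical.

```r
library(mlrCPO)

# Hypothetical PCA CPO: prcomp yields rotation and rotated data in one call.
xmpPca = makeCPOExtendedTrafo("xmppca",
  dataformat = "numeric",
  cpo.trafo = function(data, target) {
    pcr = prcomp(as.matrix(data), center = FALSE, scale. = FALSE)
    control = pcr$rotation   # local variable retrieved by the CPO framework
    as.data.frame(pcr$x)     # transformed training data
  },
  cpo.retrafo = function(data, control) {
    as.data.frame(as.matrix(data) %*% control)
  })

trained = iris %>>% xmpPca()
again = iris %>>% retrafo(trained)
```

Applying the retrafo to the training data should reproduce the training result, which is the consistency requirement stated above.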

Inversion

If a CPO performs transformations of the target column, the predictions made by a following machine learning process should ideally have this transformation undone, so that if the process makes a prediction that coincides with a target value after the transformation, the whole pipeline should return a prediction that equals the target value before this transformation.

This is done by the cpo.invert function given to makeCPOTargetOp. It has access to information from both the preceding training and prediction steps. During the training step, cpo.train creates a control object that is not only given to cpo.retrafo, but also to cpo.train.invert. This latter function is called before the prediction step, whenever new data is fed to the machine learning process. It takes the new data and the old control object and transforms it to a new control.invert object to include information about the prediction data. This object is then given to cpo.invert.

It is possible to have Target Operation CPOs that do not require information from the retrafo step. This is specified by setting constant.invert to TRUE. It has the advantage that the same CPOInverter can be used for inversion of predictions made with any new data. Otherwise, a new CPOInverter object must be obtained for each new data set after the retrafo step (using the inverter function on the retrafo result). Having constant.invert set to TRUE results in hybrid retrafo / inverter objects: The CPORetrafo object can then also be used for inversions. When defining a constant.invert Target Operating CPO, no cpo.train.invert function is given, and the same control object is given to both cpo.retrafo and cpo.invert.

makeCPOExtendedTargetOp gives more flexibility and allows more efficient implementation of Target Operating CPOs at the cost of more complexity. With this method, a cpo.trafo function is given that is executed during the first training step; it must return the transformed target column, as well as a control and control.invert object. The cpo.retrafo function not only transforms the target, but must also create a new control.invert object (unless constant.invert is TRUE). The semantics of cpo.invert are identical to those in the basic makeCPOTargetOp.

cpo.train-cpo.retrafo information transfer

One possibility to transfer information from cpo.train to cpo.retrafo is to have cpo.train return a control object (a list) that is then given to cpo.retrafo. The CPO is then called an object based CPO.

Another possibility is to not give the cpo.retrafo argument (set it to NULL in the makeCPO call) and have cpo.train return a function instead. This function is then used as the cpo.retrafo function, and should have access to all relevant information about the training data as a closure. This is called a functional CPO. To save memory, the actual data (including target) given to cpo.train is removed from the environment of its return value in this case (i.e. the environment of the cpo.retrafo function). This means the cpo.retrafo function may not reference a “data” variable.
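
A functional CPO can be sketched as follows, using a constant feature remover as the illustration. The name constFeatRemFun is hypothetical. The returned function only relies on cols.keep from its enclosing environment, not on the training data itself.

```r
library(mlrCPO)

# Functional variant: cpo.train returns the retrafo function as a closure.
constFeatRemFun = makeCPO("constFeatRemFun",
  dataformat = "df.features",
  cpo.train = function(data, target) {
    cols.keep = names(Filter(function(x) length(unique(x)) > 1, data))
    # This function performs both the trafo and the retrafo. Its 'data'
    # argument is the data to transform, not the training data.
    function(data) {
      data[cols.keep]
    }
  },
  cpo.retrafo = NULL)

train.df = data.frame(a = c(1, 1, 1), b = c(1, 2, 3))
fun.result = train.df %>>% constFeatRemFun()
fun.new = data.frame(a = c(5, 6), b = c(7, 8)) %>>% retrafo(fun.result)
```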

There are similar possibilities of functional information transfer for other types of CPOs: cpo.trafo in makeCPOExtendedTrafo may create a cpo.retrafo function instead of a control object. cpo.train in makeCPOTargetOp has the option of creating a cpo.retrafo and cpo.train.invert (cpo.invert if constant.invert is TRUE) function (and returning NULL) instead of returning a control object. Similarly, cpo.train.invert may return a cpo.invert function instead of a control.invert object. In makeCPOExtendedTargetOp, cpo.trafo may create a cpo.retrafo or a cpo.invert function, each optionally instead of a control or control.invert object (one or both may be functional). cpo.retrafo similarly may create a cpo.invert function instead of giving a control.invert object. Functional information transfer may be more parsimonious and elegant than control object information transfer.

Hyperparameters

The action performed by a CPO may be influenced using hyperparameters, during its construction as well as afterwards (then using setHyperPars). Hyperparameters must be specified as a ParamSet and given as argument par.set. Default values for each parameter may be specified in this ParamSet or optionally as another argument par.vals.

Hyperparameters given are made part of the CPOConstructor function and can thus be given during construction. Parameter default values function as the default values for the CPOConstructor function parameters (which are thus made optional function parameters of the CPOConstructor function). The CPO framework handles storage and changing of hyperparameter values. When the cpo.train and cpo.retrafo functions are called to transform data, the hyperparameter values are given to them as arguments, so cpo.train and cpo.retrafo functions must be able to accept these parameters, either directly, or with a ... argument.

Note that with functional CPOs, the cpo.retrafo function does not take hyperparameter arguments (and can instead usually refer to them through its enclosing environment).

Hyperparameters may be exported (or not), thus making them available for setHyperPars. Not exporting a parameter has the advantage that it does not clutter the ParamSet of a big CPO or CPOLearner pipeline with many hyperparameters. Which hyperparameters are exported is chosen during the constructing call of a CPOConstructor, but the hyperparameters exported by default can be chosen with the export.params parameter.
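
As a minimal sketch of how hyperparameters reach the cpo.* functions, the following defines a hypothetical CPO, cpoMultiply, that multiplies all numeric features by a constant. The CPO name and its multiplier parameter are made up for illustration; the sketch assumes the mlr and mlrCPO packages are installed.

```r
library(mlr)
library(mlrCPO)

# hypothetical CPO for illustration: multiply numeric features by a constant
cpoMultiply = makeCPO("multiply",
  par.set = makeParamSet(makeNumericParam("multiplier", default = 1)),
  dataformat = "numeric",  # only numeric feature columns are passed in
  # nothing is learned from the training data here, but the hyperparameter
  # is still passed in and must be accepted (directly or via ...)
  cpo.train = function(data, target, multiplier) {
    list()
  },
  cpo.retrafo = function(data, control, multiplier) {
    data[] = lapply(data, function(col) col * multiplier)
    data
  })

cpo = cpoMultiply(multiplier = 2)  # value given during construction
getHyperPars(cpo)                  # parameter name is prefixed with the ID
cpo3 = setHyperPars(cpo, multiply.multiplier = 3)  # value changed afterwards

transformed = getTaskData(iris.task %>>% cpo3)
head(transformed)
```

Because the ID defaults to the CPO name, the exported parameter is addressed as multiply.multiplier in setHyperPars.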

Properties

Similarly to Learners, CPOs may specify what kind of data they are and are not able to handle. This is done by specifying properties.* arguments. The names of possible properties are the same as possible LearnerProperties, but since CPOs mostly concern themselves with data, only the properties indicating column and task types are relevant.

For each CPO one must specify

1. which kind of data the CPO handles,
2. which kind of data the CPO or Learner that comes after the given CPO must be able to handle, and
3. which kind of data handling capability the given CPO adds to a following CPO or Learner when coming before it in a pipeline.

The specification of (1) is done with properties.data and properties.target, (2) is specified using properties.needed, and (3) is specified using properties.adding. Internally, properties.data and properties.target are concatenated and treated as one vector; they are specified separately in makeCPO etc. for convenience. See CPOProperties for details.

The CPO framework checks the cpo.retrafo etc. functions for adherence to these properties, so it e.g. throws an error if a cpo.retrafo function adds missing values to some data but didn't declare “missings” in properties.needed. It may be desirable to have this internal checking happen to a laxer standard than the property checking when composing CPOs (e.g. when a CPO adds missings only with certain hyperparameters, one may still want to compose this CPO to another one that can't handle missings). Therefore it is possible to postfix listed properties with “.sometimes”. The internal CPO checking will ignore these when listed in properties.adding (it uses the ‘minimal’ set of adding properties, adding.min), and it will not declare them externally when listed in properties.needed (but keeps them internally in the ‘maximal’ set of needed properties, needed.max). The adding.min and needed.max can be retrieved using getCPOProperties with get.internal = TRUE.
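
A minimal sketch of declaring properties, assuming the mlr and mlrCPO packages are installed: the hypothetical cpoImputeMedian below (name and implementation made up for illustration) replaces missing values in numeric columns with medians learned during training. Since its output contains no missings, it declares “missings” in properties.adding.

```r
library(mlr)
library(mlrCPO)

# hypothetical CPO for illustration: impute numeric columns with training medians
cpoImputeMedian = makeCPO("imputeMedian",
  dataformat = "numeric",
  properties.adding = "missings",  # whatever follows need not handle missings
  cpo.train = function(data, target) {
    # control object: one median per column, learned on the training data
    vapply(data, median, numeric(1), na.rm = TRUE)
  },
  cpo.retrafo = function(data, control) {
    for (col in names(data)) {
      data[[col]][is.na(data[[col]])] = control[[col]]
    }
    data
  })

df = data.frame(a = c(1, NA, 3), b = c(10, 20, NA))
imputed = df %>>% cpoImputeMedian()
imputed

# inspect the declared properties, including the internal sets
str(getCPOProperties(cpoImputeMedian(), get.internal = TRUE))
```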

Data Format

Different CPOs may want to change different aspects of the data, e.g. they may only care about numeric columns, they may or may not care about the target column values, and sometimes they might need the actual task used as input. The CPO framework offers to present the data in a specified format to the cpo.train, cpo.retrafo and other functions, to reduce the need for boilerplate data subsetting on the user's part. The format is requested using the dataformat and dataformat.factor.with.ordered parameters. A cpo.retrafo function is expected to return data in the same format as it requested, so if it requested a Task, it must return one, while if it only requested the feature data.frame, a data.frame must be returned.

Task Conversion

Target Operation CPOs can be used for conversion between Tasks. For this, the type.out value must be given. Task conversion works with all values of dataformat and is handled by the CPO framework. The cpo.trafo function must take care to return the target data in a proper format (see above). Note that for conversion, not only does the Task type need to be changed during cpo.trafo, but also the prediction format (see above) needs to change.

Fix Factors

Some preprocessing for factorial columns needs the factor levels to be the same during training and prediction. This is usually not guaranteed by mlr, so the framework offers to do this if the fix.factors flag is set.
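
The problem this flag addresses can be illustrated in base R. The following is only a sketch of the idea of aligning factor levels, under the assumption that unseen levels become NA; it is not the framework's actual implementation, and the variable names are made up.

```r
# factor column as seen during training and during prediction
train = factor(c("a", "b", "a"))
test = factor(c("b", "c"))  # level "a" is absent, level "c" is new

# the level sets differ, so level-dependent preprocessing would misbehave
identical(levels(train), levels(test))

# align the prediction-time factor with the training-time levels;
# values with a level unseen during training become NA
test.fixed = factor(as.character(test), levels = levels(train))
levels(test.fixed)
is.na(test.fixed)
```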

ID

To prevent parameter name clashes when CPOs are concatenated, the parameters are prefixed with the CPO's ID. The ID can be set during CPO construction, but will default to the CPO's name if not given. The name is set using the cpo.name parameter.
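
A short sketch of the effect of the ID, using the built-in cpoScale and assuming the mlr and mlrCPO packages are installed. The IDs “first” and “second” are arbitrary choices for illustration.

```r
library(mlr)
library(mlrCPO)

# parameter names are prefixed with the ID (by default, the CPO's name)
names(getHyperPars(cpoScale()))
names(getHyperPars(cpoScale(id = "first")))

# distinct IDs let two CPOs of the same kind coexist in one pipeline
pipeline = cpoScale(id = "first") %>>% cpoScale(id = "second")
names(getHyperPars(pipeline))
```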

Packages

Whenever a CPO needs certain packages to be installed to work, it can specify these in the packages parameter. During construction, the framework checks that the packages are available and throws an error if any is not found. This means that loading a CPO from a savefile will omit this check, but in most cases it is a sufficient measure to make the user aware of missing packages in time.

Target Column Format

Different Task types have the target in different formats. They are listed here for reference. Target data is in this format when given to the target argument of some functions, and must be returned in this format by cpo.trafo in Target Operation CPOs. Target values are always in the format of a data.frame, even when there is only one column.

Task type	target format
“classif”	one column of `factor`
“cluster”	`data.frame` with zero columns.
“multilabel”	several columns of `logical`
“regr”	one column of `numeric`
“surv”	two columns of `numeric`

When inverting, the format of the target argument of the cpo.invert function, as well as the format of its return value, depends on the Task type as well as the predict.type. The requested return value predict.type is given to the cpo.invert function as a parameter; the predict.type of the target parameter depends on this and the predict.type.map (see PredictType). The format of the prediction, depending on the task type and predict.type, is:

Task type	`predict.type`	target format
“classif”	“response”	`factor`
“classif”	“prob”	`matrix` with nclass cols
“cluster”	“response”	`integer` cluster index
“cluster”	“prob”	`matrix` with nclustr cols
“multilabel”	“response”	`logical` `matrix`
“multilabel”	“prob”	`matrix` with nclass cols
“regr”	“response”	`numeric`
“regr”	“se”	2-col `matrix`
“surv”	“response”	`numeric`
“surv”	“prob”	[NOT YET SUPPORTED]

All matrix formats are numeric, unless otherwise stated.

Headless function definitions

In the place of all cpo.* arguments, it is possible to make a headless function definition, consisting only of the function body. This function body must always begin with a ‘{’. For example, instead of cpo.retrafo = function(data, control) data[-1], it is possible to use cpo.retrafo = { data[-1] }. The necessary function head is then added automatically by the CPO framework. This will always contain the necessary parameters (e.g. “data”, “target”, hyperparameters as defined in par.set) with the names as required. This can declutter the definition of a CPOConstructor and is recommended if the CPO consists of few lines.

Note that if this is used inside a function when writing an R package, it may cause the automatic R correctness checker to print warnings.
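
As a sketch, the constant feature remover from the Examples section could be written with headless definitions as follows. This assumes the mlr and mlrCPO packages are installed; the data.frame df is made up for illustration.

```r
library(mlr)
library(mlrCPO)

# headless variant: only the function bodies are given; the framework adds
# the function heads (data, target for cpo.train; data, control for cpo.retrafo)
constFeatRem = makeCPO("constFeatRem",
  dataformat = "df.features",
  cpo.train = {
    # names of columns with more than one distinct value
    names(Filter(function(x) length(unique(x)) > 1, data))
  },
  cpo.retrafo = {
    data[control]
  })

df = data.frame(a = 1:3, b = c(1, 1, 1))
result = df %>>% constFeatRem()
names(result)
```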

Examples

# NOT RUN {
# an example constant feature remover CPO
constFeatRem = makeCPO("constFeatRem",
  dataformat = "df.features",
  cpo.train = function(data, target) {
    names(Filter(function(x) {  # names of columns to keep
        length(unique(x)) > 1
      }, data))
  }, cpo.retrafo = function(data, control) {
    data[control]
  })
# alternatively:
constFeatRem = makeCPO("constFeatRem",
  dataformat = "df.features",
  cpo.train = function(data, target) {
    cols.keep = names(Filter(function(x) {
        length(unique(x)) > 1
      }, data))
    # the following function will do both the trafo and retrafo
    result = function(data) {
      data[cols.keep]
    }
    result
  }, cpo.retrafo = NULL)
# }
