PipeOpTaskPreproc: PipeOpTaskPreproc

Description

Base class for handling most "preprocessing" operations. These are operations that have exactly one Task input and one Task output, and expect the column layout of these Tasks during input and output to be the same.

Inheriting classes must implement $train_task() and $predict_task(), each of which takes a Task as input and should return that same Task. The Task should, if possible, be manipulated in-place and should not be cloned.

Alternatively, the $train_dt() and $predict_dt() functions can be implemented, which operate on data.table objects instead. This should generally only be done if all data is altered in some way (e.g. PCA changing all columns to principal components), and not if only a few columns are added or removed (e.g. feature selection), because the latter should be done at the Task level with $train_task(). The $select_cols() function can be overloaded so that $train_dt() and $predict_dt() operate only on a subset of the Task's data, e.g. only on numerical columns.

If the can_subset_cols argument of the constructor is TRUE (the default), then the hyperparameter affect_columns is added, which can limit the columns of the Task that are modified by the PipeOpTaskPreproc using a Selector function. Note that this functionality is entirely independent of the $select_cols() functionality.

PipeOpTaskPreproc is useful for operations that behave differently during training and prediction. For operations that do essentially the same thing in both phases and only need extra work during training to build a $state, the PipeOpTaskPreprocSimple class can be used instead.

Format

Abstract R6Class inheriting from PipeOp.

Construction

PipeOpTaskPreproc$new(id, param_set = ParamSet$new(), param_vals = list(), can_subset_cols = TRUE, packages = character(0), task_type = "Task")

id :: character(1) Identifier of resulting object. See $id slot of PipeOp.
param_set :: ParamSet Parameter space description. This should be created by the subclass and given to super$initialize().
param_vals :: named list List of hyperparameter settings, overwriting the hyperparameter settings given in param_set. The subclass should have its own param_vals parameter and pass it on to super$initialize(). Default list().
can_subset_cols :: logical(1) Whether the affect_columns parameter should be added which lets the user limit the columns that are modified by the PipeOpTaskPreproc. This should generally be FALSE if the operation adds or removes rows from the Task, and TRUE otherwise. Default is TRUE.
packages :: character Set of all required packages for the PipeOp's $train and $predict methods. See $packages slot. Default is character(0).
task_type :: character(1) The class of Task that should be accepted as input and will be returned as output. This should generally be a character(1) identifying a type of Task, e.g. "Task", "TaskClassif" or "TaskRegr" (or another subclass introduced by other packages). Default is "Task".
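
The way a subclass forwards these arguments can be sketched as follows. This is a minimal illustration, not a complete PipeOp: the class name PipeOpRowFilterSketch and its id are hypothetical, and only the constructor is shown (the training and prediction hooks are omitted). It assumes a row-changing operation on classification Tasks, which is why can_subset_cols is set to FALSE.

```r
library(R6)
library(mlr3pipelines)

# Hypothetical subclass; only the constructor is sketched here.
PipeOpRowFilterSketch = R6Class("PipeOpRowFilterSketch",
  inherit = PipeOpTaskPreproc,
  public = list(
    initialize = function(id = "rowfilter_sketch", param_vals = list()) {
      super$initialize(
        id = id,
        param_vals = param_vals,    # forwarded from the subclass constructor
        can_subset_cols = FALSE,    # rows would change, so no affect_columns
        task_type = "TaskClassif"   # accept and return classification Tasks
      )
    }
  )
)

op = PipeOpRowFilterSketch$new()
op$input$train       # Task type accepted on the "input" channel
op$param_set$ids()   # contains no "affect_columns" entry
```

Because can_subset_cols is FALSE, the resulting object has no affect_columns hyperparameter, and both channels are typed with the value given as task_type.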

Input and Output Channels

PipeOpTaskPreproc has one input channel named "input", taking a Task, or a subclass of Task if the task_type construction argument is given as such; both during training and prediction.

PipeOpTaskPreproc has one output channel named "output", producing a Task, or a subclass; the Task type is the same as for input; both during training and prediction.

The output Task is the input Task, modified according to the overloaded $train_task()/$predict_task() or $train_dt()/$predict_dt() functions.

State

The $state is a named list; besides members added by inheriting classes, the members are:

affect_cols :: character Names of features being selected by the affect_columns parameter, if present; names of all present features otherwise.
intasklayout :: data.table Copy of the training Task's $feature_types slot. This is used during prediction to ensure that the prediction Task has the same features, feature layout, and feature types as during training.
outtasklayout :: data.table Copy of the trained Task's $feature_types slot. This is used during prediction to ensure that the Task resulting from the prediction operation has the same features, feature layout, and feature types as after training.
dt_columns :: character Names of features selected by the $select_cols() call during training. This is only present if the $train_dt() functionality is used, and not present if the $train_task() function is overloaded instead.
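
As a rough illustration, the bookkeeping entries can be inspected on any trained subclass. The sketch below uses PipeOpScale and the "iris" example Task purely as convenient stand-ins; neither is mentioned in this section, and the entries contributed by PipeOpScale itself are not described here.

```r
library(mlr3)
library(mlr3pipelines)

op = PipeOpScale$new()
invisible(op$train(list(tsk("iris"))))

names(op$state)         # subclass entries plus the bookkeeping entries
op$state$intasklayout   # feature names and types seen during training
op$state$dt_columns     # columns that were handed to $train_dt()
```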

Parameters

affect_columns :: function | Selector | NULL Which columns the PipeOpTaskPreproc should operate on. This parameter is only present if the constructor is called with the can_subset_cols argument set to TRUE (the default). The parameter must be a Selector function, which takes a Task as argument and returns a character vector of the feature names to use. See Selector for example functions. Defaults to NULL, which selects all features.
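
A short usage sketch of this parameter follows. It assumes PipeOpScale as the concrete subclass, the "iris" example Task, and selector_grep() as the Selector; all three are choices made for illustration only.

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("iris")

# Restrict scaling to the two sepal columns.
op = PipeOpScale$new(
  param_vals = list(affect_columns = selector_grep("^Sepal"))
)

scaled = op$train(list(task))[[1]]

# Sepal columns are standardized; petal columns keep their original values.
sepal_means = colMeans(scaled$data(cols = c("Sepal.Length", "Sepal.Width")))
petal_after = scaled$data(cols = "Petal.Length")[[1]]
petal_before = task$data(cols = "Petal.Length")[[1]]
```

Leaving affect_columns at NULL would instead let the operator act on every column it would normally select.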

Internals

PipeOpTaskPreproc is an abstract class inheriting from PipeOp. It implements the $train_internal() and $predict_internal() functions. These functions perform checks and go on to call $train_task() and $predict_task(). A subclass of PipeOpTaskPreproc may overload $train_task() and $predict_task(), or implement $train_dt() and $predict_dt() instead. The latter works because the default implementations of $train_task() and $predict_task() call $train_dt() and $predict_dt(), respectively.

The affect_columns functionality works by hiding the unaffected columns before processing, which is done by removing their "feature" col_role, and by restoring them afterwards, which is done by setting their col_role to "feature" again.
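
The column-role mechanism can be imitated by hand on a plain Task. The following is only a sketch of the idea using the Task's $col_roles field, not the code PipeOpTaskPreproc actually runs; the variable names and the choice of the "iris" example Task are assumptions made for illustration.

```r
library(mlr3)

task = tsk("iris")
affected = c("Sepal.Length", "Sepal.Width")   # what a Selector might return
set_aside = setdiff(task$feature_names, affected)

# Hide the unaffected columns by dropping their "feature" role.
task$col_roles$feature = affected
visible_during = task$feature_names

# ... the preprocessing step would run here on the visible columns ...

# Restore the hidden columns by giving them the "feature" role again.
task$col_roles$feature = union(task$col_roles$feature, set_aside)
visible_after = task$feature_names
```

The underlying data is never dropped in this scheme; only the roles change, which is why the hidden columns can be restored unchanged.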

Fields

Fields inherited from PipeOp.

Methods

Methods inherited from PipeOp, as well as:

train_task (Task) -> Task Called by the PipeOpTaskPreproc's implementation of $train_internal(). Takes a single Task as input and modifies it (ideally in-place without cloning) while storing information in the $state slot. Note that unlike $train_internal(), the argument is not a list but a singular Task, and the return object is also not a list but a singular Task. Also, contrary to $train_internal(), the $state being generated must be a list, which the PipeOpTaskPreproc will add additional slots to (see Section State). Care should be taken to avoid name collisions between $state elements added by $train_task() and PipeOpTaskPreproc. By default this function calls the $train_dt() function, but it can be overloaded to perform operations on the Task directly.
predict_task (Task) -> Task Called by the PipeOpTaskPreproc's implementation of $predict_internal(). Takes a single Task as input and modifies it (ideally in-place without cloning) while using information in the $state slot. Works analogously to $train_task(). $predict_task() should only be overloaded if $train_task() is overloaded (i.e. $train_dt() is not used).
train_dt(dt, levels, target) (data.table, named list, any) -> data.table | data.frame | matrix Train PipeOpTaskPreproc on dt, transform it and store a state in $state. A transformed object must be returned that can be converted to a data.table using as.data.table. dt does not need to be copied deliberately; changing it in-place is possible and encouraged. The levels argument is a named list of factor levels for factor or character features. The target argument contains the $truth() information of the training Task; its type depends on the Task type being trained on. This method can be overloaded when inheriting from PipeOpTaskPreproc, together with $predict_dt() and optionally $select_cols(); alternatively, $train_task() and $predict_task() can be overloaded.
predict_dt(dt, levels) (data.table, named list) -> data.table | data.frame | matrix Predict on new data in dt, possibly using the stored $state. A transformed object must be returned that can be converted to a data.table using as.data.table. dt does not need to be copied deliberately; changing it in-place is possible and encouraged. The levels argument is a named list of factor levels for factor or character features. This method can be overloaded when inheriting from PipeOpTaskPreproc, together with $train_dt() and optionally $select_cols(); alternatively, $train_task() and $predict_task() can be overloaded.
select_cols(task) (Task) -> character Selects which columns the PipeOp operates on, if $train_dt() and $predict_dt() are overloaded. This function is not called if $train_task() and $predict_task() are overloaded. In contrast to the affect_columns parameter, select_cols is for the inheriting class to determine which columns the operator should function on, e.g. based on feature type, while affect_columns is a way for the user to limit the columns that a PipeOpTaskPreproc should operate on. This method can optionally be overloaded when inheriting from PipeOpTaskPreproc, together with $train_dt() and $predict_dt(); alternatively, $train_task() and $predict_task() can be overloaded. If this method is not overloaded, it defaults to selecting all columns.
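
How the three data.table hooks fit together can be sketched with a small centering operator. The class PipeOpCenterSketch is hypothetical and written only for illustration. It assumes the interface documented on this page, in which select_cols, train_dt and predict_dt are public methods; a release that defines these hooks as private methods would need them moved accordingly.

```r
library(R6)
library(data.table)
library(mlr3)
library(mlr3pipelines)

# Hypothetical example class: subtracts the training mean from numeric columns.
PipeOpCenterSketch = R6Class("PipeOpCenterSketch",
  inherit = PipeOpTaskPreproc,
  public = list(
    initialize = function(id = "center_sketch", param_vals = list()) {
      super$initialize(id = id, param_vals = param_vals)
    },
    # Operate on numeric features only.
    select_cols = function(task) {
      ft = task$feature_types
      ft$id[ft$type == "numeric"]
    },
    # Learn the column means, store them in the state, center dt in-place.
    train_dt = function(dt, levels, target) {
      means = vapply(dt, mean, numeric(1))
      self$state = list(means = means)
      for (col in names(dt)) set(dt, j = col, value = dt[[col]] - means[[col]])
      dt
    },
    # Reuse the stored training means on new data.
    predict_dt = function(dt, levels) {
      means = self$state$means
      for (col in names(dt)) set(dt, j = col, value = dt[[col]] - means[[col]])
      dt
    }
  )
)

task = tsk("iris")
op = PipeOpCenterSketch$new()
centered = op$train(list(task))[[1]]
predicted = op$predict(list(task))[[1]]
```

The state written by train_dt is a list, so the bookkeeping entries described in Section State can be added to it without overwriting the means.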