PipeOp: PipeOp

Description

A PipeOp represents a transformation of a given "input" into a given "output", with two stages: "training" and "prediction". It can be understood as a generalized function that not only has multiple inputs, but also multiple outputs (as well as two stages). The "training" stage is used when training a machine learning pipeline or fitting a statistical model, and the "prediction" stage is then used for making predictions on new data.

To perform training, the $train() function is called; it takes inputs and transforms them, while simultaneously storing information in the PipeOp's $state slot. For prediction, the $predict() function is called, where the $state information can be used to influence the transformation of the new data.
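The two-stage idea can be illustrated without any PipeOp machinery. The following is a minimal sketch in plain R, assuming a hypothetical centering operation (make_centering_op is not part of the package): training stores the mean as state, and prediction reuses that stored state on new data.

```r
# Hypothetical two-stage operation written with a closure; this only
# mimics the $train() / $predict() / $state idea and is not a PipeOp.
make_centering_op = function() {
  state = NULL  # plays the role of the $state slot
  list(
    train = function(x) {
      state <<- mean(x)  # information learned during training
      x - state
    },
    predict = function(x) {
      stopifnot(!is.null(state))  # must be trained before predicting
      x - state                   # reuse the stored state on new data
    }
  )
}

op = make_centering_op()
op$train(c(1, 2, 3))  # -1 0 1
op$predict(c(4, 5))   # 2 3
```

The prediction step never recomputes the mean; it only applies what was learned during training.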

A PipeOp is usually used in a Graph object, a representation of a computational graph. It can have multiple input channels (think of these as multiple arguments to a function, for example when averaging different models) and multiple output channels (a transformation may return different objects, for example different subsets of a Task). The purpose of the Graph is to connect different outputs of some PipeOps to inputs of other PipeOps.

Input and output channel information of a PipeOp is defined in the $input and $output slots; each channel has a name, a required type during training, and a required type during prediction. The $train() and $predict() functions are called with a list argument that has one entry for each declared channel (with one exception, see next paragraph). The list is automatically type-checked for each channel against $input and then passed on to the $train_internal() or $predict_internal() function. There the data is processed and a result list is created. This list is again type-checked against the declared output type of each channel. The length and types of the result list are as declared in $output.
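The check described above can be sketched in a few lines of base R. This is an illustration only, not the package's implementation: check_channels is a hypothetical helper, and a base data.frame stands in for the data.table that the $input slot actually holds.

```r
# Channel table shaped like a PipeOp's $input slot (base data.frame for
# illustration; the real slot is a data.table).
channels = data.frame(
  name = c("input1", "input2"),
  train = c("numeric", "numeric"),
  predict = c("NULL", "NULL"),
  stringsAsFactors = FALSE
)

# Hypothetical helper, not part of the package: check each list entry
# against the class declared for its channel in the given stage column.
check_channels = function(values, channels, stage) {
  stopifnot(length(values) == nrow(channels))
  for (i in seq_along(values)) {
    expected = channels[[stage]][[i]]
    if (!inherits(values[[i]], expected)) {
      stop(sprintf("Channel '%s' expects class '%s'",
        channels$name[[i]], expected))
    }
  }
  # name the entries after their channels before passing them on
  stats::setNames(values, channels$name)
}

checked = check_channels(list(1, 2), channels, "train")
names(checked)
```

The same helper would be applied with stage = "predict" for the prediction stage, and with the output table for the result list.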

A special input channel name is "...", which creates a vararg channel that takes arbitrarily many arguments, all of the same type. If the $input table contains an "..."-entry, then the input given to $train() and $predict() may be longer than the number of declared input channels.
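A vararg channel behaves much like R's own `...` argument. As a rough sketch under the same assumptions as before (a base data.frame in place of the data.table, and a hypothetical helper check_vararg that is not part of the package), a single "..." row can accept more inputs than there are declared channels, as long as all of them have the declared type:

```r
# Channel table with a single vararg entry (illustration only).
vararg_channels = data.frame(
  name = "...", train = "numeric", predict = "numeric",
  stringsAsFactors = FALSE
)

# Hypothetical helper: every input must have the type declared for "...".
check_vararg = function(values, channels, stage) {
  type = channels[[stage]][channels$name == "..."]
  stopifnot(all(vapply(values, inherits, logical(1), what = type)))
  values
}

inputs = check_vararg(list(1, 2, 3), vararg_channels, "train")
length(inputs) > nrow(vararg_channels)   # three inputs, one declared channel
Reduce(`+`, inputs) / length(inputs)     # e.g. averaging all vararg inputs
```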

This class is an abstract base class that all PipeOps being used in a Graph should inherit from, and is not intended to be instantiated.

Format

Abstract R6Class.

Construction

PipeOp$new(id, param_set = ParamSet$new(), param_vals = list(), input, output, packages = character(0))

id :: character(1) Identifier of resulting object. See $id slot.
param_set :: ParamSet | list of expression Parameter space description. This should be created by the subclass and given to super$initialize(). If this is a ParamSet, it is used as the PipeOp's ParamSet directly. Otherwise it must be a list of expressions, e.g. created by alist(), that evaluate to ParamSets. These ParamSets are combined using a ParamSetCollection.
param_vals :: named list List of hyperparameter settings, overwriting the hyperparameter settings given in param_set. The subclass should have its own param_vals parameter and pass it on to super$initialize(). Default list().
input :: data.table with columns name (character), train (character), predict (character) Sets the $input slot of the resulting object; see description there.
output :: data.table with columns name (character), train (character), predict (character) Sets the $output slot of the resulting object; see description there.
packages :: character Set of all required packages for the PipeOp's $train and $predict methods. See $packages slot. Default is character(0).

Internals

PipeOp is an abstract class with abstract functions $train_internal() and $predict_internal(). To create a functional PipeOp class, these two methods must be implemented. Each of these functions receives a named list according to the PipeOp's input channels, and must return a list (names are ignored) with values in the order of output channels in $output. The $train_internal() and $predict_internal() functions should not be called by the user; instead, $train() and $predict() should be used. The most convenient usage is to add the PipeOp to a Graph (possibly as a singleton in that Graph) and to use the Graph's $train() / $predict() methods.

Fields

id :: character ID of the PipeOp. IDs are user-configurable, and IDs of PipeOps must be unique within a Graph. IDs of PipeOps must not be changed once they are part of a Graph; instead, the Graph's $set_names() method should be used.
packages :: character Packages required for the PipeOp. Functions that are not in base R should still be called using :: (or explicitly attached using require()) in $train_internal() and $predict_internal(), but packages declared here are checked before any (possibly expensive) processing has started within a Graph.
param_set :: ParamSet Parameters and parameter constraints. Parameter values that influence the functioning of $train and / or $predict are in the $param_set$values slot; these are automatically checked against parameter constraints in $param_set.
state :: any | NULL Method-dependent state obtained during training step, and usually required for the prediction step. This is NULL if and only if the PipeOp has not been trained. The $state is the only slot that can be reliably modified during $train(), because $train_internal() may theoretically be executed in a different R-session (e.g. for parallelization).
input :: data.table with columns name (character), train (character), predict (character) Input channels of PipeOp. Column name gives the names (and order) of values in the list given to $train() and $predict(). Column train is the (S3) class that an input object must conform to during training, column predict is the (S3) class that an input object must conform to during prediction. Types are checked by the PipeOp itself and do not need to be checked by $train_internal() / $predict_internal() code. A special name is "...", which creates a vararg input channel that accepts a variable number of inputs.
output :: data.table with columns name (character), train (character), predict (character) Output channels of PipeOp, in the order in which they will be given in the list returned by the $train() and $predict() functions. Column train is the (S3) class that an output object must conform to during training, column predict is the (S3) class that an output object must conform to during prediction. The PipeOp checks values returned by $train_internal() and $predict_internal() against these type specifications.
innum :: numeric(1) Number of input channels. This equals nrow($input).
outnum :: numeric(1) Number of output channels. This equals nrow($output).
is_trained :: logical(1) Indicates whether the PipeOp was already trained and can therefore be used for prediction.
hash :: character(1) Checksum calculated on the PipeOp, depending on the PipeOp's class and the slots $id and $param_set (and therefore also $param_set$values). If a PipeOp's functionality may change depending on more than these values, it should override the inherited $hash active binding and calculate the hash as digest(list(super$hash, <OTHER THINGS>), algo = "xxhash64").
.result :: list If the Graph's $keep_results flag is set to TRUE, then the intermediate results of $train() and $predict() are saved to this slot, exactly as they are returned by these functions. This is mainly for debugging purposes and done, if requested, by the Graph backend itself; it should not be done explicitly by $train_internal() or $predict_internal().
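The hashing pattern mentioned for the hash field can be sketched with two small stand-in R6 classes. Base and Derived below are hypothetical and only mimic the idea; they are not PipeOp classes, and the way Base computes its hash is an assumption made for illustration. The sketch assumes the R6 and digest packages are installed.

```r
# Stand-in for a base class whose hash depends on class and id only.
Base = R6::R6Class("Base",
  public = list(
    id = NULL,
    initialize = function(id) {
      self$id = id
    }
  ),
  active = list(
    hash = function() {
      digest::digest(list(class(self), self$id), algo = "xxhash64")
    }
  )
)

# Subclass whose behaviour also depends on `extra`, so `extra` is added
# to the hash on top of the parent's hash.
Derived = R6::R6Class("Derived",
  inherit = Base,
  public = list(
    extra = NULL,
    initialize = function(id, extra) {
      super$initialize(id)
      self$extra = extra
    }
  ),
  active = list(
    hash = function() {
      digest::digest(list(super$hash, self$extra), algo = "xxhash64")
    }
  )
)

a = Derived$new("op", extra = 1)
b = Derived$new("op", extra = 2)
a$hash != b$hash  # same id, different `extra`, so the hashes differ
```

Including super$hash keeps the class, id, and anything else the parent already accounts for in the result, so the subclass only has to list what it adds.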

Methods

train(input) (list) -> named list Train PipeOp on inputs, transform them to output, and store the learned $state. If the PipeOp is already trained, the already present $state is overwritten. The input list is typechecked against the $input train column. The return value is a list with as many entries as $output has rows, with each entry named after the $output name column and with class according to the $output train column.
train_internal(input) (named list) -> list Abstract function that must be implemented by concrete subclasses. $train_internal() is called by $train() after typechecking. It must change the $state value to something non-NULL and return a list of transformed data according to the $output train column. Names of the returned list are ignored. The $train_internal() method should not be called by a user; instead, the $train() method should be used which does some checking and possibly type conversion.
predict(input) (list) -> named list Predict on new data in input, possibly using the stored $state. Input and output are specified by $input and $output in the same way as for $train(), except that the predict column is used for type checking.
predict_internal(input) (named list) -> list Abstract function that must be implemented by concrete subclasses. $predict_internal() is called by $predict() after typechecking and works analogously to $train_internal(). Unlike $train_internal(), $predict_internal() should not modify the PipeOp in any way. Just like $train_internal(), $predict_internal() should not be called by a user; instead, the $predict() method should be used.
print() () -> NULL Prints the PipeOp's most salient information: $id, $is_trained, $param_set$values, $input and $output.

Examples

# NOT RUN {
library("mlr3pipelines")  # provides the PipeOp base class

# example (bogus) PipeOp that returns the sum of two numbers during $train()
# as well as a letter of the alphabet corresponding to that sum during $predict().

PipeOpSumLetter = R6::R6Class("sumletter",
  inherit = PipeOp,  # inherit from PipeOp
  public = list(
    initialize = function(id = "posum", param_vals = list()) {
      super$initialize(id, param_vals = param_vals,
        # declare "input" and "output" during construction here
        # training takes two 'numeric' and returns a 'numeric';
        # prediction takes 'NULL' and returns a 'character'.
        input = data.table::data.table(name = c("input1", "input2"),
          train = "numeric", predict = "NULL"),
        output = data.table::data.table(name = "output",
          train = "numeric", predict = "character")
      )
    },

    # PipeOp deriving classes must implement train_internal and
    # predict_internal; each taking an input list and returning
    # a list as output.
    train_internal = function(input) {
      sum = input[[1]] + input[[2]]
      self$state = sum
      list(sum)
    },

    predict_internal = function(input) {
      list(letters[self$state])
    }
  )
)
posum = PipeOpSumLetter$new()

print(posum)

posum$train(list(1, 2))
# note the name 'output' is the name of the output channel specified
# in the $output data.table.

posum$predict(list(NULL, NULL))
# }
