Pre-process the Inputs to a Spark ML Routine

Pre-process / normalize the inputs typically passed to a Spark ML routine.

ml_prepare_response_features_intercept(x = NULL, response, features,
  intercept, envir = parent.frame(),
  categorical.transformations = new.env(parent = emptyenv()),
  ml.options = ml_options())

ml_prepare_features(x, features, envir = parent.frame(), ml.options = ml_options())


x
An object coercible to a Spark DataFrame (typically, a tbl_spark).


response
The name of the response vector (as a length-one character vector), or a formula giving a symbolic description of the model to be fitted. When response is a formula, it takes precedence over the other parameters and is used to set the response, features, and intercept parameters (if available). Currently, only simple linear combinations of existing parameters are supported; e.g. response ~ feature1 + feature2 + .... The intercept term can be omitted by using - 1 in the model formula.
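As a rough illustration of how a formula such as response ~ feature1 + feature2 - 1 breaks down into a response name, feature names, and an intercept flag, here is a minimal base R sketch. The helper name split_model_formula is hypothetical and this is not sparklyr's internal implementation; it relies only on terms() from the stats package.

```r
# Hypothetical helper (not part of sparklyr): decompose a simple
# additive formula into its response, features, and intercept flag.
split_model_formula <- function(formula) {
  tt <- terms(formula)
  list(
    # the left-hand side of a two-sided formula is its second element
    response  = deparse(formula[[2]]),
    # term labels are the right-hand-side terms, e.g. "wt" and "cyl"
    features  = attr(tt, "term.labels"),
    # the "intercept" attribute is 1 when present, 0 after `- 1`
    intercept = attr(tt, "intercept") == 1
  )
}

parsed <- split_model_formula(mpg ~ wt + cyl - 1)
str(parsed)
```

Because the formula ends in - 1, the intercept element of the result is FALSE.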


features
The names of the features (terms) to use for the model fit.


intercept
Boolean; should the model be fit with an intercept term?


envir
The R environment in which the response, features and intercept bindings should be mutated (typically, the parent frame).


categorical.transformations
An R environment used to record which categorical variables were binarized in this procedure. Categorical variables that are included in the model formula will be transformed into binary variables, and the generated mappings will be stored in this environment.
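The idea of binarizing a categorical variable and recording the mapping in an environment can be sketched in base R on an ordinary data frame. The helper binarize_column, the generated column names, and the choice to drop the first level as the reference category are all assumptions of this sketch, not a description of what sparklyr does on a Spark DataFrame.

```r
# Hypothetical helper (not sparklyr's implementation): expand one
# categorical column into 0/1 indicator columns, dropping the first
# level as the reference, and record the mapping in an environment.
binarize_column <- function(data, column, record) {
  values <- as.factor(data[[column]])
  lvls <- levels(values)
  new_names <- character(0)
  for (lvl in lvls[-1]) {
    new_name <- paste(column, lvl, sep = "_")
    data[[new_name]] <- as.integer(values == lvl)
    new_names <- c(new_names, new_name)
  }
  # store the mapping so the caller can inspect it afterwards
  assign(column, list(levels = lvls, columns = new_names), envir = record)
  data
}

record <- new.env(parent = emptyenv())
df <- data.frame(species = c("a", "b", "c", "a"), stringsAsFactors = FALSE)
out <- binarize_column(df, "species", record)
print(out)
print(get("species", envir = record)$columns)
```

Passing the environment in, rather than returning the mapping, lets the caller collect the mappings for several columns in one place.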


ml.options
Optional arguments, used to affect the model generated. See ml_options for more details.


Pre-processing of these inputs typically involves:

  1. Handling the case where response is itself a formula describing the model to be fit, thereby extracting the names of the response and features to be used,

  2. Splitting categorical features into dummy variables (so they can easily be accommodated and specified in the underlying Spark ML model fit),

  3. Mutating the associated variables in the specified environment.

Please take heed of the last point: while mutating the caller's variables is useful in practice, the behavior will be very surprising if you are not expecting it.
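The mutation described in the last point can be reproduced in plain R with assign() and parent.frame(). The following is a minimal sketch under stated assumptions: prepare_features_sketch and fit_wrapper are hypothetical names, and the code only mimics the side effect; it is not sparklyr's implementation and does not touch Spark.

```r
# Hypothetical helper (not part of sparklyr): resolve a one-sided
# formula to a character vector and overwrite the caller's `features`
# binding, mimicking the side effect described above.
prepare_features_sketch <- function(features, envir = parent.frame()) {
  if (inherits(features, "formula")) {
    features <- attr(terms(features), "term.labels")
  }
  # the side effect: rebind `features` in the caller's environment
  assign("features", features, envir = envir)
  invisible(features)
}

fit_wrapper <- function(features) {
  prepare_features_sketch(features)
  # `features` is now a character vector, no longer a formula
  features
}

result <- fit_wrapper(~ x1 + x2 + x3)
print(result)
```

The caller never assigns to features itself, yet its value changes after the call, which is exactly the kind of surprise the note above warns about.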

  • ml_prepare_response_features_intercept
  • ml_prepare_inputs
  • ml_prepare_features
# note that ml_prepare_features, by default, mutates the 'features'
# binding in the same environment in which the function was called
ml_prepare_features(features = ~ x1 + x2 + x3)
print(features) # c("x1", "x2", "x3")
Documentation reproduced from package sparklyr, version 0.6.4, License: Apache License 2.0 | file LICENSE
