sdf_create_dummy_variables: Create Dummy Variables

Description

Given a column in a Spark DataFrame, generate a new Spark DataFrame containing dummy variable columns.

Usage

sdf_create_dummy_variables(x, input, reference = NULL, labels = list(), envir = new.env(parent = emptyenv()))

Arguments

x

An object coercible to a Spark DataFrame (typically, a tbl_spark).

input

The name of the input column.

reference

The reference label. The dummy variable for this label is omitted from the output (to avoid perfect multicollinearity if all dummy variables were used in a model fit). To generate dummy variables for all labels, set this to NULL, which is the default.

labels

An optional R list, mapping values in the input column to the column names to be assigned to the associated dummy variables.

envir

An optional R environment; when provided, it is filled with auxiliary information about the generated columns. See Auxiliary Information for details.

Auxiliary Information

The envir argument can be used as a mechanism for returning optional information. Currently, the following pieces are returned:

levels:

The set of unique values discovered within the input column.

columns:

The column names generated.

If the envir argument is supplied, the names of any dummy variables generated will be included, under the labels key.
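As a minimal sketch of how the envir argument might be used, the helper below passes a fresh environment and reads it back afterwards. The helper name dummy_info and the table name iris_dummy_demo are hypothetical; the sketch assumes that sc is an open connection created with sparklyr::spark_connect() and that the installed sparklyr version exports sdf_create_dummy_variables with the signature shown under Usage.

```r
# Hypothetical helper (not part of sparklyr). `sc` is assumed to be an open
# Spark connection, e.g. from sparklyr::spark_connect(master = "local").
dummy_info <- function(sc) {
  # Copy a small local data set to Spark; the table name is illustrative.
  iris_tbl <- sparklyr::sdf_copy_to(sc, iris, name = "iris_dummy_demo",
                                    overwrite = TRUE)

  # A fresh environment that the function can fill with auxiliary information.
  info <- new.env(parent = emptyenv())

  dummies <- sparklyr::sdf_create_dummy_variables(
    iris_tbl,
    input = "Species",
    envir = info
  )

  # ls(info) lists whichever keys were actually populated, so the caller does
  # not need to assume a particular set of names in advance.
  list(
    data    = dummies,
    keys    = ls(info),
    levels  = info$levels,
    columns = info$columns
  )
}
```

Calling dummy_info(sc) would return the new Spark DataFrame together with whatever the function recorded in the environment; inspecting the keys element is a cautious way to check which entries are available in a given sparklyr version.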

Details

The dummy variables are generated in a manner similar to model.matrix, where each categorical variable is expanded into a set of binary (dummy) variables. These dummy variables can then be used as predictors in the regression routines provided by sparklyr.
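The model.matrix analogy can be illustrated in base R alone, without a Spark connection. The sketch below uses a made-up data frame (the colour column and its values are assumptions for illustration) and shows both the full expansion and the effect of omitting a reference level. It demonstrates the general technique only, not the Spark implementation.

```r
# Base R only: illustrates model.matrix-style expansion on a local data frame.
df <- data.frame(colour = factor(c("red", "green", "blue", "green")))

# Full expansion: dropping the intercept yields one 0/1 column per level.
full <- model.matrix(~ colour - 1, data = df)
print(colnames(full))

# Every row has exactly one 1, so the columns sum to a constant. This is the
# perfect multicollinearity that arises when all levels are kept alongside
# an intercept.
print(rowSums(full))

# Treatment coding: the first level ("blue") serves as the reference and has
# no column of its own; the intercept column is removed for comparison.
reduced <- model.matrix(~ colour, data = df)[, -1, drop = FALSE]
print(colnames(reduced))
```

Here the reduced matrix plays the role of the output obtained when a reference label is supplied, while the full matrix corresponds to leaving reference as NULL.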