ml_create_dummy_variables
From sparklyr v0.3.5
by Javier Luraschi
Create Dummy Variables
Given a column in a Spark DataFrame, generate a new Spark DataFrame containing dummy variable columns.
Usage
ml_create_dummy_variables(x, input, reference = NULL, levels = NULL, labels = NULL, envir = new.env(parent = emptyenv()))
Arguments
- x
- An object coercible to a Spark DataFrame (typically, a tbl_spark).
- input
- The name of the input column.
- reference
- The reference label. The dummy variable for this label is omitted when generating dummy variables (to avoid perfect multicollinearity if all dummy variables were to be used in the model fit); to generate dummy variables for all levels, this can be explicitly set to NULL.
- levels
- The set of levels for which dummy variables should be generated. By default, constructs one variable for each unique value occurring in the column specified by input.
- labels
- An optional R list, mapping values in the input column to column names to be assigned to the associated dummy variable.
- envir
- An optional R environment; when provided, it will be filled with useful auxiliary information. See Auxiliary Information for more information.
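To make the roles of reference, levels, and labels concrete, here is a minimal sketch on an ordinary local data frame. The helper create_dummies_local is hypothetical and is not sparklyr's implementation; in particular, the default "<input>_<level>" column naming is an assumption made for illustration only.

```r
# Hypothetical local stand-in for illustration; NOT sparklyr's implementation.
create_dummies_local <- function(df, input, reference = NULL,
                                 levels = NULL, labels = NULL) {
  values <- as.character(df[[input]])
  # Default: one dummy variable per unique value in the input column.
  if (is.null(levels)) levels <- sort(unique(values))
  # Omit the reference level, if one was given.
  if (!is.null(reference)) levels <- setdiff(levels, reference)
  for (level in levels) {
    # Assumed naming scheme: "<input>_<level>" unless labels overrides it.
    name <- labels[[level]]
    if (is.null(name)) name <- paste(input, level, sep = "_")
    df[[name]] <- as.numeric(values == level)
  }
  df
}

df <- data.frame(color = c("red", "green", "blue", "green"),
                 stringsAsFactors = FALSE)
out <- create_dummies_local(df, "color", reference = "blue",
                            labels = list(red = "is_red"))
print(out)
```

With reference = "blue", no column is created for "blue"; the labels entry renames the "red" dummy variable, while "green" falls back to the assumed default name.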
Details
The dummy variables are generated by a mechanism similar to model.matrix, where categorical variables are expanded into a set of binary (dummy) variables. These dummy variables can be used to include categorical variables in the various regression routines provided by sparklyr.
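For comparison, the base R behaviour that this function is modelled on can be seen with model.matrix on a small local data frame. This sketch uses only base R and made-up data; it shows the expansion of a factor into binary columns, not the Spark code path.

```r
# A small local data frame with one categorical column (illustrative data).
df <- data.frame(color = factor(c("red", "green", "blue", "green")))

# model.matrix() expands the factor into 0/1 columns. The first level
# ("blue", alphabetically) acts as the reference and gets no column.
mm <- model.matrix(~ color, data = df)
print(mm)

# Dropping the intercept yields one column per level instead.
mm_all <- model.matrix(~ color - 1, data = df)
print(colnames(mm_all))
```

The first call mirrors supplying a reference label; the second mirrors generating a dummy variable for every level.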
Auxiliary Information
The envir argument can be used as a mechanism for returning optional information. Currently, the following pieces are returned:
- levels
- The set of unique values discovered within the input column.
- columns
- The column names generated.
If the envir argument is supplied, the names of any dummy variables generated will be included, under the labels key.
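The pattern of returning extra information through a caller-supplied environment can be sketched in plain R as follows. The function dummy_names is hypothetical and only illustrates the mechanism; the "<input>_<level>" naming is an assumption, and the keys written here follow the levels and columns entries listed above.

```r
# Hypothetical helper showing the pattern; not part of sparklyr.
dummy_names <- function(values, input,
                        envir = new.env(parent = emptyenv())) {
  levels <- sort(unique(values))
  columns <- paste(input, levels, sep = "_")  # assumed naming scheme
  # Environments are mutable, so these assignments are visible to the caller.
  envir$levels <- levels
  envir$columns <- columns
  invisible(columns)
}

# The caller creates the environment, passes it in, and inspects it afterwards.
aux <- new.env(parent = emptyenv())
dummy_names(c("red", "green", "blue", "green"), "color", envir = aux)
print(aux$levels)
print(aux$columns)
```

With the real function, the same idea would apply by passing envir = aux in the call shown under Usage and reading the entries from aux afterwards.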