
This routine prepares a Spark DataFrame for use by Spark ML routines.
ml_prepare_dataframe(x, features, response = NULL, ...,
ml.options = ml_options(), envir = new.env(parent = emptyenv()))
An object coercible to a Spark DataFrame (typically, a tbl_spark).
The name of features (terms) to use for the model fit.
The name of the response vector (as a length-one character
vector), or a formula giving a symbolic description of the model to be
fitted. When response is a formula, it is used in preference to other
parameters to set the response, features, and intercept
parameters (if available). Currently, only simple linear combinations of
existing parameters are supported; e.g. response ~ feature1 + feature2 + ...
The intercept term can be omitted by using - 1 in the model fit.
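To make the formula semantics concrete, the following base R sketch shows how a formula of the supported form splits into a response, a set of features, and an intercept flag. It uses only base R's terms() and is an illustration of the notation, not the parsing code sparklyr itself runs; the column names mpg, wt and cyl are placeholders.

```r
# Base R only: inspect how a model formula decomposes.
# 'mpg', 'wt' and 'cyl' are placeholder names chosen for illustration.
f <- mpg ~ wt + cyl - 1

tt <- terms(f)

# Left-hand side of the formula: the response.
response <- as.character(f[[2]])

# Right-hand side terms: the features.
features <- attr(tt, "term.labels")

# '- 1' removes the intercept, so this flag is FALSE here.
intercept <- attr(tt, "intercept") == 1

response
features
intercept
```

Dropping the - 1 from the formula would leave the intercept flag set, with the same response and features.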
Optional arguments. The data argument can be used to
specify the data to be used when x is a formula; this allows calls
of the form ml_linear_regression(y ~ x, data = tbl), and is
especially useful in conjunction with do.
Optional arguments, used to affect the model generated. See
ml_options for more details.
An R environment -- when supplied, it will be filled with metadata describing the transformations that have taken place.
Spark DataFrames are prepared through the following transformations:
All specified columns are transformed into a numeric data type
(using a simple cast for integer / logical columns, and
ft_string_indexer for strings),
The ft_vector_assembler is used to combine the
specified features into a single 'feature' vector, suitable
for use with Spark ML routines.
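The two transformations above can be mimicked on a local data.frame to see what they do to the data. This is a Spark-free analogue written in base R, not sparklyr's implementation: the helpers string_index and to_numeric_columns are hypothetical, and the frequency-based, zero-based indexing is an assumption modelled on the default behaviour of Spark's StringIndexer.

```r
# Local, Spark-free analogue of the two preparation steps (illustration only).

# Hypothetical helper: index strings by descending frequency, starting at 0.
string_index <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  as.numeric(match(x, names(counts)) - 1L)
}

# Hypothetical helper: make every listed column numeric.
# Strings are indexed; integer / logical columns are simply cast.
to_numeric_columns <- function(df, columns) {
  for (col in columns) {
    value <- df[[col]]
    df[[col]] <- if (is.character(value)) {
      string_index(value)
    } else {
      as.numeric(value)
    }
  }
  df
}

local_df <- data.frame(
  count = c(1L, 2L, 3L, 4L),
  flag  = c(TRUE, FALSE, TRUE, TRUE),
  kind  = c("b", "a", "b", "b"),
  stringsAsFactors = FALSE
)

# Step 1 analogue: every column becomes numeric.
prepared <- to_numeric_columns(local_df, c("count", "flag", "kind"))

# Step 2 analogue: one numeric row per observation, playing the role of
# the single assembled feature vector.
feature_matrix <- as.matrix(prepared[, c("count", "flag", "kind")])
```

In the real routine the assembled features live in a single vector-typed column of the Spark DataFrame rather than in a local matrix.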
After calling this function, the envir environment (when supplied)
will be populated with a set of variables:
features: The name of the generated features vector.
response: The name of the generated response vector.
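The mechanism behind the envir argument is ordinary R environment semantics: an environment is passed by reference, so a function can record values in it that the caller reads afterwards. The sketch below shows that pattern with a hypothetical stand-in function, prepare_stub; the names it records are placeholders, not the column names sparklyr actually generates.

```r
# Hypothetical stand-in for a preparation routine. It only records
# placeholder names in the supplied environment.
prepare_stub <- function(envir = new.env(parent = emptyenv())) {
  assign("features", "features_placeholder", envir = envir)
  assign("response", "response_placeholder", envir = envir)
  invisible(envir)
}

# The caller creates the environment and keeps a reference to it.
envir <- new.env(parent = emptyenv())
prepare_stub(envir = envir)

# After the call, the recorded variables are visible to the caller.
sort(ls(envir))
envir$features
envir$response
```

Using emptyenv() as the parent keeps lookups such as envir$features from falling through to unrelated variables in enclosing environments.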
# NOT RUN {
# example of how 'ml_prepare_dataframe' might be used to invoke
# Spark's LinearRegression routine from the 'ml' package
# (assumes 'sc', 'df', 'features' and 'response' are already defined)
envir <- new.env(parent = emptyenv())
tdf <- ml_prepare_dataframe(df, features, response, envir = envir)

lr <- invoke_new(
  sc,
  "org.apache.spark.ml.regression.LinearRegression"
)

# use generated 'features', 'response' vector names in model fit
model <- lr %>%
  invoke("setFeaturesCol", envir$features) %>%
  invoke("setLabelCol", envir$response)
# }