ml_prepare_dataframe
From sparklyr v0.2.31
by Javier Luraschi
Prepare a Spark DataFrame for Spark ML Routines
This routine prepares a Spark DataFrame for use by Spark ML routines.
Usage
ml_prepare_dataframe(x, features, response = NULL, ..., envir = new.env(parent = emptyenv()))
Arguments
- x
- An object coercible to a Spark DataFrame (typically, a tbl_spark).
- features
- The names of the features (terms) to use for the model fit.
- response
- The name of the response vector (as a length-one character vector), or a formula giving a symbolic description of the model to be fitted. When response is a formula, it is used in preference to other parameters to set the response, features, and intercept parameters (if available). Currently, only simple linear combinations of existing parameters are supported; e.g. response ~ feature1 + feature2 + ... The intercept term can be omitted by using - 1 in the model fit.
- ...
- Optional arguments; currently unused.
- envir
- An R environment; when supplied, it will be filled with metadata describing the transformations that have taken place.
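To make the formula form of response concrete, the following base-R sketch shows how a formula such as mpg ~ wt + cyl - 1 breaks down into a response name, feature names, and an intercept flag. The helper parse_model_formula is hypothetical and not part of sparklyr; it only illustrates the idea and may differ from how the package handles formulas internally.

```r
# Hypothetical helper (not part of sparklyr): split a simple model formula
# into the response name, the feature names, and an intercept flag.
parse_model_formula <- function(formula) {
  tt <- stats::terms(formula)
  has_response <- attr(tt, "response") == 1
  list(
    # Left-hand side of the formula, if one was given
    response  = if (has_response) deparse(formula[[2]]) else NULL,
    # Right-hand side terms, e.g. "wt" and "cyl"
    features  = attr(tt, "term.labels"),
    # FALSE when the formula contains "- 1"
    intercept = attr(tt, "intercept") == 1
  )
}

parsed <- parse_model_formula(mpg ~ wt + cyl - 1)
parsed$response   # "mpg"
parsed$features   # "wt" "cyl"
parsed$intercept  # FALSE
```

Because the sketch relies only on stats::terms, it runs without a Spark connection.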
Details
Spark DataFrames are prepared through the following transformations:
- All specified columns are transformed into a numeric data type (using a simple cast for integer / logical columns, and ft_string_indexer for strings),
- The ft_vector_assembler is used to combine the specified features into a single 'feature' vector, suitable for use with Spark ML routines.
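The two transformations above can be pictured with a local, base-R analogue that works on an ordinary data.frame. This is only an illustration under stated assumptions: to_numeric_column and prepare_local are hypothetical names, the real routine operates on Spark DataFrames, and the alphabetical tie-breaking used when indexing strings is a choice made for this sketch.

```r
# Local, base-R analogue of the preparation steps (illustration only;
# the real routine operates on Spark DataFrames).
to_numeric_column <- function(col) {
  if (is.character(col) || is.factor(col)) {
    col <- as.character(col)
    # Index labels by descending frequency, starting at 0
    # (ties broken alphabetically; an assumption of this sketch)
    counts <- table(col)
    labels <- names(counts)[order(-as.integer(counts), names(counts))]
    return(as.numeric(match(col, labels) - 1L))
  }
  as.numeric(col)  # simple cast for integer / logical columns
}

prepare_local <- function(df, features) {
  numeric_cols <- lapply(df[features], to_numeric_column)
  # Combine the numeric columns so that each row is one feature vector
  do.call(cbind, numeric_cols)
}

df <- data.frame(
  n = c(1L, 2L, 3L),
  flag = c(TRUE, FALSE, TRUE),
  colour = c("red", "blue", "red"),
  stringsAsFactors = FALSE
)
m <- prepare_local(df, c("n", "flag", "colour"))
m  # a 3 x 3 numeric matrix, one row per observation
```

Here "red" is the most frequent label, so it is indexed as 0 and "blue" as 1.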
After calling this function, the envir environment (when supplied) will be populated with a set of variables:
- features
- The name of the generated features vector.
- response
- The name of the generated response vector.
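The pattern of passing an environment in and reading metadata back out can be sketched in plain R. The function record_metadata and the column names "features_vec" and "label_idx" are hypothetical stand-ins chosen for illustration; they are not the names sparklyr generates.

```r
# Hypothetical stand-in for a routine that records metadata in 'envir'
record_metadata <- function(features, response = NULL,
                            envir = new.env(parent = emptyenv())) {
  # Store the generated column names so the caller can read them afterwards
  assign("features", features, envir = envir)
  assign("response", response, envir = envir)
  invisible(envir)
}

envir <- new.env(parent = emptyenv())
record_metadata("features_vec", "label_idx", envir = envir)

envir$features  # "features_vec"
envir$response  # "label_idx"
```

Because environments have reference semantics in R, the caller sees the assigned variables without needing the function's return value.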
Examples
## Not run:
# # example of how 'ml_prepare_dataframe' might be used to invoke
# # Spark's LinearRegression routine from the 'ml' package
# envir <- new.env(parent = emptyenv())
# tdf <- ml_prepare_dataframe(df, features, response, envir = envir)
#
# lr <- invoke_new(
# sc,
# "org.apache.spark.ml.regression.LinearRegression"
# )
#
# # use generated 'features', 'response' vector names in model fit
# model <- lr %>%
# invoke("setFeaturesCol", envir$features) %>%
# invoke("setLabelCol", envir$response)
# ## End(Not run)