Applies an R function to a Spark object (typically, a Spark DataFrame).
spark_apply(x, f, columns = colnames(x), memory = TRUE, group_by = NULL,
packages = TRUE, context = NULL, ...)
x: An object (usually a spark_tbl) coercible to a Spark DataFrame.
f: A function that transforms a data frame partition into a data frame. The function f has signature f(df, context, group1, group2, ...), where df is a data frame with the data to be processed, context is an optional object passed as the context parameter, and group1 to groupN contain the values of the group_by columns. When group_by is not specified, f takes only one argument.
columns: A vector of column names or a named vector of column types for the transformed object. Defaults to the names from the original object and adds indexed column names when not enough columns are specified.
memory: Boolean; should the table be cached into memory?
group_by: Column name used to group by data frame partitions.
packages: Boolean to distribute .libPaths() packages to each node, a list of packages to distribute, or a package bundle created with spark_apply_bundle().
For clusters using Yarn cluster mode, packages can point to a package bundle created using spark_apply_bundle() and made available as a Spark file using config$sparklyr.shell.files. For clusters using Livy, packages can be manually installed on the driver node.
For offline clusters where available.packages() is not available, manually download the packages database from https://cran.r-project.org/web/packages/packages.rds and set Sys.setenv(sparklyr.apply.packagesdb = "<path-to-rds>"). Otherwise, all packages will be used by default.
context: Optional object to be serialized and passed back to f().
...: Optional arguments; currently unused.
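For the offline-cluster case described for the packages argument, the steps might look like the sketch below. The destination path is a placeholder assumption, and the download is left commented out because it has to run on a machine with internet access before the file is copied to the offline environment.

```r
# Hypothetical location for the packages database; adjust to your setup.
packages_db <- file.path(tempdir(), "packages.rds")

# Run this step where CRAN is reachable, then copy the file across.
# download.file("https://cran.r-project.org/web/packages/packages.rds",
#               destfile = packages_db, mode = "wb")

# Point sparklyr at the local copy before calling spark_apply().
Sys.setenv(sparklyr.apply.packagesdb = packages_db)
```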
spark_config() settings can be specified to change the workers' environment.
For instance, to set additional environment variables on each worker node, use the sparklyr.apply.env.* config; to launch workers without --vanilla, set sparklyr.apply.options.vanilla to FALSE; to run a custom script before launching Rscript, use sparklyr.apply.options.rscript.before.
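As a rough illustration of the settings named above, a configuration could be assembled as follows. The variable name MY_SETTING and the script path are hypothetical placeholders, and the exact value format that sparklyr.apply.options.rscript.before expects is an assumption here rather than something this page states.

```r
library(sparklyr)

config <- spark_config()

# Expose an environment variable to each worker (MY_SETTING is a made-up name).
config$sparklyr.apply.env.MY_SETTING <- "some-value"

# Launch worker R sessions without the --vanilla flag.
config$sparklyr.apply.options.vanilla <- FALSE

# Run a custom script before Rscript starts (hypothetical path).
config$sparklyr.apply.options.rscript.before <- "/path/to/setup.sh"

sc <- spark_connect(master = "local", config = config)
```

Inside a function passed to spark_apply(), the variable would then presumably be readable with Sys.getenv("MY_SETTING").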
# NOT RUN {
library(sparklyr)
sc <- spark_connect(master = "local")
# create a Spark data frame with 10 elements, then multiply each value by 10 in R
sdf_len(sc, 10) %>% spark_apply(function(df) df * 10)
# }
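The signature described for f, f(df, context, group1, ...), can be tried out locally before involving a cluster. The sketch below uses only base R and the built-in iris data: it splits the data by one column and calls a hypothetical function scale_petals once per group, as a stand-in for what a grouped call would do. It is not sparklyr's actual dispatch code, and the function name and the context contents are assumptions made for illustration.

```r
# Hypothetical transformation following the documented signature:
# df is one partition, context is the shared object, and species carries
# the value of the grouping column for that partition.
scale_petals <- function(df, context, species) {
  data.frame(
    species = as.character(species),
    mean_petal = mean(df$Petal.Length) * context$factor,
    stringsAsFactors = FALSE
  )
}

# Local stand-in for a grouped apply: split by the grouping column and
# call the function once per group. No Spark connection is involved.
parts <- split(iris, iris$Species)
result <- do.call(rbind, lapply(names(parts), function(g) {
  scale_petals(parts[[g]], context = list(factor = 10), species = g)
}))
rownames(result) <- NULL

print(result)
```

With a Spark table, a function of this shape would be passed to spark_apply() together with group_by and context, with column names adjusted to whatever the Spark table uses.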