Applies an R function to a Spark object (typically, a Spark DataFrame).
spark_apply(x, f, columns = colnames(x), memory = TRUE, group_by = NULL,
  packages = TRUE, ...)
x: An object (usually a spark_tbl) coercible to a Spark DataFrame.
f: A function that transforms a data frame partition into a data frame. The function f has signature f(df, group1, group2, ...), where df is a data frame with the data to be processed and group1 through groupN contain the values of the group_by columns. When group_by is not specified, f takes only one argument; see the first sketch after this argument list.
columns: A vector of column names, or a named vector of column types, for the transformed object. Defaults to the column names of the original object; indexed column names are added when not enough columns are specified (also shown in the first sketch below).
memory: Boolean; should the table be cached into memory?
group_by: Column name used to group the data frame partitions.
packages: Boolean to distribute .libPaths() packages to each node, a list of packages to distribute, or a package bundle created with spark_apply_bundle().
For clusters using Livy or YARN cluster mode, packages must point to a package bundle path created using spark_apply_bundle() and made available as a Spark file. For YARN cluster mode, the bundle can be registered as a Spark file using config$sparklyr.shell.files. For Livy, the bundle must be copied into the cluster and then made available using invoke(spark_context(sc), "addFile", "<path-to-file>"); see the second sketch below.
For offline clusters where available.packages() is not available, manually download the packages database from https://cran.r-project.org/web/packages/packages.rds and set Sys.setenv(sparklyr.apply.packagesdb = "<path-to-rds>"); the final sketch below shows this setup. Otherwise, all packages will be used by default.
...: Optional arguments; currently unused.
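
For illustration, a minimal sketch of f, group_by, and columns, assuming a local connection and the built-in iris dataset (the connection, table, and output column names here are assumptions, not part of this page):

library(sparklyr)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)

# group_by unspecified: f takes only the partition data frame
# (copy_to replaces the dots in iris column names with underscores)
spark_apply(iris_tbl, function(df) df[df$Petal_Length > 1.5, ])

# columns renames the transformed output ("petal_area" is hypothetical)
spark_apply(
  iris_tbl,
  function(df) data.frame(area = df$Petal_Length * df$Petal_Width),
  columns = "petal_area"
)

# group_by = "Species": f is called once per group and also receives the
# group value, matching the f(df, group1, ...) signature above
spark_apply(
  iris_tbl,
  function(df, species) data.frame(n = nrow(df)),
  group_by = "Species"
)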
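
A second sketch covers the package-bundle workflow for YARN cluster mode and Livy; the master URL is an assumption, and the config and addFile steps are alternatives for the two cluster types, not sequential steps:

# create a bundle of the packages in .libPaths()
bundle <- spark_apply_bundle()

# YARN cluster mode: register the bundle as a Spark file via the config
config <- spark_config()
config$sparklyr.shell.files <- bundle
sc <- spark_connect(master = "yarn-cluster", config = config)

# Livy: after copying the bundle into the cluster, register it manually
invoke(spark_context(sc), "addFile", "<path-to-file>")

# point packages at the bundle path so workers install from it
spark_apply(iris_tbl, function(df) df, packages = bundle)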
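
Finally, a sketch of the offline-cluster setup: download the packages database on a machine with internet access, then point sparklyr at the local copy before calling spark_apply (the file location is a placeholder):

# on a machine with internet access
download.file(
  "https://cran.r-project.org/web/packages/packages.rds",
  destfile = "packages.rds"
)

# on the offline cluster, before calling spark_apply()
Sys.setenv(sparklyr.apply.packagesdb = "packages.rds")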