spark_apply: Apply an R Function in Spark

Description

Applies an R function to a Spark object (typically, a Spark DataFrame).

Usage

spark_apply(x, f, columns = colnames(x), memory = TRUE, group_by = NULL,
  packages = TRUE, context = NULL, ...)

Arguments

An object (usually a spark_tbl) coercable to a Spark DataFrame.

A function that transforms a data frame partition into a data frame. The function f has signature f(df, context, group1, group2, ...) where df is a data frame with the data to be processed, context is an optional object passed as the context parameter and group1 to groupN contain the values of the group_by values. When group_by is not specified, f takes only one argument.

columns

A vector of column names or a named vector of column types for the transformed object. Defaults to the names from the original object and adds indexed column names when not enough columns are specified.

memory

Boolean; should the table be cached into memory?

group_by

Column name used to group by data frame partitions.

packages

Boolean to distribute .libPaths() packages to each node, a list of packages to distribute, or a package bundle created with spark_apply_bundle().

For clusters using Yarn cluster mode, packages can point to a package bundle created using spark_apply_bundle() and made available as a Spark file using config$sparklyr.shell.files. For clusters using Livy, packages can be manually installed on the driver node.

For offline clusters where available.packages() is not available, manually download the packages database from https://cran.r-project.org/web/packages/packages.rds and set Sys.setenv(sparklyr.apply.packagesdb = "<pathl-to-rds>"). Otherwise, all packages will be used by default.

context

Optional object to be serialized and passed back to f().

...

Optional arguments; currently unused.

Configuration

spark_config() settings can be specified to change the workers environment.

For instance, to set additional environment variables to each worker node use the sparklyr.apply.env.* config, to launch workers without --vanilla use sparklyr.apply.options.vanilla set to FALSE, to run a custom script before launching Rscript use sparklyr.apply.options.rscript.before.

Examples

Run this code

# NOT RUN {
library(sparklyr)
sc <- spark_connect(master = "local")

# creates an Spark data frame with 10 elements then multiply times 10 in R
sdf_len(sc, 10) %>% spark_apply(function(df) df * 10)

# }
# NOT RUN {
# }

Run the code above in your browser using DataLab