h2o.ddply: Split H2O dataset, apply function, and return results

Description

For each subset of an H2O dataset, apply a user-specified function, then combine the results.

Usage

h2o.ddply(.data, .variables, .fun = NULL, ..., .progress = "none")

Arguments

.data

An H2OParsedData object to be processed.

.variables

Variables to split .data by, either the indices or names of a set of columns.

.fun

Function to apply to each subset grouping. Must have been pushed to H2O using h2o.addFunction.

...

Additional arguments passed on to .fun (currently unimplemented).

.progress

Name of the progress bar to use (currently unimplemented).

Value

An H2OParsedData object containing the results from the split/apply operation, arranged row-by-row.

Details

This is an extension of the plyr library's ddply function to datasets loaded into H2O.
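
To make the split/apply/combine semantics concrete, the following is a minimal local sketch in base R using the built-in iris data frame. It does not call H2O and is not how h2o.ddply is implemented; the names groups, means, and mean_sepal_len are illustrative assumptions. It only shows the kind of per-group result the operation is meant to produce.

```r
# Local base-R illustration only: runs without an H2O cluster.
# Same shape of function as in the Examples section: mean of the first column.
fun <- function(df) { sum(df[, 1], na.rm = TRUE) / nrow(df) }

# Split: one data frame per level of the grouping column
groups <- split(iris, iris$Species)

# Apply: evaluate fun on each subset, expecting one number per group
means <- vapply(groups, fun, numeric(1))

# Combine: one row per group
res <- data.frame(Species = names(means), mean_sepal_len = unname(means))
print(res)
```

With h2o.ddply, the splitting and function evaluation are instead performed on the data held in H2O, which is why .fun must first be pushed with h2o.addFunction.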

References

Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. http://www.jstatsoft.org/v40/i01/.

Examples

library(h2o)
localH2O = h2o.init()

# Import iris dataset to H2O
irisPath = system.file("extdata", "iris_wheader.csv", package = "h2o")
iris.hex = h2o.importFile(localH2O, path = irisPath, key = "iris.hex")

# Define a function that takes the mean of the first column (sepal_len),
# then push it to H2O so it can be used by h2o.ddply
fun = function(df) { sum(df[,1], na.rm = TRUE)/nrow(df) }
h2o.addFunction(localH2O, fun)

# Apply function to groups by class of flower
# uses h2o's ddply, since iris.hex is an H2OParsedData object
res = h2o.ddply(iris.hex, "class", fun)
head(res)
