Apply an analytic recombination method to a ddo/ddf object and combine the results
Usage

recombine(data, combine = NULL, apply = NULL, output = NULL,
  overwrite = FALSE, params = NULL, packages = NULL, control = NULL,
  verbose = TRUE)
Arguments

data
  an object of class "ddo" or "ddf"

combine
  the method to combine the results. See, for example, combCollect, combDdf, combDdo, combRbind, etc. If combine = NULL, combCollect is used when output = NULL and combDdo is used when output is specified.

apply
  a function specifying the analytic method to apply to each subset, or a pre-defined apply function (see drBLB or drGLM, for example). NOTE: this argument is deprecated in favor of addTransform.

output
  a "kvConnection" object indicating where the output data should reside (see localDiskConn, hdfsConn). If NULL (the default), the output will be an in-memory "ddo" object.

overwrite
  logical; should an existing output location be overwritten? Specify overwrite = "backup" to move the existing output to _bak instead of deleting it.

params
  a named list of objects external to the input data that are needed in the distributed computing (most are handled automatically, so this rarely needs to be specified)

packages
  a vector of R package names that contain functions used in fn (most are handled automatically, so this rarely needs to be specified)

control
  parameters specifying how the backend should handle things (most likely parameters to rhwatch in RHIPE); see rhipeControl and localDiskControl

verbose
  logical; print messages about what is being done
Value

Depends on combine: this could be a distributed data object, a data frame, a key-value list, etc. See examples.
See Also

divide, ddo, ddf, drGLM, drBLB, combMeanCoef, combMean, combCollect, combRbind, drLapply
Examples
## in-memory example
##---------------------------------------------------------
# begin with an in-memory ddf (backed by kvMemory)
bySpecies <- divide(iris, by = "Species")
# create a function to calculate the mean for each variable
colMean <- function(x) data.frame(lapply(x, mean))
# apply the transformation
bySpeciesTransformed <- addTransform(bySpecies, colMean)
# recombination with no 'combine' argument and no argument to output
# produces the key-value list produced by 'combCollect()'
recombine(bySpeciesTransformed)
# but we can also preserve the distributed data frame, like this:
recombine(bySpeciesTransformed, combine = combDdf)
# or we can recombine using 'combRbind()' and produce a data frame:
recombine(bySpeciesTransformed, combine = combRbind)
## local disk connection example with parallelization
##---------------------------------------------------------
# create a 2-node cluster that can be used to process in parallel
cl <- parallel::makeCluster(2)
# create the control object we'll pass into local disk datadr operations
control <- localDiskControl(cluster = cl)
# note that setting options(defaultLocalDiskControl = control)
# will cause this to be used by default in all local disk operations
# create local disk connection to hold bySpecies data
ldPath <- file.path(tempdir(), "by_species")
ldConn <- localDiskConn(ldPath, autoYes = TRUE)
# convert in-memory bySpecies to local-disk ddf
bySpeciesLD <- convert(bySpecies, ldConn)
# apply the transformation
bySpeciesTransformed <- addTransform(bySpeciesLD, colMean)
# recombine the data using the transformation
bySpeciesMean <- recombine(bySpeciesTransformed,
combine = combRbind, control = control)
bySpeciesMean
# remove temporary directories
unlink(ldPath, recursive = TRUE)
# shut down the cluster
parallel::stopCluster(cl)
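The examples above never pass the output argument. As a hedged sketch (assuming the datadr package is attached and a transformed local-disk object like bySpeciesTransformed from the example above exists; outPath is a hypothetical location), persisting the recombined result to disk might look like this:

```r
# sketch only: assumes datadr is loaded and bySpeciesTransformed exists
# as created in the local disk example above
outPath <- file.path(tempdir(), "by_species_out")
outConn <- localDiskConn(outPath, autoYes = TRUE)

# write the recombined result to the local-disk connection;
# when output is specified, combDdo is the default combine method,
# and overwrite = TRUE replaces any existing data at outPath
bySpeciesOut <- recombine(bySpeciesTransformed,
  combine = combDdo, output = outConn, overwrite = TRUE)

# clean up the hypothetical output location
unlink(outPath, recursive = TRUE)
```

Using overwrite = "backup" instead of TRUE would move any existing output aside to a _bak location rather than deleting it.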