datadr (version 0.8.4)

recombine: Recombine

Description

Apply an analytic recombination method to a ddo/ddf object and combine the results

Usage

recombine(data, combine = NULL, apply = NULL, output = NULL,
  overwrite = FALSE, params = NULL, packages = NULL, control = NULL,
  verbose = TRUE)

Arguments

data
an object of class "ddo" or "ddf"
combine
the method to combine the results (see, for example, combCollect, combDdo, combDdf, combRbind)
apply
a function specifying the analytic method to apply to each subset, or a pre-defined apply function (see drBLB, drGLM, for example). NOTE: This argument is now deprecated in favor of addTransform
output
a "kvConnection" object indicating where the output data should reside (see localDiskConn, hdfsConn). If NULL (default), output will be an in-memory "ddo" object
overwrite
logical; should existing output location be overwritten? (also can specify overwrite = "backup" to move the existing output to _bak)
params
a named list of objects external to the input data that are needed in the distributed computing (most should be taken care of automatically such that this is rarely necessary to specify)
packages
a vector of R package names that contain functions used in fn (most should be taken care of automatically such that this is rarely necessary to specify)
control
parameters specifying how the backend should handle things (most likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl
verbose
logical; should messages be printed about what is being done?
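
A hedged sketch (not part of this page's examples) of how the params argument can make an external dependency explicit; the cutoff value and keepLarge function below are illustrative and not part of the datadr API:

```r
library(datadr)

bySpecies <- divide(iris, by = "Species")

# an external object referenced inside the transform function
cutoff <- 5

# hypothetical transform that depends on 'cutoff'
keepLarge <- function(x) x[x$Sepal.Length > cutoff, , drop = FALSE]
transformed <- addTransform(bySpecies, keepLarge)

# 'params' ships 'cutoff' to the computation explicitly; with the
# in-memory backend this is usually handled automatically, as the
# argument description above notes
res <- recombine(transformed, combine = combRbind,
  params = list(cutoff = cutoff))
head(res)
```
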

Value

  • Depends on combine: this could be a distributed data object, a data frame, a key-value list, etc. See examples.

References

  • http://tessera.io
  • Guha, S., Hafen, R., Rounds, J., Xia, J., Li, J., Xi, B., & Cleveland, W. S. (2012). Large complex data: divide and recombine (D&R) with RHIPE. Stat, 1(1), 53-67. http://onlinelibrary.wiley.com/doi/10.1002/sta4.7/full

See Also

divide, ddo, ddf, drGLM, drBLB, combMeanCoef, combMean, combCollect, combRbind, drLapply

Examples

## in-memory example
##---------------------------------------------------------

# begin with an in-memory ddf (backed by kvMemory)
bySpecies <- divide(iris, by = "Species")

# create a function to calculate the mean for each variable
colMean <- function(x) data.frame(lapply(x, mean))

# apply the transformation
bySpeciesTransformed <- addTransform(bySpecies, colMean)

# recombination with no 'combine' argument and no argument to output
# produces the key-value list produced by 'combCollect()'
recombine(bySpeciesTransformed)

# but we can also preserve the distributed data frame, like this:
recombine(bySpeciesTransformed, combine = combDdf)

# or we can recombine using 'combRbind()' and produce a data frame:
recombine(bySpeciesTransformed, combine = combRbind)

## local disk connection example with parallelization
##---------------------------------------------------------

# create a 2-node cluster that can be used to process in parallel
cl <- parallel::makeCluster(2)

# create the control object we'll pass into local disk datadr operations
control <- localDiskControl(cluster = cl)
# note that setting options(defaultLocalDiskControl = control)
# will cause this to be used by default in all local disk operations

# create local disk connection to hold bySpecies data
ldPath <- file.path(tempdir(), "by_species")
ldConn <- localDiskConn(ldPath, autoYes = TRUE)

# convert in-memory bySpecies to local-disk ddf
bySpeciesLD <- convert(bySpecies, ldConn)

# apply the transformation
bySpeciesTransformed <- addTransform(bySpeciesLD, colMean)

# recombine the data using the transformation
bySpeciesMean <- recombine(bySpeciesTransformed,
  combine = combRbind, control = control)
bySpeciesMean
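
# (hedged sketch, not part of the original example) the 'output' argument
# can direct the recombined result to a "kvConnection" instead of memory;
# the path below is illustrative
outPath <- file.path(tempdir(), "species_means")
outConn <- localDiskConn(outPath, autoYes = TRUE)
bySpeciesOut <- recombine(bySpeciesTransformed, combine = combDdo,
  output = outConn, overwrite = TRUE, control = control)
bySpeciesOut
unlink(outPath, recursive = TRUE)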

# remove temporary directories
unlink(ldPath, recursive = TRUE)

# shut down the cluster
parallel::stopCluster(cl)
