fileplyr (version 0.2.0)

ddfply: ddfply

Description

Performs chunk processing or split-apply-combine on the data in a distributed data frame (ddf).

Usage

ddfply(ddfdir, groupby, fun = identity, collect = "none",
  temploc = getwd(), nbins = 10, chunk = 50000, spill = 1e+06,
  cores = 1, buffer = 1e+09, ...)

Arguments

ddfdir
(string) path of ddf directory
groupby
(character vector) Column names used to split the data (if missing, fun is applied on each chunk)
fun
(object of class function) function to apply on each subset after the split
collect
(string) Collect the result as list or dataframe or none. none keeps the resulting ddo on disk.
temploc
(string) Path where intermediary files are kept
nbins
(positive integer) Number of directories into which the distributed dataframe (ddf) or distributed data object (ddo) is distributed
chunk
(positive integer) Number of rows of the file to be read at a time
spill
(positive integer) Maximum number of rows of any subset resulting from split
cores
(positive integer) Number of cores to be used in parallel
buffer
(positive integer) Size of batches of key-value pairs to be passed to the map OR Size of the batches of key-value pairs to flush to intermediate storage from the map output OR Size of the batches of key-value pairs to send to the reduce
...
Arguments to be passed as is to the data.table function.
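
The chunk argument and the "fun is applied on each chunk" behaviour of a missing groupby can be pictured with a small base R sketch. This is only an illustration under stated assumptions, not fileplyr's implementation: apply_by_chunk is a hypothetical helper, and it reads a plain CSV file rather than a ddf directory.

```r
# Hypothetical helper, not part of fileplyr: apply `fun` to successive
# chunks of at most `chunk` rows read from a comma-separated file.
apply_by_chunk <- function(path, fun, chunk = 10L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Read the header line once so that every chunk gets the same column names
  cols <- strsplit(readLines(con, n = 1L), ",", fixed = TRUE)[[1]]
  cols <- gsub('"', "", cols, fixed = TRUE)
  results <- list()
  repeat {
    lines <- readLines(con, n = chunk)
    if (length(lines) == 0L) break   # end of file reached
    piece <- read.csv(text = lines, header = FALSE, col.names = cols)
    results[[length(results) + 1L]] <- fun(piece)
  }
  results
}

csv <- tempfile(fileext = ".csv")
write.table(mtcars, csv, row.names = FALSE, sep = ",")
# Count the rows seen in each chunk of at most 10 rows
row_counts <- apply_by_chunk(csv, nrow, chunk = 10L)
unlink(csv)
```

Only one chunk is held in memory at a time, which is the reason for reading a large file in pieces in the first place.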

Value

A list, a dataframe, or TRUE (when collect is 'none').
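
As a rough picture of what a list result versus a dataframe result means for split-apply-combine, the following base R sketch runs the three steps in memory. It is a conceptual analogy only: summarise_piece is a hypothetical function chosen for illustration, and the exact shape of the objects returned by ddfply may differ.

```r
# Split: one data frame per distinct value of the grouping column
pieces <- split(mtcars, mtcars$gear)

# Apply: a hypothetical summary function run on each subset
summarise_piece <- function(d) {
  data.frame(gear = d$gear[1], n = nrow(d), mean_mpg = mean(d$mpg))
}
results <- lapply(pieces, summarise_piece)

# Combine: keep the per-group results as a list ...
as_list <- results
# ... or bind them row-wise into a single data frame
as_df <- do.call(rbind, results)
```

Here as_list holds one element per group, while as_df holds one row per group.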

Details

See fileply.

Examples

write.table(mtcars, "mtcars.csv", row.names = FALSE, sep = ",")
# create a ddf and keep it on disk by setting `keepddf = TRUE`
co <- capture.output(temp <- fileply("mtcars.csv"
                                     , groupby = c("carb", "gear")
                                     , fun     = identity
                                     , collect = "list"
                                     , sep     =  ","
                                     , header  = TRUE
                                     , keepddf = TRUE)
                     , file = NULL
                     , type = "message"
                     )
# use the ddf instead of reading the CSV again
# (the ddf root directory is parsed from the sixth captured message line)
ddfroot <- strsplit(co[6], ": ")[[1]][2]
temp2 <- ddfply(file.path(ddfroot, "data")
                , groupby = c("gear")
                , fun     = identity
                , collect = "list"
                , sep     =  ","
                , header  = TRUE
                )
temp2
# clean up the CSV file and the ddf directory
unlink("mtcars.csv")
unlink(ddfroot, recursive = TRUE)
