plapply: Simple parallelization of lapply

Description

Parses a large list into subsets and submits a separate batch R job for each subset, which calls lapply on that subset. plapply has some features that may not be readily available in other parallelization functions like mclapply and parLapply:

The .Rout files produced by each R instance are easily accessible for convenient debugging of errors or warnings. The .Rout files can also serve as an explicit record of the work that was performed by the workers.
Three options are available for the order in which the list elements are processed: the original list order, randomized, or collated (first-in-first-out).
In each R instance, pre-processing and post-processing steps can be performed before and after the call to lapply, respectively.

These pre-processing and post-processing steps can depend on the instance of R, such that each instance can be treated differently, if desired. These features give greater control over the computing process, which can be especially useful for large jobs.

Usage

plapply(X, FUN, ..., njobs = parallel::detectCores() - 1, packages = NULL,
  header.file = NULL, needed.objects = NULL,
  needed.objects.env = parent.frame(), workDir = "plapply",
  clobber = TRUE, max.hours = 24, check.interval.sec = 1,
  collate = FALSE, random.seed = NULL, rout = NULL, clean.up = TRUE,
  verbose = FALSE)

Arguments

X

A list or vector, each element of which will be the input to FUN

FUN

A function whose first argument is an element of X

…

Additional named arguments to FUN

njobs

The number of jobs (subsets). Defaults to one less than the number of cores on the machine.

packages

Character vector giving the names of packages that will be loaded in each new instance of R, using library.

header.file

Text string indicating a file that will be sourced prior to calling lapply in order to create an 'environment' that will satisfy all potential dependencies for FUN. If NULL, no file is sourced.

needed.objects

Character vector giving the names of objects, residing in the environment specified by needed.objects.env, that may be needed by FUN. These objects are loaded into the global environment of each new instance of R that is launched. If NULL, no additional objects are passed.

needed.objects.env

Environment where needed.objects reside. This defaults to the environment in which plapply is called.

workDir

Character string giving the name of the working directory that will be used for the files needed to launch the separate instances of R.

clobber

Logical indicating whether the directory designated by workDir will be overwritten if it exists and contains files. If clobber = FALSE, and workDir contains files, plapply throws an error.

max.hours

The maximum number of hours to wait for the njobs to complete.

check.interval.sec

The number of seconds to wait between checking to see whether all njobs have completed.
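
To make the interplay of max.hours and check.interval.sec concrete, here is a minimal sketch of a polling loop in base R. It assumes that job completion is detected by the appearance of one output file per job, which this page does not confirm. The helper wait_for_files is hypothetical and is not part of plapply.

```r
# Hypothetical helper: poll until all files exist or the time limit passes.
# Assumption: each job signals completion by writing a file.
wait_for_files <- function(files, max.hours = 24, check.interval.sec = 1) {
  deadline <- Sys.time() + max.hours * 3600  # convert hours to seconds
  while (!all(file.exists(files))) {
    if (Sys.time() > deadline) {
      stop("Jobs did not complete within ", max.hours, " hours")
    }
    Sys.sleep(check.interval.sec)  # pause between checks
  }
  invisible(TRUE)
}

# Usage: a file that already exists is detected on the first check
done_file <- tempfile()
created <- file.create(done_file)
finished <- wait_for_files(done_file, max.hours = 1 / 120,
                           check.interval.sec = 0.5)
unlink(done_file)
```

A shorter check interval notices completion sooner at the cost of more frequent file system checks.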

collate

collate = TRUE creates a 'first-in-first-out' processing order of the elements of the input list X. This logical is passed to the collate argument of parseJob.

random.seed

An integer setting the random seed, which will result in randomizing the elements of the list assigned to each job. This is useful when the computing time for each element varies significantly because it helps to even out the run times of the parallel jobs. If random.seed = NULL, no randomization is performed and the elements of the input list are subdivided sequentially among the jobs. This variable is passed to the random.seed argument of parseJob. If collate = TRUE, no randomization is performed and random.seed is ignored.

rout

A character string giving the name of the file into which all of the .Rout files will be gathered. If rout = NULL, the .Rout files are not gathered, but are left in workDir.

clean.up

clean.up = TRUE will delete the working directory.

verbose

verbose = TRUE prints messages that show the progress of the jobs.

Value

A list equivalent to that returned by lapply(X, FUN, ...).

Details

plapply applies FUN to each element of the list X by parsing the list into njobs sublists of equal (or almost equal) size and then applying FUN to the elements of each sublist using lapply.
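
As an illustration of how element indices might be divided among jobs under the three orderings controlled by collate and random.seed, here is a minimal base R sketch. The helper assign_jobs is hypothetical. It assumes that collation means round-robin assignment and makes no claim to reproduce the exact output of parseJob.

```r
# Hypothetical helper: assign the indices 1..n to njobs subsets.
# Assumption: collate = TRUE means round-robin assignment.
assign_jobs <- function(n, njobs, collate = FALSE, random.seed = NULL) {
  idx <- seq_len(n)
  if (collate) {
    # Round-robin: job 1 gets elements 1, 1 + njobs, ...; random.seed is ignored
    groups <- rep(seq_len(njobs), length.out = n)
  } else {
    if (!is.null(random.seed)) {
      set.seed(random.seed)
      idx <- sample(idx)  # shuffle before cutting into blocks
    }
    # Contiguous blocks of equal (or almost equal) size
    groups <- sort(rep(seq_len(njobs), length.out = n))
  }
  unname(split(idx, groups))
}

sequential <- assign_jobs(7, 3)                   # 1 2 3 | 4 5 | 6 7
collated   <- assign_jobs(7, 3, collate = TRUE)   # 1 4 7 | 2 5 | 3 6
randomized <- assign_jobs(7, 3, random.seed = 1)  # shuffled, same block sizes
```

In every case each index lands in exactly one subset, so no element is skipped or processed twice.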

A separate batch instance of R is launched for each sublist, thus utilizing another core of the machine. After the jobs complete, the njobs output lists are reassembled. The global environment of each batch instance of R is created by writing data to, and reading it from, disk.
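
The idea of passing objects between R instances through disk can be sketched in base R with save() and load(). This only illustrates the general mechanism and is not necessarily how plapply implements it. The environments caller_env and worker_env stand in for the calling environment and the global environment of a new R instance, and both names are hypothetical.

```r
# Stand-in for the environment given by needed.objects.env (hypothetical name)
caller_env <- new.env()
assign("b1", c(1.5, 2.5), envir = caller_env)
assign("b2", 3L, envir = caller_env)

# Write the named objects to a file on disk
needed.objects <- c("b1", "b2")
tmp_file <- tempfile(fileext = ".RData")
save(list = needed.objects, envir = caller_env, file = tmp_file)

# Stand-in for the global environment of a freshly launched R instance
worker_env <- new.env()
loaded <- load(tmp_file, envir = worker_env)  # returns the names it restored
unlink(tmp_file)
```

After load(), the objects b1 and b2 are available in worker_env under their original names.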

If collate = TRUE or random.seed is set to an integer, the output list returned by plapply is reordered to reflect the original ordering of the input list, X.
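
The reordering step can be sketched in base R as follows. The example shuffles the element positions, processes the blocks in-process with lapply (rather than in separate R instances), and then undoes the shuffle. All variable names are illustrative, and the code is not taken from plapply.

```r
X <- list(a = 1:3, b = 4:6, c = 7:9, d = 10:12, e = 13:15)
FUN <- function(x) sum(x)
njobs <- 2

# Shuffle the positions, then cut them into njobs contiguous blocks
set.seed(42)
shuffled <- sample(seq_along(X))
job_of <- sort(rep(seq_len(njobs), length.out = length(X)))
index_sets <- split(shuffled, job_of)

# Each block would go to its own R instance; here lapply runs in-process
out_by_job <- lapply(index_sets, function(idx) lapply(X[idx], FUN))

# Reassemble the per-job outputs, then restore the original order of X
combined <- unlist(unname(out_by_job), recursive = FALSE)
result <- combined[order(shuffled)]

same_as_lapply <- identical(result, lapply(X, FUN))
```

Indexing with order(shuffled) inverts the permutation, so the element processed from original position j returns to position j.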

An object called process.id (consisting of an integer indicating the process number) is available in the global environment of each instance of R.

Each instance of R runs a script that performs the following steps:

Any packages indicated in the packages argument are loaded via calls to library().
The process.id global variable is assigned to the global environment of the R instance (having been passed in via a command line argument).
The header file (if there is one) is sourced.
The expression pre.process.expression is evaluated if an object of that name is present in the global environment. The object pre.process.expression may be passed in via the header file or via needed.objects.
lapply is called on the sublist, which is named X.i.
The expression post.process.expression is evaluated if an object of that name is present in the global environment. The object post.process.expression may be passed in via the header file or via needed.objects.
The output returned by lapply is assigned to the object X.i.out and is saved to a temporary file, from which it will be collected after all jobs have completed.
Warnings are printed.

If njobs = 1, none of the previous steps are executed; only this call is made: lapply(X, FUN, ...)
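
The per-instance steps listed above can be mimicked inside a single R session. The sketch below uses an environment named worker (a hypothetical stand-in for the global environment of one batch instance) and assumes that pre.process.expression and post.process.expression are objects created with expression() and evaluated with eval(). The page does not specify these details, so treat them as assumptions.

```r
# Stand-in for the global environment of one batch R instance
worker <- new.env()
worker$process.id <- 2L

# Objects that might arrive via the header file or needed.objects
# (assumption: they are created with expression())
worker$pre.process.expression <- expression(offset <- 100 * process.id)
worker$post.process.expression <-
  expression(status <- paste("process", process.id, "done"))

# The sublist and a function that depends on the pre-processing step
worker$X.i <- list(a = 1:3, b = 4:6)
f <- function(x) sum(x) + offset
environment(f) <- worker  # so f can find 'offset' in the worker environment

# Pre-processing, only if the expression exists
if (exists("pre.process.expression", envir = worker, inherits = FALSE)) {
  eval(worker$pre.process.expression, envir = worker)
}

# The call to lapply on the sublist
out <- lapply(worker$X.i, f)

# Post-processing, only if the expression exists
if (exists("post.process.expression", envir = worker, inherits = FALSE)) {
  eval(worker$post.process.expression, envir = worker)
}

# Store the output as X.i.out and save it to a temporary file
worker$X.i.out <- out
out_file <- tempfile(fileext = ".RData")
save(list = "X.i.out", envir = worker, file = out_file)
```

Because the expressions can refer to process.id, each instance can be treated differently, for example by using a different offset per job as in this sketch.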

Examples

# NOT RUN {
# Create a simple list
a <- list(a = rnorm(10), b = rnorm(20), c = rnorm(15), d = rnorm(13),
          e = rnorm(15), f = rnorm(22))

# Some objects that will be needed by f1:
b1 <- rexp(20)
b2 <- rpois(10, 20)

# The function
f1 <- function(x) mean(x) + max(b1) - min(b2)

# Call plapply
res1 <- plapply(a, f1, njobs = 2, needed.objects = c("b1", "b2"),
                check.interval.sec = 0.5, max.hours = 1/120,
                workDir = "example1", rout = "example1.Rout",
                clean.up = FALSE)

print(res1)

# Look at the collated 'Rout' file
more("example1.Rout")

# Look at the contents of the working directory
dir("example1")

# Remove working directory and Rout file
unlink("example1", recursive = TRUE, force = TRUE)
unlink("example1.Rout")
 
# Verify the result with lapply
res2 <- lapply(a, f1)

# Compare results
identical(res1, res2)
# }
