Parses a large list into subsets and submits a separate batch R job for each subset, which calls lapply on that subset. plapply has some features that may not be readily available in other parallelization functions like mclapply and parLapply:
The .Rout files produced by each R instance are easily accessible for convenient debugging of errors or warnings. The .Rout files can also serve as an explicit record of the work that was performed by the workers.
Three options are available for the ordering of the processing of the list elements: the original list order, randomized, or collated (first-in-first-out).
In each R instance, pre-processing or post-processing steps can be performed before and after the call to lapply. These pre-processing and post-processing steps can depend on the instance of R, such that each instance can be treated differently, if desired.
These features give greater control over the computing process, which can be especially useful for large jobs.
plapply(X, FUN, ..., njobs = parallel::detectCores() - 1, packages = NULL,
        header.file = NULL, needed.objects = NULL,
        needed.objects.env = parent.frame(), workDir = "plapply",
        clobber = TRUE, max.hours = 24, check.interval.sec = 1,
        collate = FALSE, random.seed = NULL, rout = NULL, clean.up = TRUE,
        verbose = FALSE)
A list or vector, each element of which will be the input to FUN
A function whose first argument is an element of X
Additional named arguments to FUN
The number of jobs (subsets). Defaults to one less than the number of cores on the machine.
Character vector giving the names of packages that will be loaded in each new instance of R, using library.
Text string indicating a file that will be sourced prior to calling lapply in order to create an 'environment' that will satisfy all potential dependencies for FUN. If NULL, no file is sourced.
Character vector giving the names of objects that reside in the environment specified by needed.objects.env and that may be needed by FUN. These objects are loaded into the global environment of each new instance of R that is launched. If NULL, no additional objects are passed.
Environment where needed.objects reside. This defaults to the environment in which plapply is called.
Character string giving the name of the working directory that will be used for the files needed to launch the separate instances of R.
Logical indicating whether the directory designated by workDir will be overwritten if it exists and contains files. If clobber = FALSE and workDir contains files, plapply throws an error.
The maximum number of hours to wait for the njobs to complete.
The number of seconds to wait between checking to see whether all njobs have completed.
collate = TRUE creates a 'first-in-first-out' processing order of the elements of the input list X. This logical is passed to the collate argument of parseJob.
An integer setting the random seed, which will result in randomizing the elements of the list assigned to each job. This is useful when the computing time for each element varies significantly because it helps to even out the run times of the parallel jobs. If random.seed = NULL, no randomization is performed and the elements of the input list are subdivided sequentially among the jobs. This variable is passed to the random.seed argument of parseJob. If collate = TRUE, no randomization is performed and random.seed is ignored.
A character string giving the name of the file into which all of the .Rout files will be gathered. If rout = NULL, the .Rout files are not gathered, but are left in workDir.
clean.up = TRUE will delete the working directory.
verbose = TRUE prints messages which show the progress of the jobs.
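As a rough illustration of how needed.objects and needed.objects.env could work together, the sketch below collects objects by name from an environment, writes them to a file, and loads them elsewhere. It uses only base R; the names obj.file and worker.env are hypothetical, and this is an assumption about the general mechanism, not the code plapply actually runs.

```r
# Hypothetical illustration using base R only; not the plapply source code.
b1 <- c(2.5, 1.0, 4.0)
b2 <- c(3L, 7L)

needed.objects <- c("b1", "b2")
needed.objects.env <- environment()   # stands in for parent.frame()

# Write the named objects to a temporary file
obj.file <- tempfile(fileext = ".Rdata")
save(list = needed.objects, envir = needed.objects.env, file = obj.file)

# A worker would load the file into its global environment; a fresh
# environment stands in for that here
worker.env <- new.env()
load(obj.file, envir = worker.env)

# The worker now sees the same objects under the same names
print(sort(ls(worker.env)))
```

Passing names instead of the objects themselves keeps the call short and lets the same mechanism serve any number of dependencies of FUN.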
A list equivalent to that returned by lapply(X, FUN, ...).
plapply applies FUN to each element of the list X by parsing the list into njobs lists of equal (or almost equal) size and then applying FUN to each sublist using lapply.
A separate batch instance of R is launched for each sublist, thus utilizing another core of the machine. After the jobs complete, the njobs output lists are reassembled. The global environments for each batch instance of R are created by writing/reading data to/from disc.
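The split-apply-reassemble idea described above can be sketched in a single R session. This is a minimal illustration under the assumption of a sequential split into contiguous blocks; it runs the sublists one after another instead of launching batch jobs, and the names job.id, sublists, and out.i are hypothetical.

```r
# Single-session sketch of the split / lapply / reassemble pattern.
# No batch jobs are launched here; this is not the plapply implementation.
X <- list(a = 1:3, b = 4:6, c = 7:9, d = 10:12, e = 13:15)
FUN <- function(x) sum(x)
njobs <- 2

# Assign contiguous blocks of (almost) equal size to each job: 1 1 1 2 2
job.id <- sort(rep(seq_len(njobs), length.out = length(X)))

# Build one sublist per job
index.sets <- split(seq_along(X), job.id)
sublists <- lapply(index.sets, function(i) X[i])

# Each "job" calls lapply on its own sublist
out.i <- lapply(sublists, function(X.i) lapply(X.i, FUN))

# Reassemble the per-job output lists into one list
res <- unlist(unname(out.i), recursive = FALSE)

print(identical(res, lapply(X, FUN)))
```

Because the blocks are contiguous, concatenating the per-job outputs already restores the original order of X.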
If collate = TRUE or random.seed is set to an integer value, the output list returned by plapply is reordered to reflect the original ordering of the input list, X.
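The reordering step can be illustrated with a small sketch. It assumes that collating deals elements out to jobs in round-robin fashion and that randomizing shuffles the indices before cutting them into blocks; the exact assignment used by parseJob may differ, and run.jobs is a hypothetical helper.

```r
# Sketch of restoring the input order after a collated or randomized split.
# The assignment schemes below are assumptions, not the parseJob source.
X <- list(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6)
FUN <- function(x) x^2
njobs <- 2
n <- length(X)

# Collated: job 1 gets elements 1, 3, 5; job 2 gets elements 2, 4, 6
collated <- split(seq_len(n), rep(seq_len(njobs), length.out = n))

# Randomized: shuffle the indices, then cut them into contiguous blocks
set.seed(7)
shuffled <- sample(n)
randomized <- split(shuffled, sort(rep(seq_len(njobs), length.out = n)))

run.jobs <- function(index.sets) {
  # Process each job's elements, then concatenate the per-job outputs
  out <- lapply(index.sets, function(i) lapply(X[i], FUN))
  res <- unlist(unname(out), recursive = FALSE)
  # Reorder the results to match the original ordering of X
  processed <- unlist(index.sets, use.names = FALSE)
  res[order(processed)]
}

print(identical(run.jobs(collated), lapply(X, FUN)))
print(identical(run.jobs(randomized), lapply(X, FUN)))
```

Recording which input index each output came from is what makes the final reordering possible, whatever scheme assigned elements to jobs.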
An object called process.id (consisting of an integer indicating the process number) is available in the global environment of each instance of R.
Each instance of R runs a script that performs the following steps:
Any other packages indicated in the packages argument are loaded via calls to library()
The process.id global variable is assigned to the global environment of the R instance (having been passed in via a command line argument)
The header file (if there is one) is sourced
The expression pre.process.expression is evaluated if an object of that name is present in the global environment. The object pre.process.expression may be passed in via the header file or via needed.objects
lapply is called on the sublist, which is named X.i
The expression post.process.expression is evaluated if an object of that name is present in the global environment. The object post.process.expression may be passed in via the header file or via needed.objects
The output returned by lapply is assigned to the object X.i.out, and is saved to a temporary file where it will be collected after all jobs have completed
Warnings are printed
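The steps above can be sketched as a small stand-alone script. This is a hypothetical approximation of what a worker might run, not the script that plapply generates: process.id, X.i, and FUN are set by hand here, package loading and header sourcing are omitted, and out.file and done.msg are illustrative names.

```r
# Hypothetical worker sketch; values that would normally arrive via the
# command line, the header file, or needed.objects are set by hand.
process.id <- 1L
X.i <- list(a = 1:3, b = 4:6)
FUN <- function(x) mean(x) + offset

# Expressions that depend on process.id, so each instance can differ
pre.process.expression <- expression(offset <- 10 * process.id)
post.process.expression <- expression(
  done.msg <- paste("process", process.id, "finished")
)

# Evaluate the pre-processing expression only if it exists
if (exists("pre.process.expression")) eval(pre.process.expression)

# Apply FUN to the sublist
X.i.out <- lapply(X.i, FUN)

# Evaluate the post-processing expression only if it exists
if (exists("post.process.expression")) eval(post.process.expression)

# Save the output so that it can be collected later
out.file <- tempfile(fileext = ".Rdata")
save(X.i.out, file = out.file)

# Print any warnings that were raised
print(warnings())
```

Storing the pre- and post-processing steps as unevaluated expressions lets them refer to process.id, which is only known once the instance has started.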
If njobs = 1, none of the previous steps are executed; only this call is made: lapply(X, FUN, ...)
# NOT RUN {
# Create a simple list
a <- list(a = rnorm(10), b = rnorm(20), c = rnorm(15), d = rnorm(13),
          e = rnorm(15), f = rnorm(22))
# Some objects that will be needed by f1:
b1 <- rexp(20)
b2 <- rpois(10, 20)
# The function
f1 <- function(x) mean(x) + max(b1) - min(b2)
# Call plapply
res1 <- plapply(a, f1, njobs = 2, needed.objects = c("b1", "b2"),
                check.interval.sec = 0.5, max.hours = 1/120,
                workDir = "example1", rout = "example1.Rout",
                clean.up = FALSE)
print(res1)
# Look at the collated 'Rout' file
more("example1.Rout")
# Look at the contents of the working directory
dir("example1")
# Remove working directory and Rout file
unlink("example1", recursive = TRUE, force = TRUE)
unlink("example1.Rout")
# Verify the result with lapply
res2 <- lapply(a, f1)
# Compare results
identical(res1, res2)
# }