doMPI-package:

Description

The doMPI package provides a parallel backend for the foreach package. It is similar to the doSNOW package, but uses Rmpi directly, which allows it to do more and to execute more efficiently. It can also use the multicore package to execute tasks across multiple cores on the worker nodes, which can give very good performance on a computer cluster with multicore processors.

Details

There are several backend-specific options that can be specified when using doMPI. They are specified to foreach as a list using the .options.mpi argument. The currently supported options are:

`chunkSize`	Number of tasks to send at a time to the cluster workers
`info`	Display extra information, particularly about exported variables
`initEnvir`	A function to be called on each worker before executing any tasks
`initArgs`	List of extra arguments to pass to the `initEnvir` function
`initEnvirMaster`	A function called on the master at the same time as `initEnvir`
`initArgsMaster`	List of extra arguments to pass to the `initEnvirMaster` function
`finalEnvir`	A function to be called on each worker after executing all tasks
`finalArgs`	List of extra arguments to pass to the `finalEnvir` function
`profile`	Display profiling information from the master's point of view
`bcastThreshold`	Used to decide whether to piggy-back or broadcast job data
`forcePiggyback`	Always piggy-back job environment with first task to each worker
`nocompile`	Don't compile the R expression
`seed`	Starting seed for tasks

The chunkSize option is particularly important, since it can be much more efficient to send more than one task at a time to the workers, particularly when the tasks execute quickly. Also, it can allow the workers to execute those tasks in parallel using the mclapply function from the multicore package. The default value is 1.
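As a rough illustration of the chunkSize option, the sketch below sends tasks to the workers in batches of 25. The worker count of 2 and the trivial loop body are assumptions chosen for the example; a suitable chunk size depends on how long each task takes.

```r
library(doMPI)

# Assumption: two workers can be spawned on this machine.
cl <- startMPIcluster(count = 2)
registerDoMPI(cl)

# Send 25 tasks at a time to each worker instead of the default of 1.
opts <- list(chunkSize = 25)
squares <- foreach(i = 1:1000, .combine = "c", .options.mpi = opts) %dopar% {
  i^2
}

closeCluster(cl)
```

With very short tasks such as this one, larger chunks mean fewer messages between the master and the workers for the same amount of work.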

The info option is used to print general information that is specific to the doMPI backend. This includes information on what variables are exported, for example. The default value is FALSE.

The initEnvir option is useful for preparing the workers to execute the subsequent tasks. The execution environment is passed as the first argument to this function. That allows you to define new variables in the environment, for example. If initArgs is defined, the contents of the list will be passed as arguments to the initEnvir function after the environment object.
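A minimal sketch of initEnvir with initArgs follows. The initializer function and the variable name shift_by are hypothetical, and the sketch assumes that the elements of initArgs are matched to the initializer's parameters by name.

```r
library(doMPI)

# Assumption: two workers can be spawned on this machine.
cl <- startMPIcluster(count = 2)
registerDoMPI(cl)

# Hypothetical initializer: the execution environment arrives first,
# followed by the contents of initArgs.
initEnvir <- function(envir, shift_by) {
  assign("shift_by", shift_by, envir = envir)
}

opts <- list(initEnvir = initEnvir, initArgs = list(shift_by = 100))

# shift_by is not defined on the master; each worker finds it in the
# execution environment prepared by initEnvir.
shifted <- foreach(i = 1:4, .combine = "c", .options.mpi = opts) %dopar% {
  i + shift_by
}

closeCluster(cl)
```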

The initEnvirMaster option is useful if you want to send data from the master to the workers explicitly, perhaps using mpi.bcast. This avoids object serialization, which could improve performance for large matrices, for example. The initArgsMaster option works like initArgs, but it is probably less useful, since the initEnvirMaster function runs locally and can access variables via lexical scoping.

The finalEnvir option is useful for “finalizing” the execution environment. It works in much the same way as the initEnvir option, taking extra arguments from a list specified with the finalArgs option.
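One possible use of finalEnvir is to clean up resources that initEnvir created. The sketch below is illustrative only: the scratch directory is a made-up example, and it assumes that finalEnvir receives the same execution environment as its first argument, as initEnvir does.

```r
library(doMPI)

# Assumption: two workers can be spawned on this machine.
cl <- startMPIcluster(count = 2)
registerDoMPI(cl)

# Hypothetical setup: create a per-worker scratch directory and record
# its path in the execution environment.
initEnvir <- function(envir) {
  envir$scratch <- tempfile("doMPI-scratch-")
  dir.create(envir$scratch)
}

# Hypothetical teardown: remove the scratch directory after all tasks.
finalEnvir <- function(envir) {
  unlink(envir$scratch, recursive = TRUE)
}

opts <- list(initEnvir = initEnvir, finalEnvir = finalEnvir)

# Each task reports whether its worker's scratch directory is available.
ok <- foreach(i = 1:4, .combine = "c", .options.mpi = opts) %dopar% {
  dir.exists(scratch)
}

closeCluster(cl)
```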

The profile option is used to print profiling information at the end of the %dopar% execution. It basically lists the time spent sending tasks to the workers and retrieving results from them. The default value is FALSE.

The bcastThreshold option is used to decide whether to piggy-back the job data or to broadcast it. The job data is serialized, and if it is smaller than bcastThreshold, it is piggy-backed; otherwise, it is broadcast. Note that if you want to force piggy-backing, you should use the forcePiggyback option rather than setting bcastThreshold to a very large value. That avoids serializing the job data twice, which can be time consuming.

The forcePiggyback option is used to force the job data to be “piggy-backed” with the first task to each of the workers. If the value is FALSE, the data may still be piggy-backed, but it is not guaranteed. In general, the job data is only piggy-backed if it is relatively small. The default value is FALSE.
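The following sketch shows how forcePiggyback might be set so that the job data always travels with the first task to each worker. The matrix and its size are arbitrary choices for the example, and whether piggy-backing is faster than broadcasting depends on the data and the cluster.

```r
library(doMPI)

# Assumption: two workers can be spawned on this machine.
cl <- startMPIcluster(count = 2)
registerDoMPI(cl)

# Job data that the loop body refers to, so it is exported to the workers.
big <- matrix(runif(1e4), nrow = 100)

# Always piggy-back the job data instead of letting bcastThreshold decide.
opts <- list(forcePiggyback = TRUE)
sums <- foreach(j = seq_len(ncol(big)), .combine = "c",
                .options.mpi = opts) %dopar% {
  sum(big[, j])
}

closeCluster(cl)
```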

The nocompile option is used to disable compilation of the R expression in the body of the foreach loop. The default value is FALSE.

The seed option is used for achieving reproducible results. If set to a single numeric value, such as 27, it is converted to a value that can be passed to the nextRNGSubStream function from the parallel package. This value is assigned to the global .Random.seed variable on whichever cluster worker executes the first task (or task chunk). The nextRNGSubStream function is then used to generate the values that are assigned to .Random.seed when executing subsequent tasks. Thus, RNG substreams are associated with tasks rather than with workers. This is necessary for reproducible results, since the doMPI package uses load balancing techniques that can result in different tasks being executed by different workers on different runs of the same foreach loop. The default value of the seed option is NULL.
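The idea of tying substreams to tasks can be illustrated without a cluster, using only the parallel package. This is a sketch of the principle, not doMPI's internal code: in particular, deriving the starting value with set.seed under the "L'Ecuyer-CMRG" generator is an assumption, and run_task is a hypothetical stand-in for a worker executing one task.

```r
library(parallel)

# Assumption: derive the starting seed vector from a single number by
# seeding the L'Ecuyer-CMRG generator. Note this changes the session's RNG kind.
RNGkind("L'Ecuyer-CMRG")
set.seed(27)
s <- .Random.seed

# One substream per task: task k always gets the k-th substream.
task_seeds <- vector("list", 4)
for (k in seq_along(task_seeds)) {
  task_seeds[[k]] <- s
  s <- nextRNGSubStream(s)
}

# Hypothetical stand-in for a worker running a task: install the task's
# seed in the global .Random.seed, then draw a random number.
run_task <- function(seed) {
  assign(".Random.seed", seed, envir = globalenv())
  runif(1)
}

# Running the tasks in a different order yields the same per-task results.
forward  <- vapply(task_seeds, run_task, numeric(1))
backward <- rev(vapply(rev(task_seeds), run_task, numeric(1)))
same <- identical(forward, backward)
```

In doMPI itself, the option would be supplied to foreach as .options.mpi = list(seed = 27).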

Additional documentation is available on the following functions:

`startMPIcluster`	Create and start an MPI cluster object
`registerDoMPI`	Register a cluster object to be used with %dopar%
`closeCluster`	Shutdown and close a cluster object
`clusterSize`	Return the number of workers associated with a cluster object
`setRngDoMPI`	Initialize parallel random number generation on a cluster
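Taken together, these functions might be used as in the sketch below. The worker count and the seed value are arbitrary assumptions; see the individual help pages for the full argument lists.

```r
library(doMPI)

# Create and start a cluster (assumption: two workers can be spawned).
cl <- startMPIcluster(count = 2)

# Make the cluster the backend used by %dopar%.
registerDoMPI(cl)

# Query how many workers the cluster has.
n_workers <- clusterSize(cl)

# Initialize parallel random number generation on the workers.
setRngDoMPI(cl, seed = 1234)

# Run one task per worker.
draws <- foreach(i = seq_len(n_workers), .combine = "c") %dopar% {
  runif(1)
}

# Shut down the workers when finished.
closeCluster(cl)
```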

For a complete list of functions with individual help pages, use library(help="doMPI"). Use the command vignette("doMPI") to view the vignette entitled “Introduction to doMPI”. Also, there are a number of doMPI example scripts in the examples directory of the doMPI installation.