makeClusterFuture: Create a Future Cluster of Stateless Workers for Parallel Processing

Description

WARNING: This sets up a stateless set of cluster nodes, which means that clusterEvalQ(cl, { a <- 3.14 }) has no effect. Consider this a first beta version and use it with great care, particularly because of the stateless nature of the cluster. For now, I recommend manually validating that this cluster type gives results identical to those from the classical parallel::makeCluster() cluster type.

Usage

makeClusterFuture(specs = nbrOfWorkers(), ...)

Value

Returns a parallel cluster object of class FutureCluster.

Arguments

specs: Ignored. If specified, the value should equal nbrOfWorkers() (default). A missing value corresponds to specifying nbrOfWorkers(). This argument exists only to support parallel::makeCluster(NA, type = future::FUTURE).
...: Named arguments passed to future().

Future Clusters are Stateless

Traditionally, a cluster node has a one-to-one mapping to a cluster worker process. For example, cl <- makeCluster(2, type = "PSOCK") launches two parallel worker processes in the background, where cluster node cl[[1]] maps to worker #1 and node cl[[2]] to worker #2, and that never changes through the lifespan of these workers. This one-to-one mapping allows for deterministic configuration of workers. For example, some code may assign globals with values specific to each worker, e.g. clusterEvalQ(cl[1], { a <- 3.14 }) and clusterEvalQ(cl[2], { a <- 2.71 }).
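To make the one-to-one mapping concrete, here is a minimal sketch that uses only the parallel package. The values 3.14 and 2.71 are the ones from the text; the variable name values is illustrative only.

```r
library(parallel)

# Launch two PSOCK workers; cl[[1]] and cl[[2]] stay tied to them
cl <- makeCluster(2, type = "PSOCK")

# Assign a different value of 'a' on each worker
invisible(clusterEvalQ(cl[1], { a <- 3.14 }))
invisible(clusterEvalQ(cl[2], { a <- 2.71 }))

# Each worker keeps its own 'a' between calls
values <- clusterEvalQ(cl, a)
str(values)

stopCluster(cl)
```

Because each node is bound to one persistent worker process, the second call finds the value that the first call left behind.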

In contrast, there is no one-to-one mapping between cluster nodes and parallel workers when using a future cluster, because we cannot make assumptions about where a parallel task will be processed. Where a parallel task is processed is up to the future backend to decide: some backends do this deterministically, whereas others resolve the task at the first available worker. Also, the worker processes might be transient for some future backends, i.e. they only exist for the lifespan of the parallel task and then terminate.

Because of this, one must not rely on node-specific behaviors, because that concept does not make sense with a future cluster. To protect against this, any attempt to address a subset of future cluster nodes results in an error, e.g. clusterEvalQ(cl[1], ...), clusterEvalQ(cl[1:2], ...), and clusterEvalQ(cl[2:1], ...) in the above example will all give an error.

Exceptions to the latter limitation are clusterSetRNGStream() and clusterExport(), which can be safely used with future clusters. See below for more details. If clusterEvalQ() is called, the call is ignored, and a warning is produced.
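The subset restriction can be checked with a short sketch like the one below. It assumes R >= 4.4.0 (the condition the Examples section guards on), a version of future that provides makeClusterFuture(), and a multisession backend with two workers; the variable name res is illustrative only.

```r
library(future)

plan(multisession, workers = 2)
cl <- makeClusterFuture()

# Addressing a subset of nodes is documented to signal an error;
# capture it instead of letting it stop the script
res <- tryCatch(
  parallel::clusterEvalQ(cl[1], { a <- 3.14 }),
  error = function(e) e
)
print(inherits(res, "error"))

plan(sequential)
```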

clusterSetRNGStream

parallel::clusterSetRNGStream() distributes "L'Ecuyer-CMRG" RNG streams to the cluster nodes, which record them such that the next round of futures will use them. When used, the RNG states after the futures are resolved are recorded accordingly, such that the next round of futures will in turn use those, and so on. This strategy makes sure clusterSetRNGStream() has the expected effect although futures are stateless.
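One way to check this behavior is to reseed the cluster and repeat the same call. This is a sketch under the same assumptions as the Examples section (R >= 4.4.0 and a multisession backend); the seed 42 and the names draw, y1, and y2 are illustrative only. If the streams are recorded as described, the two runs should agree.

```r
library(future)

plan(multisession, workers = 2)
cl <- makeClusterFuture()

draw <- function(x) mean(rnorm(n = x))

# First round with a fixed seed
parallel::clusterSetRNGStream(cl, iseed = 42)
y1 <- parallel::parLapply(cl, 11:13, draw)

# Reseed and repeat the same call
parallel::clusterSetRNGStream(cl, iseed = 42)
y2 <- parallel::parLapply(cl, 11:13, draw)

# Compare the two rounds
print(identical(y1, y2))

plan(sequential)
```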

clusterExport

parallel::clusterExport() assigns values to the cluster nodes. Specifically, these values are recorded and used as globals for all futures created from then on.
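A minimal sketch of exporting a value to a future cluster follows. It assumes R >= 4.4.0, a multisession backend, and that the code is run at top level so that the worker function does not carry a along in its own environment; the names a and y are illustrative only.

```r
library(future)

plan(multisession, workers = 2)
cl <- makeClusterFuture()

# Record 'a' so that futures created from now on see it as a global
a <- 3.14
parallel::clusterExport(cl, varlist = "a", envir = environment())

# The worker function refers to 'a' without defining it
y <- parallel::parLapply(cl, 1:3, function(x) x * a)
str(y)

plan(sequential)
```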

Examples

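if (getRversion() >= "4.4.0") {
library(future)

# Use background R sessions as the future backend
plan(multisession)
cl <- makeClusterFuture()

# Distribute "L'Ecuyer-CMRG" RNG streams to the cluster nodes
parallel::clusterSetRNGStream(cl)

y <- parallel::parLapply(cl, 11:13, function(x) {
  # Report which worker process handles this task
  message("Process ID: ", Sys.getpid())
  mean(rnorm(n = x))
})
str(y)

# Switch back to sequential processing
plan(sequential)
}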
