parallelize.dynamic (version 0.9-1)

parallelize: Subject a function to dynamic parallelization

Description

This function executes all steps needed to perform a dynamic analysis of the parallelism of a given function: it creates objects that encapsulate code which can be executed in parallel, transfers these to an execution backend (potentially a remote target), collects the results, and resumes execution transparently.

Usage

parallelize(.f, ..., Lapply_local = rget("Lapply_local", default = FALSE),
  parallelize_wait = TRUE)

parallelize_call(.call, ..., parallelize_wait = TRUE)

Arguments

.f
Function to be parallelized, given as a function object.
...
Arguments passed to .f.
Lapply_local
Force local execution.
parallelize_wait
Force polling for completion of the computation if the backend returns asynchronously.
.call
Unevaluated call to be parallelized.

Value

The value returned is the result of computing function .f on the given arguments.

Important details

  • The package creates files in a given folder. This folder is not temporary, as it might be needed across R sessions. MD5 fingerprints of textual representations of function calls are used to create one subfolder therein per parallelize call. When parallelizing the same function several times, conflicts can be avoided by choosing a different function name for each parallelize call. For example: f0 = f1 = function(){42}; parallelize(f0); parallelize(f1);
  • For the sake of efficiency, the parallelize.dynamic package does not copy the whole workspace to all parallel jobs. Instead, a hopefully minimal environment is set up for each parallel job. This includes all parameters passed to the function in question, together with the variables defined in its closure. This leaves all functions undefined; it is therefore expected that all necessary function definitions are given in separate R files or libraries, which have to be specified in the config variable. These are then sourced/loaded into the parallel job. A way to dynamically create source files from function definitions is given in the Examples section.
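As a minimal sketch of this constraint (the helper g and function f are hypothetical; tempcodefile is used as in the Examples section):

  # g is called inside the parallelized code; its definition is NOT
  # copied into the parallel job automatically
  g = function(x) x^2;
  f = function() Sapply(1:10, g);

  # therefore g must be made available via a sourced file, e.g.
  codeFile = tempcodefile(c(g));
  # ... with codeFile then listed under sourceFiles in the backend config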

Details

Functions parallelize and parallelize_call both perform a dynamic parallelization of the computation of .f, i.e. parallelism is determined at run time. Points of potential parallelism have to be indicated by using Apply/Sapply/Lapply (collectively, the Apply functions) instead of apply/sapply/lapply in existing code. The semantics of the new functions are exactly the same as those of the originals, so that simply upper-casing these calls makes existing programs amenable to parallelization. parallelize executes the function .f, recording the number of elements passed to Apply functions. Once a given threshold (the degree of parallelization) is reached, computation is stopped and the remaining executions are done in parallel. A call to parallelize_initialize determines the precise mechanism of parallel execution and can be used to flexibly switch between different resources.
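As a sketch of this upper-casing convention (the function names colSumsSerial/colSumsPar are hypothetical; it is assumed that parallelize.dynamic is loaded and a backend has been initialized via parallelize_initialize):

  # serial version: plain sapply, always executed locally
  colSumsSerial = function(m)
    sapply(seq_len(ncol(m)), function(j) sum(m[, j]));

  # parallelizable version: identical semantics, but Sapply marks the
  # loop as a point of potential parallelism for parallelize()
  colSumsPar = function(m)
    Sapply(seq_len(ncol(m)), function(j) sum(m[, j]));

  r = parallelize(function() colSumsPar(matrix(1:12, ncol = 3)));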

See Also

Apply, Sapply, Lapply, parallelize_initialize

Examples

  # code to be parallelized
  parallel8 = function(e) log(1:e) %*% log(1:e);
  parallel2 = function(e) rep(e, e) %*% 1:e * 1:e;
  parallel1 = function(e) Lapply(rep(e, 15), parallel2);
  parallel0 = function() {
    r = sapply(Lapply(1:50, parallel1),
      function(e)sum(as.vector(unlist(e))));
    r0 = Lapply(1:49, parallel8);
    r
  }

  # create file that can be sourced containing function definitions
  # best practice is to define all needed functions in files that
  # can be sourced. The function tempcodefile allows to create a
  # temporary file with the definition of given functions
  codeFile = tempcodefile(c(parallel0, parallel1, parallel2, parallel8));

  # definitions of clusters
  Parallelize_config = list(max_depth = 5, parallel_count = 24, offline = FALSE,
  backends = list(
    snow = list(localNodes = 2, splitN = 1, sourceFiles = codeFile),
    local = list(
      path = sprintf('%s/tmp/parallelize', tempdir())
    )
  ));

  # initialize
  parallelize_initialize(Parallelize_config, backend = 'local');

  # perform parallelization
  r0 = parallelize(parallel0);
  print(r0);

  # same
  r1 = parallelize_call(parallel0());
  print(r1);

  # compare with native execution
  parallelize_initialize(backend = 'off');
  r2 = parallelize(parallel0);
  print(r2);

  # put on SNOW cluster
  parallelize_initialize(Parallelize_config, backend = 'snow');
  r3 = parallelize(parallel0);
  print(r3);

  # analyse parallelization
  parallelize_initialize(Parallelize_config, backend = 'local');
  Log.setLevel(5);
  r4 = parallelize(parallel0);
  Log.setLevel(6);
  r5 = parallelize(parallel0);

  print(sprintf('All results are the same is %s', as.character(
    all(sapply(list(r0, r1, r2, r3, r4, r5), function(l)all(l == r0)))
  )));
