ClusterMQ: send R function calls as cluster jobs

This package will allow you to send function calls as jobs on a computing cluster with a minimal interface provided by the Q function:

# load the library and create a simple function
library(clustermq)
fx = function(x) x * 2

# queue the function call on your scheduler
Q(fx, x=1:3, n_jobs=1)
# list(2,4,6)

Computations are done entirely on the network and without any temporary files on network-mounted storage, so there is no strain on the file system apart from starting up R once per job. All calculations are load-balanced, i.e. workers that get their jobs done faster will also receive more function calls to work on. This is especially useful if not all calls return after the same time, or one worker has a high load.

A full user guide is available here.

Installation

First, we need the ZeroMQ system library. Most likely, your package manager will provide this:

# You can skip this step on Windows and macOS, the rzmq binary has it
# On a computing cluster, we recommend using Conda or Linuxbrew
brew install zeromq # Linuxbrew, Homebrew on macOS
conda install zeromq # Conda
sudo apt-get install libzmq3-dev # Ubuntu
sudo yum install zeromq-devel # Fedora
pacman -S zeromq # Arch Linux

Then install the clustermq package in R (which automatically installs the rzmq package as well) from CRAN:

install.packages('clustermq')

Alternatively, you can use devtools to install directly from GitHub:

# install.packages('devtools')
devtools::install_github('mschubert/clustermq')
# devtools::install_github('mschubert/clustermq', ref="develop") # dev version

Schedulers

An HPC cluster's scheduler distributes computing jobs to available worker nodes, so clustermq interfaces with the scheduler to run its computations.

We currently support the following schedulers (either locally or via SSH):

  • LSF - should work without setup
  • SGE - should work without setup
  • SLURM - should work without setup
  • PBS/Torque - needs options(clustermq.scheduler="PBS"/"Torque")
  • via SSH - needs options(clustermq.scheduler="ssh", clustermq.ssh.host=<yourhost>)
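
As a minimal sketch, the scheduler options listed above could be set at the start of a session (or in ~/.Rprofile) before the package is loaded, since the queueing system is selected on package loading. The value "user@host" is a placeholder, not a real address:

```r
# Select the SSH scheduler; "user@host" is a placeholder for your own login node
options(
    clustermq.scheduler = "ssh",
    clustermq.ssh.host = "user@host"
)

# Check the value that will be picked up when clustermq is loaded
getOption("clustermq.scheduler")
```

For PBS or Torque, only the clustermq.scheduler option needs to be set.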

If you need specific computing environments or containers, you can activate them via the scheduler template.

Usage

The most common arguments for Q are:

  • fun - The function to call. This needs to be self-sufficient (because it will not have access to the master environment)
  • ... - All iterated arguments passed to the function. If there is more than one, all of them need to be named
  • const - A named list of non-iterated arguments passed to fun
  • export - A named list of objects to export to the worker environment

The documentation for other arguments can be accessed by typing ?Q. Examples of using const and export would be:

# adding a constant argument
fx = function(x, y) x * 2 + y
Q(fx, x=1:3, const=list(y=10), n_jobs=1)
# exporting an object to workers
fx = function(x) x * 2 + y
Q(fx, x=1:3, export=list(y=10), n_jobs=1)
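
When more than one iterated argument is passed, every one of them has to be named. The sketch below assumes the arguments are paired element-wise, as with base R's mapply, which serves here as a local reference that needs no cluster. The function name fxy is made up for illustration, and the Q and Q_rows calls are left commented out because they require a configured scheduler:

```r
# Two iterated arguments: both are named and paired element-wise
fxy = function(x, y) x * 2 + y

# Local reference using base R only (no cluster involved)
expected = mapply(fxy, x = 1:3, y = 4:6, SIMPLIFY = FALSE)
# list(6, 9, 12)

# On a configured scheduler, the corresponding call would be:
# Q(fxy, x=1:3, y=4:6, n_jobs=1)

# The same calls written as rows of a data.frame, one row per call:
# Q_rows(data.frame(x=1:3, y=4:6), fxy, n_jobs=1)
```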

clustermq can also be used as a parallel backend for foreach. As this is also used by BiocParallel, we can run those packages on the cluster as well:

library(foreach)
register_dopar_cmq(n_jobs=2, memory=1024) # accepts same arguments as `workers`
foreach(i=1:3) %dopar% sqrt(i) # this will be executed as jobs

More examples are available in the user guide.

Comparison to other packages

There are some packages that provide high-level parallelization of R function calls on a computing cluster. We compared clustermq to BatchJobs and batchtools for processing many short-running jobs, and found it to have approximately 1000x less overhead cost (details on the wiki).

In short, use clustermq if you want:

  • a one-line solution to run cluster jobs with minimal setup
  • to access cluster functions from your local RStudio via SSH
  • fast processing of many function calls without network storage I/O

Use batchtools if you:

  • want to use a mature and well-tested package
  • don't mind that arguments to every call are written to/read from disk
  • don't mind there's no load-balancing at run-time

Use Snakemake or drake if:

  • you want to design and run a workflow on HPC

Don't use batch (last updated 2013) or BatchJobs (issues with SQLite on network-mounted storage).

Monthly Downloads

1,563

Version

0.8.6

License

Apache License (== 2.0) | file LICENSE

Maintainer

Michael Schubert

Last Published

April 20th, 2025

Functions in clustermq (0.8.6)

  • .onAttach - Report queueing system on package attach if not set
  • .onLoad - Select the queueing system on package loading
  • register_dopar_cmq - Register clustermq as `foreach` parallel handler
  • host - Construct the ZeroMQ host
  • master - Master controlling the workers
  • vec_lookup - Lookup table for return types to vector NAs
  • work_chunk - Function to process a chunk of calls
  • SLURM - SLURM scheduler functions
  • worker - R worker submitted as cluster job
  • purrr_lookup - Lookup table for return types to purrr functions
  • check_args - Function to check arguments with which Q() is called
  • chunk - Subset index chunk for processing
  • workers - Creates a pool of workers
  • clustermq - Evaluate Function Calls on HPC Schedulers (LSF, SGE, SLURM)
  • cmq_foreach - clustermq foreach handler
  • summarize_result - Print a summary of errors and warnings that occurred during processing
  • ssh_proxy - SSH proxy for different schedulers
  • QSys - Class for basic queuing system functions
  • LOCAL - Placeholder for local processing
  • MULTICORE - Process on multiple cores on one machine
  • Q - Queue function calls on the cluster
  • Q_rows - Queue function calls defined by rows in a data.frame
  • SSH - SSH scheduler functions
  • LSF - LSF scheduler functions
  • SGE - SGE scheduler functions
  • bind_avail - Binds an rzmq to an available port in given range