BGData (version 2.1.0)

chunkedMap: Applies a Function on Each Chunk of a File-Backed Matrix.

Description

Similar to base::lapply(), but designed for file-backed matrices. The function brings chunks of an object into physical memory by taking subsets, and applies a function to each of them. If nCores is greater than 1, the function will be applied in parallel using parallel::mclapply(). In that case, the subsets of the object are taken within the worker processes.

Usage

chunkedMap(X, FUN, i = seq_len(nrow(X)), j = seq_len(ncol(X)),
  chunkBy = 2L, chunkSize = 5000L, nCores = getOption("mc.cores",
  2L), verbose = FALSE, ...)

Arguments

X

A file-backed matrix, typically the @geno slot of a BGData object.

FUN

The function to be applied on each chunk.

i

Indicates which rows of X should be used. Can be integer, logical, or character. By default, all rows are used.

j

Indicates which columns of X should be used. Can be integer, logical, or character. By default, all columns are used.

chunkBy

Whether to extract chunks by rows (1) or by columns (2). Defaults to columns (2).

chunkSize

The number of rows or columns of X that are brought into physical memory for processing per core. If NULL, all elements in i or j are used. Defaults to 5000.

nCores

The number of cores (passed to parallel::mclapply()). Defaults to the value of the mc.cores option, or 2 if that option is not set (getOption("mc.cores", 2L)).

verbose

Whether progress updates will be posted. Defaults to FALSE.

...

Additional arguments to be passed to FUN.

File-backed matrices

Functions with the chunkSize parameter work best with file-backed matrices such as BEDMatrix::BEDMatrix objects. To avoid loading the whole, potentially very large matrix into memory, these functions will load chunks of the file-backed matrix into memory and perform the operations on one chunk at a time. The size of the chunks is determined by the chunkSize parameter. Care must be taken to not set chunkSize too high to avoid memory shortage, particularly when combined with parallel computing.
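The chunk arithmetic described above can be sketched in base R. This toy example uses an ordinary in-memory matrix in place of a file-backed one; only chunkSize and the column-wise traversal mirror chunkedMap, and all other names are illustrative.

```r
# Toy illustration of column-wise chunking (chunkBy = 2).
# An ordinary matrix stands in for a file-backed one.
X <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)
chunkSize <- 7

# Start index of each chunk along the columns
starts <- seq(1, ncol(X), by = chunkSize)

res <- lapply(starts, function(start) {
    end <- min(start + chunkSize - 1, ncol(X))
    chunk <- X[, start:end, drop = FALSE]  # only this chunk is in memory
    colSums(chunk)                         # FUN applied to the chunk
})

# Concatenating the per-chunk results reproduces colSums(X)
all.equal(unlist(res), colSums(X), check.attributes = FALSE)
```

With a real file-backed matrix, the subset operation `X[, start:end]` is what pulls the chunk off disk into physical memory.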

Multi-level parallelism

Functions with the nCores, i, and j parameters provide capabilities for both parallel and distributed computing.

For parallel computing, nCores determines the number of cores the code is run on. Memory usage can be an issue for higher values of nCores as R is not particularly memory-efficient. As a rule of thumb, at least around (nCores * object_size(chunk)) + object_size(result) MB of total memory will be needed for operations on file-backed matrices, not including potential copies of your data that might be created (for example stats::lsfit() runs cbind(1, X)). i and j can be used to include or exclude certain rows or columns. Internally, the parallel::mclapply() function is used and therefore parallel computing will not work on Windows machines.
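The interplay of nCores, i/j, and chunkSize can be sketched with parallel::mclapply() directly; this is a simplified stand-in for what chunkedMap does internally, using a plain matrix and illustrative variable names.

```r
library(parallel)

# mclapply() forks on Unix-alikes; on Windows mc.cores must be 1
nCores <- if (.Platform$OS.type == "windows") 1L else 2L

X <- matrix(rnorm(1000 * 40), nrow = 1000, ncol = 40)
j <- 1:30        # include only the first 30 columns (like the j argument)
chunkSize <- 10
starts <- seq(1, length(j), by = chunkSize)

# Each worker subsets and processes its own chunk, so at most
# nCores chunks are in physical memory at the same time
res <- mclapply(starts, function(start) {
    idx <- j[start:min(start + chunkSize - 1, length(j))]
    colSums(X[, idx, drop = FALSE])
}, mc.cores = nCores)

length(res)  # one result per chunk
```

Note how peak memory scales with nCores * object_size(chunk): lowering chunkSize is the usual remedy when raising nCores exhausts memory.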

For distributed computing, i and j determine the subset of the input matrix that the code runs on. In an HPC environment, this can be used not just to include or exclude certain rows or columns, but also to partition the task among many nodes rather than cores. Scheduler-specific code and code to aggregate the results need to be written by the user. It is recommended to set nCores to 1 as nodes are often cheaper than cores.
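One way to write the scheduler-specific partitioning code is sketched below. SLURM_ARRAY_TASK_ID is one scheduler's convention (a SLURM job array) and is used here purely for illustration; the helper names are not part of BGData.

```r
# Sketch: partition rows among HPC array tasks, one task per node.
nTasks <- 4
taskId <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1"))

nRows <- 1000  # nrow(bg@geno) in a real run
# Split the row indices into nTasks roughly equal, contiguous parts
rowsPerTask <- split(seq_len(nRows), cut(seq_len(nRows), nTasks, labels = FALSE))
i <- rowsPerTask[[taskId]]

# Each task would then run something like (nCores = 1, since the
# parallelism comes from the scheduler, not from forked workers):
# chunkedMap(X = bg@geno, FUN = colSums, i = i, nCores = 1)
range(i)  # this task's row range
```

The per-task results would then be collected and aggregated by user code, e.g. by saving each task's output to a file keyed by taskId.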

Examples

# Restrict number of cores to 1 on Windows
if (.Platform$OS.type == "windows") {
    options(mc.cores = 1)
}

# Load example data
bg <- BGData:::loadExample()

# Compute column sums of each chunk
chunkedMap(X = bg@geno, FUN = colSums)
