Similar to base::lapply(), but designed for file-backed matrices. The
function brings chunks of an object into physical memory by taking subsets,
and applies a function to them. If nCores is greater than 1, the function
will be applied in parallel using parallel::mclapply(). In that case, the
subsets of the object are taken on the workers.
chunkedMap(X, FUN, i = seq_len(nrow(X)), j = seq_len(ncol(X)),
  chunkBy = 2L, chunkSize = 5000L,
  nCores = getOption("mc.cores", 2L), verbose = FALSE, ...)
Arguments:

FUN: The function to be applied to each chunk.

i: Indicates which rows of X should be used. Can be integer, boolean, or
character. By default, all rows are used.

j: Indicates which columns of X should be used. Can be integer, boolean, or
character. By default, all columns are used.

chunkBy: Whether to extract chunks by rows (1) or by columns (2). Defaults
to columns (2).

chunkSize: The number of rows or columns of X that are brought into
physical memory for processing per core. If NULL, all elements in i or j
are used. Defaults to 5000.

nCores: The number of cores (passed to parallel::mclapply()). Defaults to
the number of cores as detected by parallel::detectCores().

verbose: Whether progress updates will be posted. Defaults to FALSE.

...: Additional arguments to be passed to the base::apply()-like function.
Functions with the chunkSize parameter work best with file-backed matrices
such as BEDMatrix::BEDMatrix objects. To avoid loading the whole,
potentially very large matrix into memory, these functions load chunks of
the file-backed matrix into memory and perform the operations on one chunk
at a time. The size of the chunks is determined by the chunkSize parameter.
Care must be taken not to set chunkSize too high, to avoid running out of
memory, particularly when combined with parallel computing.
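As a sketch of how chunkSize and chunkBy interact, the following processes
the package's bundled example data in small column-wise chunks (assuming,
as the lapply() analogy suggests, that chunkedMap() returns a list with
one element per chunk):

  # Load the bundled example data
  bg <- BGData:::loadExample()
  # Column sums, computed over column-wise chunks of 500 markers each
  res <- chunkedMap(X = bg@geno, FUN = colSums,
                    chunkBy = 2L, chunkSize = 500L, nCores = 1L)
  # res is a list with one element per chunk; aggregate it yourself, e.g.:
  allColSums <- unlist(res)

Smaller chunkSize values lower peak memory use at the cost of more subset
operations on the file-backed matrix.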
Functions with the nCores, i, and j parameters provide capabilities for
both parallel and distributed computing.

For parallel computing, nCores determines the number of cores the code is
run on. Memory usage can be an issue for higher values of nCores as R is
not particularly memory-efficient. As a rule of thumb, at least around
(nCores * object_size(chunk)) + object_size(result) MB of total memory
will be needed for operations on file-backed matrices, not including
potential copies of your data that might be created (for example,
stats::lsfit() runs cbind(1, X)). i and j can be used to include or
exclude certain rows or columns. Internally, the parallel::mclapply()
function is used; therefore, parallel computing will not work on Windows
machines.
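For instance, i and j can restrict the computation to a subset of the
matrix before chunking, and nCores spreads the chunks over cores (a sketch
using the package's bundled example data; run on a non-Windows machine for
nCores > 1):

  bg <- BGData:::loadExample()
  # Column sums for the first 100 markers only, on 2 cores
  chunkedMap(X = bg@geno, FUN = colSums, j = 1:100, nCores = 2L)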
For distributed computing, i and j determine the subset of the input
matrix that the code runs on. In an HPC environment, this can be used not
just to include or exclude certain rows or columns, but also to partition
the task among many nodes rather than cores. Scheduler-specific code and
code to aggregate the results need to be written by the user. It is
recommended to set nCores to 1, as nodes are often cheaper than cores.
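A minimal sketch of such a partitioning under a SLURM-style array job (the
SLURM_ARRAY_TASK_ID environment variable, the number of tasks, and the
chunk arithmetic below are assumptions for illustration, not part of the
package):

  # Each array task processes its own slice of columns, with nCores = 1
  taskId <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", "1"))
  nTasks <- 10L
  bg <- BGData:::loadExample()
  p <- ncol(bg@geno)
  sliceSize <- ceiling(p / nTasks)
  cols <- seq.int(from = (taskId - 1L) * sliceSize + 1L,
                  to = min(taskId * sliceSize, p))
  res <- chunkedMap(X = bg@geno, FUN = colSums, j = cols, nCores = 1L)
  saveRDS(res, paste0("result_", taskId, ".rds"))

The per-task .rds files then need to be aggregated in a separate,
user-written step once all array tasks have finished.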
# NOT RUN {
# Restrict number of cores to 1 on Windows
if (.Platform$OS.type == "windows") {
  options(mc.cores = 1)
}

# Load example data
bg <- BGData:::loadExample()

# Compute column sums of each chunk
chunkedMap(X = bg@geno, FUN = colSums)
# }