Workhorse function designed to handle large scRNA-Seq gene expression
matrices such as embedded Seurat matrices, and apply a function to columns of
the matrix split as a ragged array by an index factor, similar to tapply(),
by() or aggregate(). Note that here the index is applied to columns as
these represent cells in the single-cell format, rather than rows as in
aggregate(). Very large matrices are handled by slicing rows into blocks to
avoid excess memory requirements.
scapply(
  x,
  INDEX,
  FUN,
  combine = NULL,
  combine2 = "c",
  progress = TRUE,
  sliceMem = 16,
  cores = 1L,
  ...
)
By default returns a list, unless combine is invoked in which case
the returned data type will depend on the functions specified by FUN and
combine.
x
Matrix, sparse matrix or DelayedMatrix of raw counts with genes in rows and cells in columns.
INDEX
A factor, or a vector that can be coerced to a factor, whose length matches
the number of columns in x. NA values are tolerated and the matching columns
in x are skipped.
FUN
Function to be applied to each subblock of the matrix.
combine
A function or a name of a function to apply to the list output to bind the final results together, e.g. 'cbind' or 'rbind' to return a matrix, or 'unlist' to return a vector.
combine2
A function or a name of a function to combine results after
slicing. As the function is usually applied to blocks of 30000 genes or so,
the result is usually a vector with an element per gene. Hence 'c' is the
default function for combining vectors into a single longer vector. However,
if each gene returns a number of results (e.g. a vector or dataframe), then
combine2 could be set to 'rbind'.
progress
Logical, whether to show progress.
sliceMem
Maximum amount of memory in GB to allow for each subsetted count
matrix object. When x is subsetted by each cell subclass, if the amount
of memory would be above sliceMem then slicing is activated and the
subsetted count matrix is divided into chunks and processed separately.
The limit is just under 17.2 GB (2^34 / 1e9). At this level the subsetted
matrix breaches the long vector limit (>2^31 elements).
cores
Integer, number of cores to use for parallelisation using
mclapply(). Parallelisation is not available on Windows. Warning:
parallelisation increases the memory requirement by multiples of
sliceMem.
...
Optional arguments passed to FUN.
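The sliceMem limit quoted above can be checked with a little arithmetic. The sketch below is illustrative only: it assumes a dense matrix stored as doubles (8 bytes per element), and n_slices() is a hypothetical helper, not a function from the package. The package's own memory estimate, particularly for sparse or DelayedMatrix input, may differ.

```r
# Illustrative arithmetic only; assumes dense storage at 8 bytes per element.
bytes_per_element <- 8
long_vector_limit <- 2^31                       # number of elements

# Memory footprint in GB at the long vector limit: 2^34 / 1e9
limit_gb <- long_vector_limit * bytes_per_element / 1e9
limit_gb

# Hypothetical helper: number of row slices needed so that each block of a
# dense n_genes x n_cells matrix stays at or below slice_mem GB.
n_slices <- function(n_genes, n_cells, slice_mem = 16) {
  total_gb <- n_genes * n_cells * bytes_per_element / 1e9
  max(1, ceiling(total_gb / slice_mem))
}

n_slices(n_genes = 30000, n_cells = 50000)      # 12 GB dense, one block
n_slices(n_genes = 30000, n_cells = 200000)     # 48 GB dense, sliced
```

Under these assumptions, running with several cores would hold several such blocks in memory at once, which is consistent with the warning given for cores.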
Myles Lewis
The limit on sliceMem is that the number of elements manipulated in each
block must be kept below the long vector limit of 2^31 (around 2e9).
Increasing cores requires substantial amounts of spare RAM. combine works
in a similar way to .combine in foreach(); it works across the levels in
INDEX. combine2 is nested and works across slices of genes (an inner
loop), so it is only invoked if slicing occurs, which happens when a matrix
has a larger memory footprint than sliceMem.
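The nesting of combine and combine2 described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: sketch_apply() and its rows_per_slice argument are hypothetical names, and slicing here is driven by a fixed number of rows rather than by a memory estimate.

```r
# Base R sketch of the two-level combine; not the package's implementation.
sketch_apply <- function(x, index, fun, combine = NULL, combine2 = "c",
                         rows_per_slice = nrow(x)) {
  index <- factor(index)
  # Outer loop: one group of columns (cells) per level of index.
  # split() drops positions where index is NA.
  col_groups <- split(seq_len(ncol(x)), index)
  # Inner loop: consecutive blocks of rows (genes).
  row_slices <- split(seq_len(nrow(x)),
                      ceiling(seq_len(nrow(x)) / rows_per_slice))
  out <- lapply(col_groups, function(cols) {
    parts <- lapply(row_slices, function(rows) {
      fun(x[rows, cols, drop = FALSE])
    })
    # combine2 joins the per-slice results for one level, only when sliced
    if (length(parts) > 1) do.call(combine2, unname(parts)) else parts[[1]]
  })
  # combine joins the per-level results; NULL leaves them as a list
  if (is.null(combine)) out else do.call(combine, out)
}

set.seed(1)
m <- matrix(sample(0:100, 1000, replace = TRUE), nrow = 10,
            dimnames = list(paste0("gene", 1:10), NULL))
cell_index <- sample(letters[1:5], 100, replace = TRUE)

whole  <- sketch_apply(m, cell_index, rowMeans, combine = "cbind")
sliced <- sketch_apply(m, cell_index, rowMeans, combine = "cbind",
                       rows_per_slice = 4)
all.equal(whole, sliced)
```

Because rowMeans() returns one value per gene, joining the slices with "c" rebuilds the full per-gene vector, so the sliced and unsliced runs should agree. A function returning several values per gene would need "rbind" as the inner combiner instead.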
scmean() which applies a fixed function logmean() in a similar
manner, and slapply() which applies a function to a big matrix with
slicing but without splitting by an index factor.
# scmean() and scapply() with an equivalent function give the same result
m <- matrix(sample(0:100, 1000, replace = TRUE), nrow = 10)
cell_index <- sample(letters[1:5], 100, replace = TRUE)
o <- scmean(m, cell_index)
o2 <- scapply(m, cell_index, function(x) rowMeans(log2(x + 1)),
              combine = "cbind")
identical(o, o2)
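The column-wise split and the skipping of NA entries described for INDEX can also be illustrated with base R alone. This small example uses only split(), lapply() and do.call(), and makes no claim about how scapply() is implemented internally.

```r
# Base R illustration of a column-wise split by a factor containing NA.
m <- matrix(1:12, nrow = 2, dimnames = list(c("g1", "g2"), NULL))
idx <- factor(c("a", "b", NA, "a", "b", "a"))

# split() drops positions where the factor is NA, so column 3 is skipped
col_groups <- split(seq_len(ncol(m)), idx)

# One result per level of the factor: a list, as in the default return
res_list <- lapply(col_groups, function(j) rowSums(m[, j, drop = FALSE]))

# Binding the list column-wise, analogous to combine = "cbind"
res_mat <- do.call(cbind, res_list)
res_mat
```

Here res_mat has one row per gene and one column per level of idx, which mirrors the genes-in-rows layout used throughout.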