scapply: Single-cell apply a function to a matrix split by a factor

Description

Workhorse function designed to handle large scRNA-Seq gene expression matrices such as embedded Seurat matrices, and apply a function to columns of the matrix split as a ragged array by an index factor, similar to tapply(), by() or aggregate(). Note that here the index is applied to columns as these represent cells in the single-cell format, rather than rows as in aggregate(). Very large matrices are handled by slicing rows into blocks to avoid excess memory requirements.
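
As a rough illustration of the column-wise split described above, the following base R sketch groups the columns of a small matrix by a factor and applies a function to each sub-block. It is not the package's implementation and ignores row slicing; the names m, index and col_groups are hypothetical.

```r
# Minimal base R sketch (assumption: not how scapply() is implemented internally)
set.seed(1)
m <- matrix(sample(0:100, 60, replace = TRUE), nrow = 6,
            dimnames = list(paste0("gene", 1:6), paste0("cell", 1:10)))
index <- factor(rep(c("a", "b"), each = 5))

# Split the column positions by factor level (cells are columns)
col_groups <- split(seq_len(ncol(m)), index)

# Apply a function to each column sub-block; the result is a list with one
# element per level of the index
res <- lapply(col_groups, function(j) rowMeans(m[, j, drop = FALSE]))
str(res)
```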

Usage

scapply(
  x,
  INDEX,
  FUN,
  combine = NULL,
  combine2 = "c",
  progress = TRUE,
  sliceMem = 16,
  cores = 1L,
  ...
)

Value

By default returns a list, unless combine is specified, in which case the returned data type depends on the functions specified by FUN and combine.

Arguments

x: matrix, sparse matrix or DelayedMatrix of raw counts with genes in rows and cells in columns.
INDEX: a factor, or a vector that can be coerced to a factor, whose length matches the number of columns in x. NA values are tolerated and the matching columns in x are skipped.
FUN: Function to be applied to each subblock of the matrix.
combine: A function or a name of a function to apply to the list output to bind the final results together, e.g. 'cbind' or 'rbind' to return a matrix, or 'unlist' to return a vector.
combine2: A function or a name of a function to combine results after slicing. As the function is usually applied to blocks of 30000 genes or so, the result is usually a vector with one element per gene. Hence 'c' is the default function for combining vectors into a single longer vector. However, if each gene returns several results (e.g. a vector or data frame), then combine2 could be set to 'rbind'.
progress: Logical, whether to show progress.
sliceMem: Maximum amount of memory in GB to allow for each subsetted count matrix object. When x is subsetted by each cell subclass, if the amount of memory would be above sliceMem then slicing is activated and the subsetted count matrix is divided into chunks and processed separately. The upper limit for sliceMem is just under 17.2 GB (2^34 / 1e9). At this level the subsetted matrix breaches the long vector limit (>2^31 elements).
cores: Integer, number of cores to use for parallelisation using mclapply(). Parallelisation is not available on Windows. Warning: parallelisation increases the memory requirement by multiples of sliceMem.
...: Optional arguments passed to FUN.
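
The arithmetic behind the sliceMem limit can be checked directly. The sketch below assumes 8 bytes per element (a dense double matrix), which is what the 2^34 / 1e9 figure implies; the example dimensions and the way the number of slices is derived are hypothetical and not taken from the package source.

```r
# Long vector limit expressed in GB, assuming 8 bytes per element
max_elements <- 2^31
bytes_per_element <- 8
limit_gb <- max_elements * bytes_per_element / 1e9
limit_gb

# Hypothetical subsetted block: 30000 genes x 100000 cells
n_genes <- 30000
n_cells <- 1e5
size_gb <- n_genes * n_cells * bytes_per_element / 1e9

# Assumed slicing rule: enough row blocks to keep each one under sliceMem
sliceMem <- 16
n_slices <- ceiling(size_gb / sliceMem)
rows_per_slice <- ceiling(n_genes / n_slices)
c(size_gb = size_gb, n_slices = n_slices, rows_per_slice = rows_per_slice)
```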

Author

Myles Lewis

Details

The limit on sliceMem arises because the number of elements manipulated in each block must be kept below the long vector limit of 2^31 (around 2.1e9). Increasing cores requires substantial amounts of spare RAM. combine works in a similar way to .combine in foreach(); it works across the levels in INDEX. combine2 is nested and works across slices of genes (an inner loop), so it is only invoked if slicing occurs, that is, when a matrix has a larger memory footprint than sliceMem.
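
The nesting of combine (outer loop over levels of INDEX) and combine2 (inner loop over row slices) can be sketched in base R as follows. This is an illustrative sketch only, not the package's code: sketch_scapply and its n_slices argument are hypothetical, and slicing is forced by a fixed number of row blocks rather than by a memory threshold.

```r
# Hypothetical sketch of the nested combine / combine2 logic (base R only)
sketch_scapply <- function(x, index, FUN, combine = NULL, combine2 = "c",
                           n_slices = 1L, ...) {
  index <- factor(index)
  n <- nrow(x)
  # Assign each row (gene) to one of n_slices contiguous blocks
  slice_id <- rep(seq_len(n_slices), each = ceiling(n / n_slices),
                  length.out = n)
  row_blocks <- split(seq_len(n), slice_id)
  # Outer loop: one result per level of the index
  out <- lapply(levels(index), function(lev) {
    cols <- which(index == lev)  # which() drops NA, so those cells are skipped
    # Inner loop: apply FUN to each row block of the column subset
    parts <- lapply(row_blocks, function(i) FUN(x[i, cols, drop = FALSE], ...))
    # combine2 is only needed when there is more than one slice
    if (length(parts) == 1L) parts[[1]] else do.call(combine2, unname(parts))
  })
  names(out) <- levels(index)
  # combine binds the per-level results; without it a list is returned
  if (is.null(combine)) out else do.call(combine, out)
}

set.seed(1)
m <- matrix(sample(0:100, 1000, replace = TRUE), nrow = 10,
            dimnames = list(paste0("gene", 1:10), NULL))
cell_index <- sample(c(letters[1:5], NA), 100, replace = TRUE)
f <- function(x) rowMeans(log2(x + 1))

whole  <- sketch_scapply(m, cell_index, f, combine = "cbind")
sliced <- sketch_scapply(m, cell_index, f, combine = "cbind", n_slices = 2)
all.equal(whole, sliced)
```

Because the function in this example returns one value per gene, joining the row slices with 'c' reproduces the unsliced result; a function returning several values per gene would need combine2 = "rbind" instead.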

Examples

# The two calls below are equivalent: scmean() and scapply() with a
# log2 row-mean function combined by cbind
m <- matrix(sample(0:100, 1000, replace = TRUE), nrow = 10)
cell_index <- sample(letters[1:5], 100, replace = TRUE)
o <- scmean(m, cell_index)
o2 <- scapply(m, cell_index, function(x) rowMeans(log2(x + 1)),
              combine = "cbind")
identical(o, o2)
