reduceByYield: Iterate through a BAM (or other) file, reducing output to a single result.

Description

Rsamtools files can be created with a ‘yieldSize’ argument that influences the number of records (chunk size) input at one time (see, e.g., BamFile). reduceByYield iterates through the file, processing each chunk and reducing it with previously input chunks. This is a memory-efficient way to process large data files, especially when the final result fits in memory.
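The chunk-and-reduce pattern can be sketched in base R without any BAM file. The following is a minimal illustration under stated assumptions, not the GenomicFiles implementation: make_chunker is a hypothetical helper standing in for a file opened with a yieldSize, and the loop follows the YIELD, DONE, MAP, REDUCE sequence described above.

```r
## Hypothetical stand-in for a file opened with a yieldSize: each call to
## the returned function hands back the next chunk, or an empty vector
## once the data are exhausted.
make_chunker <- function(data, yieldSize) {
    pos <- 0L
    function() {
        if (pos >= length(data))
            return(data[0L])
        idx <- seq(pos + 1L, min(pos + yieldSize, length(data)))
        pos <<- max(idx)
        data[idx]
    }
}

yield_next <- make_chunker(1:10, yieldSize = 4L)
result <- 0L                           # running reduction
repeat {
    chunk <- yield_next()              # YIELD
    if (length(chunk) == 0L) break     # DONE
    result <- result + sum(chunk)      # MAP (sum), then REDUCE (`+`)
}
result
```

Only one chunk and the running result are held in memory at any time, which is the point of the pattern.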

Usage

reduceByYield(X, YIELD, MAP, REDUCE, DONE = function(x) is.null(x) || length(x) == 0L, ..., parallel = FALSE, iterate = TRUE, init)

Arguments

X

A BamFile instance (or other class for which isOpen, open, close methods are defined, and which supports extraction of sequential chunks).

YIELD

A function name or user-supplied function that operates on X to produce a VALUE that is passed to DONE and MAP. Generally YIELD will be a data extractor such as readGAlignments, scanBam, yield, etc. and VALUE is a chunk of data.

YIELD(X)

MAP

A function of one or more arguments that operates on the chunk of data from YIELD.

MAP(VALUE, ...)

REDUCE

A function of one (iterate=FALSE) or two (iterate=TRUE) arguments, returning the reduction (e.g., addition) of the argument(s). If missing, REDUCE is c (when iterate=TRUE) or identity (when iterate=FALSE).

REDUCE(all.mapped, ...) ## iterate=FALSE
REDUCE(x, y, ...) ## iterate=TRUE

DONE

A function of one argument, the VALUE output of the most recent call to YIELD(X, ...). If missing, DONE is function(x) is.null(x) || length(x) == 0L, as shown in Usage.
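A short base-R sketch of when a custom DONE may be needed. The default predicate is copied from the Usage section. The list-of-fields chunk and the DONE_fields function are hypothetical, chosen only to show that a container can have non-zero length even when it holds no records.

```r
## Default stopping rule, as given in Usage
DONE_default <- function(x) is.null(x) || length(x) == 0L

DONE_default(NULL)
DONE_default(integer(0))
DONE_default(1:3)

## Hypothetical chunk shape: a list of fields, each with zero records.
## Its length is the number of fields (2), so the default rule does not stop.
empty_chunk <- list(qname = character(0), pos = integer(0))
DONE_default(empty_chunk)

## Hypothetical custom rule: stop when the first field has no records
DONE_fields <- function(x) is.null(x) || length(x[[1L]]) == 0L
DONE_fields(empty_chunk)
```

If YIELD returns such a container, a predicate like DONE_fields can be supplied so that iteration ends at the end of the file.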

...

Additional arguments, passed to MAP.

iterate

logical(1) determines whether the call to REDUCE is iterative (iterate=TRUE) or cumulative (iterate=FALSE).

parallel

logical(1) determines if the MAP step is run in parallel.

init

(Optional) Initial value used for REDUCE when iterate=TRUE.

Value

The return value is the value returned by the final invocation of REDUCE, or init if provided and no data were yielded, or list() if init is missing and no data were yielded.
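The three return cases can be illustrated with a small base-R sketch. reduce_sketch is a hypothetical helper that mimics the documented rules for iterate=TRUE; it is not the GenomicFiles code, and it takes a plain list of chunks rather than a file.

```r
## Hypothetical helper mimicking the documented return rules
reduce_sketch <- function(chunks, MAP, REDUCE, init) {
    have_result <- !missing(init)
    result <- if (have_result) init else list()
    for (chunk in chunks) {
        if (is.null(chunk) || length(chunk) == 0L) break   # DONE
        value <- MAP(chunk)
        ## The first mapped value seeds the result when init is missing
        result <- if (have_result) REDUCE(result, value) else value
        have_result <- TRUE
    }
    result
}

with_data <- reduce_sketch(list(1:3, 4:6), sum, `+`)   # final REDUCE value
only_init <- reduce_sketch(list(), sum, `+`, init = 0L) # init is returned
no_data   <- reduce_sketch(list(), sum, `+`)            # list() is returned
```

Supplying init therefore gives callers a result of a predictable type even for an empty file.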

Details

When iterate=TRUE, REDUCE is initially invoked with either the init value and the value of the first call to MAP or, if init is missing, the values of the first two calls to MAP.
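The pairwise behaviour for iterate=TRUE resembles base R's Reduce(). In this sketch the mapped list holds hypothetical MAP outputs, one per chunk, and Reduce() is used only as an analogy for the order of REDUCE calls.

```r
## Hypothetical MAP outputs, one per chunk
mapped <- list(c(A = 1, C = 2), c(A = 3, C = 0), c(A = 5, C = 1))

## Without init: the first REDUCE call combines the first two MAP results
no_init <- Reduce(`+`, mapped)

## With init: the first REDUCE call combines init and the first MAP result
with_init <- Reduce(`+`, mapped, c(A = 100, C = 100))
```

Because each call sees only the running result and one new value, REDUCE should return an object of the same shape as its inputs.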

When iterate=FALSE, REDUCE is invoked with a list containing a list with as many elements as there were calls to MAP. Each element is the result of an invocation of MAP.
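For iterate=FALSE, REDUCE sees every MAP result at once, so it can build a structure that pairwise reduction cannot. A minimal base-R sketch, assuming REDUCE receives a plain list with one element per chunk; mapped and REDUCE_all are hypothetical names.

```r
## Hypothetical MAP outputs, one per chunk
mapped <- list(c(A = 1, C = 2), c(A = 3, C = 0), c(A = 5, C = 1))

## One-argument REDUCE: stack the per-chunk results into a matrix
REDUCE_all <- function(all.mapped, ...) do.call(rbind, all.mapped)

per_chunk <- REDUCE_all(mapped)   # one row per chunk
totals <- colSums(per_chunk)      # same totals as pairwise `+`

## The documented default for iterate=FALSE is identity, which would
## return the list of mapped values unchanged
unchanged <- identity(mapped)
```

The cost is that all mapped values are held in memory until the single REDUCE call, so this mode suits MAP results that are small relative to the chunks.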

Examples

if (all(require(RNAseqData.HNRNPC.bam.chr14) &&
        require(GenomicAlignments))) { 

  ## -----------------------------------------------------------------------
  ## Nucleotide frequency of mapped reads
  ## -----------------------------------------------------------------------
  
  ## In this example nucleotide frequency of mapped reads is computed
  ## for a single file. The MAP step is run in parallel and REDUCE 
  ## is iterative.
  
  fl <- system.file(package="Rsamtools", "extdata", "ex1.bam")
  bf <- BamFile(fl, yieldSize=500) ## typically, yieldSize=1e6
  
  param <- ScanBamParam(
      flag=scanBamFlag(isUnmappedQuery=FALSE),
      what="seq")
  YIELD <- function(X, ...) scanBam(X, param, ...)[[1]][['seq']]
  MAP <- function(value, ...) 
      alphabetFrequency(value, collapse=TRUE)
  REDUCE <- `+`        # add successive alphabetFrequency vectors (collapse=TRUE)
  reduceByYield(bf, YIELD, MAP, REDUCE, param=param, parallel=TRUE)
  
  ## -----------------------------------------------------------------------
  ## Coverage
  ## -----------------------------------------------------------------------
  
  ## reduceByYield() can be applied to multiple files by combining it
  ## with bplapply().
  
  ## FUN will be run on each worker; it contains the necessary arguments 
  ## to reduceByYield() as well as a call to the function itself.
  ## reduceByYield() could also be run in parallel (parallel=TRUE) 
  ## but in this example it is not.
  FUN <- function(bf) {
    library(GenomicAlignments)
    library(GenomicFiles)
    YIELD <- `readGAlignments`
    MAP <- function(value, ...) coverage(value)[["chr14"]] 
    REDUCE <- `+`
    reduceByYield(bf, YIELD, MAP, REDUCE)
  }
  
  ## BAM files are distributed across Snow workers and each worker applies
  ## reduceByYield().
  bfl <- BamFileList(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1:3])
  bplapply(bfl, FUN, BPPARAM = SnowParam(3)) 
}