preprocessCoverage: Transform and split the data

Description

This function takes the coverage data from loadCoverage, scales the data, does the log2 transformation, and splits it into appropriate chunks for using calculateStats.

Usage

preprocessCoverage(coverageInfo, groupInfo = NULL, cutoff = 5,
  colsubset = NULL, lowMemDir = NULL, ...)

Arguments

coverageInfo

A list containing a DataFrame --$coverage-- with the coverage data and a logical Rle --$position-- with the positions that passed the cutoff. This object is generated using loadCoverage.

groupInfo

A factor specifying the group membership of each sample. If NULL no group mean coverages are calculated. If the factor has more than one level, the first one will be used to calculate the log2 fold change in calculatePvalues.

cutoff

The base-pair level cutoff to use. It's behavior is controlled by filter.

colsubset

Optional vector of column indices of coverageInfo$coverage that denote samples you wish to include in analysis.

lowMemDir

If specified, each chunk is saved into a separate Rdata file under lowMemDir and later loaded in fstats.apply when running calculateStats and calculatePvalues. Using this option helps reduce the memory load as each fork in bplapply loads only the data needed for the chunk processing. The downside is a bit longer computation time due to input/output.

...

Arguments passed to other methods and/or advanced arguments.

Value

A list with five components. [object Object],[object Object],[object Object],[object Object],[object Object]

Details

If chunksize is NULL, then mc.cores is used to determine the chunksize. This is useful if you want to split the data so each core gets the same amount of data (up to rounding).

Computing the indexes and using those for mclapply reduces memory copying as described by Ryan Thompson and illustrated in approach #4 at http://lcolladotor.github.io/2013/11/14/Reducing-memory-overhead-when-using-mclapply

If lowMemDir is specified then $coverageProcessed is NULL and $mclapplyIndex is a vector with the chunk identifiers.

Examples

Run this code

## Split the data and transform appropriately before using calculateStats()
dataReady <- preprocessCoverage(genomeData, cutoff = 0, scalefac = 32, 
    chunksize = 1e3, colsubset = NULL, verbose = TRUE)
names(dataReady)
dataReady

Run the code above in your browser using DataLab