windowCounts: Count reads overlapping each window

Description

Count the number of extended reads overlapping a sliding window at spaced positions across the genome.

Usage

windowCounts(bam.files, spacing=50, width=spacing, ext=100, shift=0,
	filter=10, bin=FALSE, param=readParam())

Arguments

bam.files

a character vector containing paths to sorted and indexed BAM files

spacing

an integer scalar specifying the distance between consecutive windows

width

an integer scalar specifying the width of the window

ext

an integer scalar or vector, or a list of two integer scalars/vectors, containing the average length(s) of the sequenced fragments in each library

shift

an integer scalar specifying how much the start of each window should be shifted to the left

filter

an integer scalar for the minimum count sum across libraries for each window

bin

a logical scalar indicating whether binning should be performed

param

a readParam object containing read extraction parameters, or a list of such objects (one for each BAM file)

Value

A RangedSummarizedExperiment object is returned containing one integer matrix. Each entry of the matrix contains the count for each library (column) at each window (row). The coordinates of each window are stored as the rowRanges. The total number of reads in each library are stored as totals in the colData, along with the read extension length and param used in each library. Other window counting parameters (e.g., spacing, width) are stored in the metadata.

Defining the sliding windows

A window is defined as a genomic interval of size equal to width. The value of width can be interpreted as the width of the contact area between the DNA and protein. In practical terms, it determines the spatial resolution of the analysis. Larger windows count reads over a larger region which results in larger counts. This results in greater detection power at the cost of resolution. By default, the first window on a chromosome starts at base position 1. This can be shifted to the left by specifying an appropriate value for shift. New windows are found by sliding the current window to the right by the specified spacing. Increasing spacing will reduce the frequency at which counts are extracted from the genome. This results in some loss of resolution but it may be necessary when machine memory is limited. If bin is set, settings are internally adjusted so that all reads are counted into non-overlapping adjacent bins of size width. Specifically, spacing is set to width and filter is capped at a maximum value of 1 (empty bins can be retained with filter=0). Only the 5' end of each read or the midpoint of each fragment (for paired-end data) is used in counting.

Read extraction and counting

Read extraction from the BAM files is governed by the param argument. Specifying a single readParam object will use the same extraction parameters for all files. Different parameters can also be used for each file by specifying a list of readParam objects. However, users should take care with library-specific parameters, lest spurious differences be introduced between libraries. Fragments are inferred from reads by directional extension (single-end; see below) or by identifying proper pairs (paired-end; see readParam for more details). The number of fragments overlapping the window for each library is then counted for each window position. Windows will be removed if the count sum across all libraries is below filter. This reduces the memory footprint of the output by not returning empty or near-empty windows, which are usually uninteresting anyway. The strandedness of the output rowRanges is set based on the strand(s) from which the reads are extracted and counted. This is determined by the value of the forward slot in the input param object.

Elaborating on directional extension

For single-end reads, directional extension is performed whereby each read is extended from its 3' end to the average fragment length, i.e., ext. This obtains a rough estimate of the interval of the fragment from which the read was derived. It is particularly useful for TF data, where extension specifically increases the coverage of peaks that exhibit strand bimodality. Substantially different fragment lengths between libraries can be accommodated by supplying a vector to ext, where each entry represents the extension length for the corresponding library. If any value of ext is set to NA, the read length is used as the fragment length in that library. However, different fragment lengths will result in different peak widths between libraries. This may result in the detection of irrelevant differences corresponding to these differences in widths. To avoid this, fragment lengths in all libraries can be scaled to final.ext. For a bimodal peak, scaling effectively aligns the subpeaks on a given strand across all libraries to a common location. This removes the most obvious differences in widths. The value of final.ext can be specified by passing a list (or equivalently subsettable object) of integer vectors in the ext argument. The library-specific fragment lengths are taken from the first element of the list, while the final fragment length is defined from the second element. To this end, the second element should be either a scalar or a vector containing a single duplicated value. If final.ext is set to NA, no rescaling is performed from the library-specific fragment lengths.

Comments on ext for paired-end data

Directional extension is not performed for paired-end data, so the values in ext are not used directly. Hwoever, rescaling can still be performed to standardize fragment lengths across libraries, by resizing each fragment from its midpoint. This will use the second element of ext as final.ext, if ext is specified as a list of length 2. On a similar note, some downstream functions will use the extension length in the output colData as the average fragment length. Thus, to maintain compatibility, the extension length in the output colData is set to the average of the inferred fragment lengths for valid read pairs. These values will not be used in windowCounts, but instead, in functions like getWidths.

Examples

Run this code

# A low filter is only used here as the examples have very few reads.
bamFiles <- system.file("exdata", c("rep1.bam", "rep2.bam"), package="csaw")
windowCounts(bamFiles, filter=1)
windowCounts(bamFiles, width=100, filter=1)
windowCounts(bamFiles, ext=c(50, 100), spacing=100, filter=1)
windowCounts(bamFiles, ext=DataFrame(c(50, 100), 80), width=100)

# Loading PE data.
bamFile <- system.file("exdata", "pet.bam", package="csaw")
windowCounts(bamFile, param=readParam(pe="both"), filter=1)
windowCounts(bamFile, param=readParam(pe="first"), filter=1)
windowCounts(bamFile, param=readParam(max.frag=100, pe="both"), filter=1)
windowCounts(bamFile, param=readParam(max.frag=100, pe="both", restrict="chrA"), filter=1)

Run the code above in your browser using DataLab