getClusters: Identify clusters containing high-confidence substitutions and resolve boundaries at high resolution

Description

Identifies clusters using either the mini-rank norm (MRN) algorithm (default and recommended to achieve highest sensitivity) or via a continuous wavelet transform (CWT) based approach. The former employs thresholding of background coverage differences and finds the optimal cluster boundaries by exhaustively evaluating all putative clusters using a rank-based approach. This method has higher sensitivity and an approximately 10-fold faster running time than the CWT-based cluster identification algorithm. The latter, maintained for compatibility with wavClusteR, computes the CWT on a 1 kb window of the coverage function centered at a high-confidence substitution site, and identifies cluster boundaries by extending away from peak positions.

Usage

getClusters(highConfSub, coverage, sortedBam, method = 'mrn', cores =
1, threshold, step = 1, snr = 3)

Arguments

highConfSub

GRanges object containing high-confidence substitution sites as returned by the getHighConfSub function

coverage

An Rle object containing the coverage at each genomic position as returned by a call to coverage

sortedBam

a GRanges object containing all aligned reads, including read sequence (qseq) and MD tag (MD), as returned by the readSortedBam function

method

a character, either set to "mrn" or to "cwt" to compute clusters using the mini-rank norm or the wavelet transform-based algorithm, respectively. Default is "mrn" (recommended).

cores

integer, the number of cores to be used for parallel evaluation. Default is 1.

threshold

numeric, if method = "mrn", the difference in coverage to be considered noise. If not specified, a Gaussian mixture model is used to learn a threshold from the data. Empirically, 10% of the minimum coverage required at substitutions (see argument minCov in the getHighConfSub function) might suffice to provide highly resolved clusters. However, if minCov is much lower than the median strand-specific coverage at substitutions $m$, which can be computed using summary(elementMetadata(highConfSub)[, 'coverage'])['Median']), 10% of $m$ might represent an optimal choice.

step

numeric, if method = "cwt", step size of window shift. If two high-confidence substitution sites are located within a distance less than step, the wavelet transform is computed only once. Default: 1, i.e. each high-confidence substitution site is considered independently.

snr

numeric, if method = "cwt", signal-to-noise ratio controlling the peak calling as performed by wavCWTPeaks implemented in the wmtsa package. Default: 3.

Value

GRanges object containing the identified cluster boundaries.

References

William Constantine and Donald Percival (2011), wmtsa: Wavelet Methods for Time Series Analysis, http://CRAN.R-project.org/package=wmtsa

Sievers C, Schlumpf T, Sawarkar R, Comoglio F and Paro R. (2012) Mixture models and wavelet transforms reveal high confidence RNA-protein interaction sites in MOV10 PAR-CLIP data, Nucleic Acids Res. 40(20):e160. doi: 10.1093/nar/gks697

Comoglio F, Sievers C and Paro R (2015) Sensitive and highly resolved identification of RNA-protein interaction sites in PAR-CLIP data, BMC Bioinformatics 16, 32.

Examples

Run this code

filename <- system.file( "extdata", "example.bam", package = "wavClusteR" )
example <- readSortedBam( filename = filename )
countTable <- getAllSub( example, minCov = 10, cores = 1 )
highConfSub <- getHighConfSub( countTable, supportStart = 0.2, supportEnd = 0.7, substitution = "TC" )
coverage <- coverage( example )
clusters <- getClusters( highConfSub = highConfSub, 
                         coverage = coverage, 
                         sortedBam = example, 
	                 method = 'mrn', 
	                 cores = 1, 
	                 threshold = 2 )