NOTE: Until BioC 2.13,
findMateAlignment was the workhorse used by
readGAlignmentPairs for pairing the records loaded
from a BAM file containing aligned paired-end reads.
Starting with BioC 2.14, readGAlignmentPairs relies on
scanBam(BamFile(asMates=TRUE), ...) for the pairing.
    findMateAlignment(x)
    makeGAlignmentPairs(x, use.names=FALSE, use.mcols=FALSE, strandMode=1)

    ## Related low-level utilities:
    getDumpedAlignments()
    countDumpedAlignments()
    flushDumpedAlignments()
x: A named GAlignments object with metadata columns flag, mrnm, and mpos. Typically obtained by loading aligned paired-end reads from a BAM file with:
    param <- ScanBamParam(what=c("flag", "mrnm", "mpos"))
    x <- readGAlignments(..., use.names=TRUE, param=param)
findMateAlignment: An integer vector of the same length as x, containing only positive or NA values, where the i-th element is the index in x of the mate of x[i], or NA if no mate (or more than one mate) was found for x[i].
makeGAlignmentPairs: A GAlignmentPairs object where the pairs are formed internally by calling findMateAlignment on x.
getDumpedAlignments: NULL or a GAlignments object containing the dumped alignments. See the "Dumped alignments" subsection in the "Details" section for the details.
countDumpedAlignments: The number of dumped alignments.
flushDumpedAlignments: Nothing.
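To make the return value of findMateAlignment concrete, here is a small base R sketch that turns a mate vector into a two-column table of pairs. The vector below is a made-up toy value, not output from a real BAM file, and the sketch assumes the pairing is symmetric (if j is the mate of i, then i is the mate of j).

```r
# Toy mate vector for 6 records (hypothetical values):
# records 1 and 2 are mates, 4 and 6 are mates, 3 and 5 are unpaired.
mate <- c(2L, 1L, NA, 6L, NA, 4L)

# Indices of the records that were paired.
paired_idx <- which(!is.na(mate))

# List each pair once by keeping only the direction i < mate[i].
keep <- paired_idx[paired_idx < mate[paired_idx]]
pairs <- cbind(i = keep, j = mate[keep])
pairs
```

Records whose entry is NA never appear in the table, which mirrors how unpaired or ambiguously paired alignments are left out of the result.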
findMateAlignment is the workhorse used by
makeGAlignmentPairs for pairing the records loaded from a BAM file containing aligned paired-end reads.
It implements the following pairing algorithm:
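First, only records with flag bit 0x1 (multiple segments) set to 1, flag bit 0x4 (segment unmapped) set to 0, and flag bit 0x8 (next segment in the template unmapped) set to 0 are candidates for pairing (see the SAM Spec for a description of flag bits and fields). findMateAlignment will ignore any other record. That is, records that correspond to single-end reads, or records that correspond to paired-end reads where one or both ends are unmapped, are discarded.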
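Two records rec1 and rec2 are considered mates iff all the following conditions are satisfied:
    (A) QNAME(rec1) == QNAME(rec2)
    (B) RNEXT(rec1) == RNAME(rec2) and RNEXT(rec2) == RNAME(rec1)
    (C) PNEXT(rec1) == POS(rec2) and PNEXT(rec2) == POS(rec1)
    (D) Flag bit 0x20 of rec1 == flag bit 0x10 of rec2 and flag bit 0x20 of rec2 == flag bit 0x10 of rec1
    (E) rec1 corresponds to the first segment in the template and rec2 to the last segment in the template, or vice versa
    (F) rec1 and rec2 have the same flag bit 0x2
    (G) rec1 and rec2 have the same flag bit 0x100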
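The candidate filter and the mate test can be sketched in base R as follows. This is an illustration only, not the package's (vectorised) implementation: the helper names is_candidate, bit and is_mate are hypothetical, the records are made-up lists holding SAM fields, and the sketch assumes the conditions compare QNAME, RNAME/RNEXT, POS/PNEXT, the strand bits 0x10/0x20, the first/last bits 0x40/0x80, the proper-pair bit 0x2 and the secondary bit 0x100.

```r
# TRUE where the given flag bit is set.
bit <- function(flag, mask) bitwAnd(flag, mask) != 0L

# Candidate filter: paired (0x1), with neither end unmapped (0x4, 0x8).
is_candidate <- function(flag) {
  bit(flag, 0x1L) & !bit(flag, 0x4L) & !bit(flag, 0x8L)
}

# Mate test for two records given as lists of SAM fields (hypothetical layout).
is_mate <- function(r1, r2) {
  r1$qname == r2$qname &&                                  # (A)
    r1$rnext == r2$rname && r2$rnext == r1$rname &&        # (B)
    r1$pnext == r2$pos && r2$pnext == r1$pos &&            # (C)
    bit(r1$flag, 0x20L) == bit(r2$flag, 0x10L) &&
    bit(r2$flag, 0x20L) == bit(r1$flag, 0x10L) &&          # (D)
    ((bit(r1$flag, 0x40L) && bit(r2$flag, 0x80L)) ||
     (bit(r2$flag, 0x40L) && bit(r1$flag, 0x80L))) &&      # (E)
    bit(r1$flag, 0x2L) == bit(r2$flag, 0x2L) &&            # (F)
    bit(r1$flag, 0x100L) == bit(r2$flag, 0x100L)           # (G)
}

# Made-up records: flag 99 = 0x1+0x2+0x20+0x40, flag 147 = 0x1+0x2+0x10+0x80.
rec1 <- list(qname="read1", rname="chr1", pos=100L,
             rnext="chr1", pnext=300L, flag=99L)
rec2 <- list(qname="read1", rname="chr1", pos=300L,
             rnext="chr1", pnext=100L, flag=147L)
rec3 <- modifyList(rec2, list(pos=305L))  # POS no longer matches PNEXT of rec1

is_mate(rec1, rec2)
is_mate(rec1, rec3)
```

Here rec1 and rec2 satisfy every condition, while rec3 fails the position check.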
Timing and memory requirement of the pairing algorithm

The estimated timings and memory requirements on a modern Linux system are (those numbers may vary depending on your hardware and OS):

    nb of alignments |         time | required memory
    -----------------+--------------+----------------
          8 millions |       28 sec |          1.4 GB
         16 millions |       58 sec |          2.8 GB
         32 millions |        2 min |          5.6 GB
         64 millions | 4 min 30 sec |         11.2 GB

This is for a GAlignments object coming from a file with an "average nb of records per unique QNAME" of 2.04. A value of 2 (which means the file contains only primary reads) is optimal for the pairing algorithm. A greater value, say > 3, will significantly degrade its performance. An easy way to avoid this degradation is to load only primary alignments by setting the isSecondaryAlignment flag to FALSE in ScanBamParam(). See examples in ?readGAlignmentPairs for how to do this.
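A minimal sketch of loading only primary alignments, assuming the Rsamtools and GenomicAlignments packages are installed and using the small ex1.bam file shipped with Rsamtools as a stand-in for a real data set. The variable name avg_per_qname is only illustrative.

```r
library(Rsamtools)
library(GenomicAlignments)

bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
                       mustWork=TRUE)

# Skip secondary alignments (flag bit 0x100) at load time.
flag <- scanBamFlag(isSecondaryAlignment=FALSE)
param <- ScanBamParam(flag=flag, what=c("flag", "mrnm", "mpos"))
x <- readGAlignments(bamfile, use.names=TRUE, param=param)

# Average number of records per unique QNAME (names(x) holds the QNAMEs).
avg_per_qname <- length(x) / length(unique(names(x)))
avg_per_qname
```

Checking this ratio before pairing gives a quick indication of whether the file is close to the optimal value of 2.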
Ambiguous pairing

The above algorithm will find almost all pairs unambiguously, even when the same pair of reads maps to several places in the genome. Note that, when a given pair maps to a single place in the genome, looking at (A) is enough to pair the 2 corresponding records. The additional conditions (B), (C), (D), (E), (F), and (G) are only here to help in the situation where more than 2 records share the same QNAME. And that works most of the time. Unfortunately, there are still situations where this is not enough to solve the pairing problem unambiguously.
For example, here are 4 records (loaded in a GAlignments object) that cannot be paired with the above algorithm:
Showing the 4 records as a GAlignments object of length 4:

    GAlignments with 4 alignments and 2 metadata columns:
                      seqnames strand      cigar    qwidth     start       end
    SRR031714.2658602    chr2R      + 21M384N16M        37   6983850   6984270
    SRR031714.2658602    chr2R      + 21M384N16M        37   6983850   6984270
    SRR031714.2658602    chr2R      - 13M372N24M        37   6983858   6984266
    SRR031714.2658602    chr2R      - 13M378N24M        37   6983858   6984272
                          width     njunc |      mrnm      mpos
    SRR031714.2658602       421         1 |     chr2R   6983858
    SRR031714.2658602       421         1 |     chr2R   6983858
    SRR031714.2658602       409         1 |     chr2R   6983850
    SRR031714.2658602       415         1 |     chr2R   6983850

Note that the BAM fields show up in the following columns: QNAME is the names of the GAlignments object, RNAME is the seqnames column, POS is the start column, RNEXT is the mrnm column, and PNEXT is the mpos column.
As you can see, the aligner has aligned the same pair to the same location twice! The only difference between the 2 aligned pairs is in the CIGAR i.e. one end of the pair is aligned twice to the same location with exactly the same CIGAR while the other end of the pair is aligned twice to the same location but with slightly different CIGARs.
Now showing the corresponding flag bits:
         isPaired isProperPair isUnmappedQuery hasUnmappedMate isMinusStrand
    [1,]        1            1               0               0             0
    [2,]        1            1               0               0             0
    [3,]        1            1               0               0             1
    [4,]        1            1               0               0             1
         isMateMinusStrand isFirstMateRead isSecondMateRead isSecondaryAlignment
    [1,]                 1               0                1                    0
    [2,]                 1               0                1                    0
    [3,]                 0               1                0                    0
    [4,]                 0               1                0                    0
         isNotPassingQualityControls isDuplicate
    [1,]                           0           0
    [2,]                           0           0
    [3,]                           0           0
    [4,]                           0           0
As you can see, rec(1) and rec(2) are second mates, rec(3) and rec(4) are both first mates. But looking at (A), (B), (C), (D), (E), (F), and (G), the pairs could be rec(1) <-> rec(3) and rec(2) <-> rec(4), or they could be rec(1) <-> rec(4) and rec(2) <-> rec(3). There is no way to disambiguate!
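The ambiguity can be checked with a small base R sketch that applies a mate test to every pair of the 4 records. It is an illustration under stated assumptions, not the package's code: the helper names bit and mates are hypothetical, the flags 163 and 83 are rebuilt from the flag bits shown above, and the test is assumed to compare QNAME, RNAME/RNEXT, POS/PNEXT and the flag bits 0x10/0x20, 0x40/0x80, 0x2 and 0x100 (the CIGAR is not compared).

```r
# The 4 records from the example, reduced to the fields used for pairing.
recs <- data.frame(
  qname = "SRR031714.2658602",
  rname = "chr2R",
  pos   = c(6983850L, 6983850L, 6983858L, 6983858L),
  rnext = "chr2R",
  pnext = c(6983858L, 6983858L, 6983850L, 6983850L),
  flag  = c(163L, 163L, 83L, 83L),  # 163 = 0x1+0x2+0x20+0x80, 83 = 0x1+0x2+0x10+0x40
  stringsAsFactors = FALSE
)

# TRUE where the given flag bit is set.
bit <- function(flag, mask) bitwAnd(flag, mask) != 0L

# Vectorised mate test between records i and j (row indices into recs).
mates <- function(i, j) with(recs,
  qname[i] == qname[j] &
  rnext[i] == rname[j] & rnext[j] == rname[i] &
  pnext[i] == pos[j] & pnext[j] == pos[i] &
  bit(flag[i], 0x20L) == bit(flag[j], 0x10L) &
  bit(flag[j], 0x20L) == bit(flag[i], 0x10L) &
  ((bit(flag[i], 0x40L) & bit(flag[j], 0x80L)) |
   (bit(flag[j], 0x40L) & bit(flag[i], 0x80L))) &
  bit(flag[i], 0x2L) == bit(flag[j], 0x2L) &
  bit(flag[i], 0x100L) == bit(flag[j], 0x100L))

idx <- seq_len(nrow(recs))
m <- outer(idx, idx, mates)

# Number of candidate mates per record; more than 1 means ambiguous.
n_candidates <- rowSums(m)
n_candidates
```

Every record ends up with 2 candidate mates, so none of the 4 can be paired unambiguously.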
findMateAlignment simply ignores (with a warning) those alignments
with ambiguous pairing, and dumps them in a place from which they can be
retrieved later (i.e. after
findMateAlignment has returned) for
further examination (see "Dumped alignments" subsection below for the details).
In other words, alignments that cannot be paired unambiguously are not paired
at all. Concretely, this means that makeGAlignmentPairs is
guaranteed to return a GAlignmentPairs object
where every pair was formed in a non-ambiguous way. Note that, in practice,
this approach doesn't seem to leave aside a lot of records because ambiguous
pairing events seem pretty rare.
Dumped alignments

Alignments with ambiguous pairing are dumped in a place ("the dump
environment") from which they can be retrieved with
getDumpedAlignments() after findMateAlignment has returned.
Two additional utilities are provided for manipulation of the dumped
alignments: countDumpedAlignments for counting them (a fast equivalent
to length(getDumpedAlignments())), and flushDumpedAlignments to
flush "the dump environment". Note that "the dump environment" is
automatically flushed at the beginning of a call to findMateAlignment.
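A short sketch of inspecting the dump environment after pairing, assuming the Rsamtools and GenomicAlignments packages are installed and using the ex1.bam file shipped with Rsamtools. Whether anything is dumped depends on the data, so no particular count is implied here; n_dumped and dumped are illustrative names.

```r
library(Rsamtools)
library(GenomicAlignments)

bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
                       mustWork=TRUE)
param <- ScanBamParam(what=c("flag", "mrnm", "mpos"))
x <- readGAlignments(bamfile, use.names=TRUE, param=param)

# Pair the records; ambiguous alignments, if any, go to the dump environment.
mate <- findMateAlignment(x)

# Count the dumped alignments first, then retrieve them only if there are any.
n_dumped <- countDumpedAlignments()
if (n_dumped > 0) {
  dumped <- getDumpedAlignments()  # a GAlignments object
  print(head(dumped))
}

# Empty the dump environment once the dumped alignments have been examined.
flushDumpedAlignments()
```

Counting before retrieving avoids building a GAlignments object when nothing was dumped.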
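    library(GenomicAlignments)  # also attaches Rsamtools (ScanBamParam)

    bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
                           mustWork=TRUE)
    param <- ScanBamParam(what=c("flag", "mrnm", "mpos"))
    x <- readGAlignments(bamfile, use.names=TRUE, param=param)

    ## Index of each record's mate in x (NA if the record was not paired).
    mate <- findMateAlignment(x)
    head(mate)
    table(is.na(mate))

    ## Form the pairs, first with defaults, then keeping names and the flag column.
    galp0 <- makeGAlignmentPairs(x)
    galp <- makeGAlignmentPairs(x, use.names=TRUE, use.mcols="flag")
    galp
    colnames(mcols(galp))
    colnames(mcols(first(galp)))
    colnames(mcols(last(galp)))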