stackStringsFromBam: Stack the read sequences stored in a BAM file on a region of interest

Description

stackStringsFromBam stacks the read sequences (or their quality strings) stored in a BAM file over a user-specified region.

alphabetFrequencyFromBam computes the alphabet frequency of the reads over a user-specified region.

Both functions take into account the CIGAR of each read to "lay" the read sequence (or its quality string) alongside the reference space. This step ensures that each nucleotide in a read is associated with the correct position on the reference sequence.

Usage

stackStringsFromBam(file, index=file, param, what="seq", use.names=FALSE, D.letter="-", N.letter=".", Lpadding.letter="+", Rpadding.letter="+")
alphabetFrequencyFromBam(file, index=file, param, what="seq", ...)

Arguments

file, index

The path to the BAM file to read, and to the index file of the BAM file to read, respectively. The latter is given without the '.bai' extension. See scanBam for more information.

param

A ScanBamParam object containing exactly 1 genomic region (i.e. unlist(bamWhich(param)) must have length 1). Alternatively, param can be a GRanges or RangesList object containing exactly 1 genomic region (the strand will be ignored in case of a GRanges object), or a character string specifying a single genomic region (in the "chr14:5201-5300" format).

what

A single string. Either "seq" or "qual". If "seq" (the default), the read sequences will be stacked. If "qual", the read quality strings will be stacked.

use.names

Use the query template names (QNAME field) as the names of the returned object? If not (the default), then the returned object has no names.

D.letter, N.letter

A single letter used as a filler for injections. The 2 arguments are passed down to the sequenceLayer function. See ?sequenceLayer for more details.

Lpadding.letter, Rpadding.letter

A single letter to use for padding the sequences on the left, and another one to use for padding on the right. The 2 arguments are passed down to the stackStrings function defined in the Biostrings package. See ?stackStrings in the Biostrings package for more details.

...

Further arguments to be passed to alphabetFrequency.

Value

For stackStringsFromBam: A rectangular (i.e. constant-width) DNAStringSet object (if what is "seq") or BStringSet object (if what is "qual").For alphabetFrequencyFromBam: By default a matrix like one returned by alphabetFrequency. The matrix has 1 row per nucleotide position in the specified region.

Details

stackStringsFromBam performs the 3 following steps:

Load the read sequences (or their quality strings) from the BAM file. Only the read sequences that overlap with the specified region are loaded. This is done with the readGAlignments function. Note that if the file contains paired-end reads, the pairing is ignored.
Lay the sequences alongside the reference space, using their CIGARs. This is done with the sequenceLayer function.
Stack them on the specified region. This is done with the stackStrings function defined in the Biostrings package.

alphabetFrequencyFromBam also performs steps 1. and 2. but, instead of stacking the sequences at step 3., it computes the nucleotide frequencies for each genomic position in the specified region.

Examples

Run this code

## ---------------------------------------------------------------------
## A. EXAMPLE WITH TOY DATA
## ---------------------------------------------------------------------

bamfile1 <- BamFile(system.file("extdata", "ex1.bam", package="Rsamtools"))

region1 <- GRanges("seq1", IRanges(1, 60))  # region of interest

## Stack the read sequences:
stackStringsFromBam(bamfile1, param=region1)

## Compute the "consensus matrix" (1 column per nucleotide position
## in the region of interest):
af <- alphabetFrequencyFromBam(bamfile1, param=region1, baseOnly=TRUE)
cm1a <- t(af[ , DNA_BASES])
cm1a

## Stack their quality strings:
stackStringsFromBam(bamfile1, param=region1, what="qual")

## Control the number of reads to display:
options(showHeadLines=18)
options(showTailLines=6)
stackStringsFromBam(bamfile1, param=GRanges("seq1", IRanges(61, 120)))

stacked_qseq <- stackStringsFromBam(bamfile1, param="seq2:1509-1519")
stacked_qseq  # deletion in read 13
af <- alphabetFrequencyFromBam(bamfile1, param="seq2:1509-1519",
                                baseOnly=TRUE)
cm1b <- t(af[ , DNA_BASES])  # consensus matrix
cm1b

## Sanity check:
stopifnot(identical(consensusMatrix(stacked_qseq)[DNA_BASES, ], cm1b))

stackStringsFromBam(bamfile1, param="seq2:1509-1519", what="qual")

## ---------------------------------------------------------------------
## B. EXAMPLE WITH REAL DATA
## ---------------------------------------------------------------------

library(RNAseqData.HNRNPC.bam.chr14)
bamfile2 <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1])

## Region of interest:
region2 <- GRanges("chr14", IRanges(19650095, 19650159))

readGAlignments(bamfile2, param=ScanBamParam(which=region2))

stackStringsFromBam(bamfile2, param=region2)

af <- alphabetFrequencyFromBam(bamfile2, param=region2, baseOnly=TRUE)
cm2 <- t(af[ , DNA_BASES])  # consensus matrix
cm2

## ---------------------------------------------------------------------
## C. COMPUTE READ CONSENSUS SEQUENCE FOR REGION OF INTEREST
## ---------------------------------------------------------------------

## Let's write our own little naive function to go from consensus matrix
## to consensus sequence. For each nucleotide position in the region of
## interest (i.e. each column in the matrix), we select the letter with
## highest frequency. We also use special letter "*" at positions where
## there is a tie, and special letter "." at positions where all the
## frequencies are 0 (a particular type of tie):
cm_to_cs <- function(cm)
{
    stopifnot(is.matrix(cm))
    nr <- nrow(cm)
    rnames <- rownames(cm)
    stopifnot(!is.null(rnames) && all(nchar(rnames) == 1L))
    selection <- apply(cm, 2,
                       function(x) {
                         i <- which.max(x)
                         if (x[i] == 0L)
                           return(nr + 1L)
                         if (sum(x == x[i]) != 1L)
                           return(nr + 2L)
                         i
                       })
    paste0(c(rnames, ".", "*")[selection], collapse="")
}

cm_to_cs(cm1a)
cm_to_cs(cm1b)
cm_to_cs(cm2)

## Note that the consensus sequences we obtain are relative to the
## plus strand of the reference sequence.

Run the code above in your browser using DataLab