stackStringsFromBam
stacks the read sequences (or their quality
strings) stored in a BAM file over a user-specified region. alphabetFrequencyFromBam
computes the alphabet frequency of the
reads over a user-specified region.
Both functions take into account the CIGAR of each read to "lay" the read sequence (or its quality string) alongside the reference space. This step ensures that each nucleotide in a read is associated with the correct position on the reference sequence.
stackStringsFromBam(file, index=file, param, what="seq", use.names=FALSE, D.letter="-", N.letter=".", Lpadding.letter="+", Rpadding.letter="+")
alphabetFrequencyFromBam(file, index=file, param, what="seq", ...)
scanBam
for more information.
unlist(bamWhich(param))
must have length 1).
Alternatively, param
can be a GRanges or
RangesList object containing exactly 1 genomic region
(the strand will be ignored in case of a GRanges
object), or a character string specifying a single genomic region (in
the "chr14:5201-5300"
format).
"seq"
or "qual"
. If "seq"
(the default), the read sequences will be stacked. If "qual"
,
the read quality strings will be stacked.
sequenceLayer
function.
See ?sequenceLayer
for more details.
stackStrings
function defined in the
Biostrings package.
See ?stackStrings
in the Biostrings
package for more details.
stackStringsFromBam
: A rectangular (i.e. constant-width)
DNAStringSet object (if what
is "seq"
)
or BStringSet object (if what
is "qual"
).For alphabetFrequencyFromBam
: By default a matrix like one returned
by alphabetFrequency
. The matrix has 1 row per
nucleotide position in the specified region.
stackStringsFromBam
performs the 3 following steps:
readGAlignments
function. Note that if the file contains paired-end reads, the
pairing is ignored.
sequenceLayer
function.
stackStrings
function defined in the
Biostrings package.
alphabetFrequencyFromBam
also performs steps 1. and 2. but,
instead of stacking the sequences at step 3., it computes the nucleotide
frequencies for each genomic position in the specified region.
pileLettersAt
function for piling the letters
of a set of aligned reads on top of a set of genomic positions.
readGAlignments
function for loading read
sequences (or their quality strings) from a BAM file (via a
GAlignments object).
sequenceLayer
function for laying read sequences
alongside the reference space, using their CIGARs.
stackStrings
function in the
Biostrings package for stacking an arbitrary
XStringSet object.
alphabetFrequency
function in the
Biostrings package.
## ---------------------------------------------------------------------
## A. EXAMPLE WITH TOY DATA
## ---------------------------------------------------------------------
bamfile1 <- BamFile(system.file("extdata", "ex1.bam", package="Rsamtools"))
region1 <- GRanges("seq1", IRanges(1, 60)) # region of interest
## Stack the read sequences:
stackStringsFromBam(bamfile1, param=region1)
## Compute the "consensus matrix" (1 column per nucleotide position
## in the region of interest):
af <- alphabetFrequencyFromBam(bamfile1, param=region1, baseOnly=TRUE)
cm1a <- t(af[ , DNA_BASES])
cm1a
## Stack their quality strings:
stackStringsFromBam(bamfile1, param=region1, what="qual")
## Control the number of reads to display:
options(showHeadLines=18)
options(showTailLines=6)
stackStringsFromBam(bamfile1, param=GRanges("seq1", IRanges(61, 120)))
stacked_qseq <- stackStringsFromBam(bamfile1, param="seq2:1509-1519")
stacked_qseq # deletion in read 13
af <- alphabetFrequencyFromBam(bamfile1, param="seq2:1509-1519",
baseOnly=TRUE)
cm1b <- t(af[ , DNA_BASES]) # consensus matrix
cm1b
## Sanity check:
stopifnot(identical(consensusMatrix(stacked_qseq)[DNA_BASES, ], cm1b))
stackStringsFromBam(bamfile1, param="seq2:1509-1519", what="qual")
## ---------------------------------------------------------------------
## B. EXAMPLE WITH REAL DATA
## ---------------------------------------------------------------------
library(RNAseqData.HNRNPC.bam.chr14)
bamfile2 <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1])
## Region of interest:
region2 <- GRanges("chr14", IRanges(19650095, 19650159))
readGAlignments(bamfile2, param=ScanBamParam(which=region2))
stackStringsFromBam(bamfile2, param=region2)
af <- alphabetFrequencyFromBam(bamfile2, param=region2, baseOnly=TRUE)
cm2 <- t(af[ , DNA_BASES]) # consensus matrix
cm2
## ---------------------------------------------------------------------
## C. COMPUTE READ CONSENSUS SEQUENCE FOR REGION OF INTEREST
## ---------------------------------------------------------------------
## Let's write our own little naive function to go from consensus matrix
## to consensus sequence. For each nucleotide position in the region of
## interest (i.e. each column in the matrix), we select the letter with
## highest frequency. We also use special letter "*" at positions where
## there is a tie, and special letter "." at positions where all the
## frequencies are 0 (a particular type of tie):
cm_to_cs <- function(cm)
{
stopifnot(is.matrix(cm))
nr <- nrow(cm)
rnames <- rownames(cm)
stopifnot(!is.null(rnames) && all(nchar(rnames) == 1L))
selection <- apply(cm, 2,
function(x) {
i <- which.max(x)
if (x[i] == 0L)
return(nr + 1L)
if (sum(x == x[i]) != 1L)
return(nr + 2L)
i
})
paste0(c(rnames, ".", "*")[selection], collapse="")
}
cm_to_cs(cm1a)
cm_to_cs(cm1b)
cm_to_cs(cm2)
## Note that the consensus sequences we obtain are relative to the
## plus strand of the reference sequence.
Run the code above in your browser using DataLab