readGAlignments
Reading genomic alignments from a file
Read genomic alignments from a file (typically a BAM file) into a GAlignments, GAlignmentPairs, GAlignmentsList, or GappedReads object.
- Keywords
- manip
Usage
## Front-ends
readGAlignments(file, format="BAM", use.names=FALSE, ...)
readGAlignmentPairs(file, format="BAM", use.names=FALSE, ...)
readGAlignmentsList(file, format="BAM", use.names=FALSE, ...)
readGappedReads(file, format="BAM", use.names=FALSE, ...)
## BAM specific back-ends
readGAlignmentsFromBam(file, index=file, ..., use.names=FALSE, param=NULL, with.which_label=FALSE)
readGAlignmentPairsFromBam(file, index=file, use.names=FALSE, param=NULL, with.which_label=FALSE)
readGAlignmentsListFromBam(file, index=file, ..., use.names=FALSE, param=ScanBamParam(), with.which_label=FALSE)
readGappedReadsFromBam(file, index=file, use.names=FALSE, param=NULL, with.which_label=FALSE)
Arguments
- file
-
The path to the file to read or a BamFile object.
Can also be a BamViews object for
readGAlignmentsFromBam
. - format
-
Only
"BAM"
(the default) is supported for now. - use.names
- Use the query template names (QNAME field) as the names of the returned object? If not (the default), then the returned object has no names.
- ...
- Arguments passed to other methods.
- index
- The path to the index file of the BAM file to read.
Must be given without the '.bai' extension.
See
scanBam
in the Rsamtools packages for more information. - param
NULL
or a ScanBamParam object. Like forscanBam
, this influences what fields and which records are imported. However, note that the fields specified thru this ScanBamParam object will be loaded in addition to any field required for generating the returned object (GAlignments, GAlignmentPairs, or GappedReads object), but only the fields requested by the user will actually be kept as metadata columns of the object.By default (i.e.
param=NULL
orparam=ScanBamParam()
), no additional field is loaded. The flag used isscanBamFlag(isUnmappedQuery=FALSE)
forreadGAlignmentsFromBam
,readGappedReadsFromBam
andreadGAlignmentsListFromBam
(i.e. only records corresponding to mapped reads are loaded), andscanBamFlag(isUnmappedQuery=FALSE, isPaired=TRUE, hasUnmappedMate=FALSE)
forreadGAlignmentPairsFromBam
(i.e. only records corresponding to paired-end reads with both ends mapped are loaded).- with.which_label
TRUE
orFALSE
(the default). IfTRUE
and ifparam
has awhich
component, a"which_label"
metadata column is added to the returned GAlignments or GappedReads object, or to thefirst
andlast
components of the returned GAlignmentPairs object. In the case ofreadGAlignmentsListFromBam
, it's added as an inner metadata column, that is, the metadata column is placed on the GAlignments object obtained by unlisting the returned GAlignmentsList object.The purpose of this metadata column is to unambiguously identify the range in
which
where each element in the returned object originates from. The labels used to identify the ranges are normally of the form"seq1:12250-246500"
, that is, they're the same as the names found on the outer list thatscanBam
would return if called with the sameparam
argument. If some ranges are duplicated, then the labels are made unique by appending a unique suffix to all of them. The"which_label"
metadata column is represented as a factor-Rle.
Details
See ?GAlignments
for a description of GAlignments
objects.
See ?GappedReads
for a description of GappedReads
objects.
Front-ends
readGAlignments
reads a file containing aligned reads as a
GAlignments object.
readGAlignmentPairs
reads a file containing aligned paired-end
reads as a GAlignmentPairs object.
readGAlignmentsList
reads a file containing aligned reads as a
GAlignmentsList object.
readGappedReads
reads a file containing aligned reads as a
GappedReads object.
By default (i.e. use.names=FALSE
), the resulting object has no
names. If use.names
is TRUE
, then the names are
constructed from the query template names (QNAME field in a SAM/BAM
file). Note that the 2 records in a pair (when using
readGAlignmentPairs
or the records in a group (when using
readGAlignmentsList
) have the same QNAME.
These functions are just front-ends that delegate to a
format-specific back-end function depending on the supplied format
argument. The use.names
argument and any extra argument are
passed to the back-end function.
Only the BAM format is supported for now via the read*FromBam
back-end functions.
BAM specific back-ends
When file
is BamViews object
readGAlignmentsFromBam
visits each path in bamPaths(file)
,
returning the result of readGAlignmentsFromBam
applied to the
specified path. When index
is missing, it is set equal to
bamIndicies(file)
. Only reads in bamRanges(file)
are
returned (if param
is supplied, bamRanges(file)
takes
precedence over bamWhich(param)
).
The return value is a SimpleList object, with elements
of the list corresponding to each path. bamSamples(file)
is
available as metadata columns (accessed with mcols
) of the
returned SimpleList object.
readGAlignmentPairsFromBam
proceeds in 2 steps:
- Load the BAM file into a GAlignmentsList
object with
readGAlignmentsListFromBam
(see below); - Turn this GAlignmentsList object into a
GAlignmentPairs object. Only list elements marked with
mate status
"mated"
go into the returned GAlignmentPairs object.
See ?GAlignmentPairs
for a description of
GAlignmentPairs objects.
readGAlignmentsListFromBam
pairs records into mates
according to the pairing criteria described below. The 1st mate
will always be 1st in the GAlignmentsList
list elements that
have mate_status set to "mated"
, and the 2nd mate will
always be 2nd.
A GAlignmentsList
is returned with a mate_status
metadata column on the outer list elements. mate_status
is a
factor with 3 levels indicating mate status, mated,
ambiguous or unmated
.
Mate status:
- mated: primary or non-primary pairs
- ambiguous: multiple segments matching to the same location (indistinguishable)
- unmated: mate does not exist or is unmapped
When the file argument is a BamFile, asMates=TRUE
must be set, otherwise the data are treated as single-end reads.
See the asMates section of ?BamFile
for details.
Flags, tags and ranges may be specified in the ScanBamParam
for fine tuning of results.
See ?GAlignmentsList-class
for a
description of GAlignmentsList objects.
Pairing criteria
This section describes the pairing criteria used by
readGAlignmentsListFromBam
and readGAlignmentPairsFromBam
.
- First, only records with flag bit 0x1 (multiple segments) set to 1, flag bit 0x4 (segment unmapped) set to 0, and flag bit 0x8 (next segment in the template unmapped) set to 0, are candidates for pairing (see the SAM Spec for a description of flag bits and fields). Records that correspond to single-end reads, or records that correspond to paired-end reads where one or both ends are unmapped, will remain unmated.
- (A) QNAME
- (B) RNAME, RNEXT
- (C) POS, PNEXT
- (D) Flag bits Ox10 (segment aligned to minus strand) and 0x20 (next segment aligned to minus strand)
- (E) Flag bits 0x40 (first segment in template) and 0x80 (last segment in template)
- (F) Flag bit 0x2 (proper pair)
- (G) Flag bit 0x100 (secondary alignment)
2 records rec1 and rec2 are considered mates iff all the following conditions are satisfied:
- (A) QNAME(rec1) == QNAME(rec2)
- (B) RNEXT(rec1) == RNAME(rec2) and RNEXT(rec2) == RNAME(rec1)
- (C) PNEXT(rec1) == POS(rec2) and PNEXT(rec2) == POS(rec1)
- (D) Flag bit 0x20 of rec1 == Flag bit 0x10 of rec2 and Flag bit 0x20 of rec2 == Flag bit 0x10 of rec1
- (E) rec1 corresponds to the first segment in the template and rec2 corresponds to the last segment in the template, OR, rec2 corresponds to the first segment in the template and rec1 corresponds to the last segment in the template
- (F) rec1 and rec2 have same flag bit 0x2
- (G) rec1 and rec2 have same flag bit 0x100
Note that this is actually the pairing criteria used by
scanBam
(when the BamFile
passed to it has the asMates
toggle set to TRUE
), which
readGAlignmentsListFromBam
and readGAlignmentPairsFromBam
call behind the scene. It is also the pairing criteria used by
findMateAlignment
.
Value
-
A GAlignments object for
readGAlignmentsFromBam
.A GAlignmentPairs object for readGAlignmentPairsFromBam
.
Note that a BAM (or SAM) file can in theory contain a mix of single-end
and paired-end reads, but in practise it seems that single-end and
paired-end are not mixed. In other words, the value of flag bit 0x1
(isPaired
) is the same for all the records in a file.
So if readGAlignmentPairsFromBam
returns a
GAlignmentPairs object of length zero,
this almost certainly means that the BAM (or SAM) file contains
alignments for single-end reads (although it could also mean that the
user-supplied ScanBamParam
is filtering out everything,
or that the file is empty, or that all the records in the file correspond
to unmapped reads).A GAlignmentsList object for readGAlignmentsListFromBam
.
When the list contains paired-end reads a metadata data column of
mate_status
is added to the object. See details in the
`Bam specific back-ends' section on this man page.A GappedReads object for readGappedReadsFromBam
.
Note
BAM records corresponding to unmapped reads are always ignored.
Starting with Rsamtools 1.7.1 (BioC 2.10), PCR or optical duplicates
are loaded by default (use scanBamFlag(isDuplicate=FALSE)
to
drop them).
See Also
-
scanBam
andScanBamParam
in the Rsamtools package. - GAlignments, GAlignmentPairs, GAlignmentsList,
and GappedReads objects.
-
findMateAlignment
.
Examples
## ---------------------------------------------------------------------
## A. readGAlignmentsFromBam()
## ---------------------------------------------------------------------
## Simple use:
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
mustWork=TRUE)
gal1 <- readGAlignmentsFromBam(bamfile)
gal1
names(gal1)
## Using the 'use.names' arg:
gal2 <- readGAlignmentsFromBam(bamfile, use.names=TRUE)
gal2
head(names(gal2))
## Using the 'param' arg to drop PCR or optical duplicates as well as
## secondary alignments, and to load additional BAM fields:
param <- ScanBamParam(flag=scanBamFlag(isDuplicate=FALSE,
isNotPrimaryRead=FALSE),
what=c("qual", "flag"))
gal3 <- readGAlignmentsFromBam(bamfile, param=param)
gal3
mcols(gal3)
## Using the 'param' arg to load reads from particular regions.
## Note that if we weren't providing a 'what' argument here, all the
## BAM fields would be loaded:
which <- RangesList(seq1=IRanges(1000, 2000),
seq2=IRanges(c(100, 1000), c(1000, 2000)))
param <- ScanBamParam(which=which)
gal4 <- readGAlignmentsFromBam(bamfile, param=param)
gal4
## Note that a given record is loaded one time for each region it
## belongs to (this is a scanBam() feature, readGAlignmentsFromBam()
## is based on scanBam()):
which <- IRangesList(seq2=IRanges(c(1563, 1567), width=1))
param <- ScanBamParam(which=which)
gal5 <- readGAlignmentsFromBam(bamfile, param=param)
gal5
## Use 'with.which_label=TRUE' to identify the range in 'which'
## where each element in 'gal5' originates from.
gal5 <- readGAlignmentsFromBam(bamfile, param=param,
with.which_label=TRUE)
gal5
## Using the 'param' arg to load tags. Except for MF and Aq, the tags
## specified below are predefined tags (see the SAM Spec for the list
## of predefined tags and their meaning).
param <- ScanBamParam(tag=c("MF", "Aq", "NM", "UQ", "H0", "H1"),
what="isize")
gal6 <- readGAlignmentsFromBam(bamfile, param=param)
mcols(gal6) # "tag" cols always after "what" cols
## With a BamViews object:
fls <- system.file("extdata", "ex1.bam", package="Rsamtools",
mustWork=TRUE)
bv <- BamViews(fls,
bamSamples=DataFrame(info="test", row.names="ex1"),
auto.range=TRUE)
aln <- readGAlignmentsFromBam(bv)
aln
aln[[1]]
aln[colnames(bv)]
mcols(aln)
## ---------------------------------------------------------------------
## B. readGAlignmentPairsFromBam()
## ---------------------------------------------------------------------
galp1 <- readGAlignmentPairsFromBam(bamfile)
head(galp1)
names(galp1)
## Using the 'param' arg to drop PCR or optical duplicates as well as
## secondary alignments (dropping secondary alignments can help make the
## pairing algorithm run significantly faster, see ?findMateAlignment):
param <- ScanBamParam(flag=scanBamFlag(isDuplicate=FALSE,
isNotPrimaryRead=FALSE))
galp2 <- readGAlignmentPairsFromBam(bamfile, use.names=TRUE, param=param)
galp2
head(galp2)
head(names(galp2))
## ---------------------------------------------------------------------
## C. readGAlignmentsListFromBam()
## ---------------------------------------------------------------------
library(pasillaBamSubset)
## 'file' as character.
fl <- untreated3_chr4()
galist1 <- readGAlignmentsListFromBam(fl)
galist1[1:3]
length(galist1)
table(elementLengths(galist1))
## When 'file' is a BamFile, 'asMates' must be TRUE. If FALSE,
## the data are treated as single-end and each list element of the
## GAlignmentsList will be of length 1. For single-end data
## use readGAlignments().
bf <- BamFile(fl, yieldSize=3, asMates=TRUE)
readGAlignmentsList(bf)
## Use a 'param' to fine tune the results.
param <- ScanBamParam(flag=scanBamFlag(isProperPair=TRUE))
galist2 <- readGAlignmentsListFromBam(fl, param=param)
length(galist2)
## ---------------------------------------------------------------------
## D. readGappedReadsFromBam()
## ---------------------------------------------------------------------
greads1 <- readGappedReadsFromBam(bamfile)
greads1
names(greads1)
qseq(greads1)
greads2 <- readGappedReadsFromBam(bamfile, use.names=TRUE)
head(greads2)
head(names(greads2))