AlignedData-class: Class `AlignedData`

Description

This class gives tools to read, store and manage simple data from HTS experiments. This class is specially optimized for chromatin-oriented sequencing experiments (ChIP-Seq, MNase-Seq, Faire-Seq) since it is mainly focused on reads position rather than sequences, ignores alignments with non-simple CIGAR strings (non canonical alignments, ie. insertions/deletions) and secondary alignments, and is not designed to represent features with undefined strands. It directly uses Rsamtools to read data from BAM files. It has a similar global structure to ShortRead and GAlignments classes but have some specific features.

Arguments

Objects from the Class

Objects can be created by a call of the form

new("AlignedData", readID, seqnames, strand, position, mapq,
isize, qwidth, flag, mrnm, notAligned, pairedEnds, weight)

, or by using the dedicated constructor that builds an object from a file.

Details

The specific features of this class are that (i) read sequences are not read from aligned files, (ii) only R native types are used to store data in memory, and (iii) it can represent both paired-end or single-end data by using a similar internal structure as what has been defined for SAM/BAM formats. For reading data from 'non-BAM' formats (aligner-specific), the constructor relies on ShortRead library (see ShortRead for details).

This class is a simplistic representation of aligned reads for users aiming at using piling options offered by Pasha package. Secondary alignments, reads aligned with insertions/deletions (non simple CIGAR), and not aligned reads are discarded by the constructor, and reads with a strand value other than -/+ will raise a warning (strand representation keeps levels c('+', '-', '*') for consistency with bioconductor packages). For other applications requiring exhaustive handling of CIGAR strings (insertions/deletions), undefined strand values, or to keep reads that could not be aligned, the use of GAlignments class family is recommended.

Specific functions are provided for a seamless conversion to/from GAlignments class. It is however important to note that eventual reads with non-simple CIGAR string will be dropped (with a warning) during conversions from GAlignments objects.

Internally, this class defines different kinds of slots. First, slots where a value is mandatory for each read. Second, slots with atomic values (one for the whole object : "notAligned", "pairedEnds"), and finally optionnal slots which can be empty (no information) or should provide information for each read ("mapq", "isize", "flag", "mrnm", "weight", "qwidth").

[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Paired-end specificities

The limitations imposed by the constructor when reading paired-end datasets (no unaligned reads, and no secondary alignments) allow to represent the mate reads association by their shared 'qname' (reported as 'readID'). Altering such IDs can thus break the pairs relation as seen by methods of the class.

Several methods are dedicated for paired-ends datasets (see 'checkPairsOK', 'filterInsertSize', 'getOrphansIndexes', 'cleanNonSimplePairs', 'sortByPairs'). These methods will emit a warning and not process data when called for a single-end object.

Examples

Run this code

# Get the BAM file used in RSamTools as example
BAMfile <- system.file("extdata","ex1.bam",package="Rsamtools")
# Testing the public constructor of AlignedData class (from Pasha)
alignedDataObject <- readAlignedData(folderName="" ,
                                     fileName=BAMfile, 
                                     fileType="BAM", 
                                     pairedEnds=TRUE)
print(alignedDataObject)
                             
# Rename the chromosomes from seqX to chrX
alignedDataObject <- normalizeChrNames(alignedDataObject,
                                       chrPrefix="seq",
                                       chrSuffix="")
print(alignedDataObject)
                               
                               
#### Splitting object (indirect use of '[') 
# Split the object and the associated data by chromosomes 
splitListalignedDataObject <- split(alignedDataObject, 
                                    seqnames(alignedDataObject))
# check that the object has correctly been split
print(names(splitListalignedDataObject))
print(sapply(lapply(splitListalignedDataObject, seqnames), unique))


#### Orphans and pairs consistency
print(length(alignedDataObject))
# Identify the reads with no mate and remove them
orphansReadsIndex <- getOrphansIndexes(alignedDataObject)
print(paste("Nb of orphan reads :", sum(orphansReadsIndex)))
alignedDataObject <- alignedDataObject[!orphansReadsIndex]
print(length(alignedDataObject))

# Check that pairs are correctly represented in the object
# ie. mate pairs are consecutive
print(checkPairsOK(alignedDataObject))
# Sort by pairs
alignedDataObject <- sortByPairs(alignedDataObject)
# Check pairs consistency again
print(checkPairsOK(alignedDataObject))


#### Filtering

## Chromosome patterns
# (drop reads on chromosomes containing the character '1')
print(alignedDataObject)
resultFiltering <- dropChromosomePattern(alignedDataObject, 
                                           pattern="1")
print(resultFiltering)

## Range of insert size
print(table(isize(alignedDataObject)))
resultFiltering <- filterInsertSize(alignedDataObject, 
                                    rangeMin=190, 
                                    rangeMax=195)
print(table(isize(resultFiltering)))
print(resultFiltering)

## Reads sequence size
print(table(qwidth(alignedDataObject)))
resultFiltering <- filterReadSize(alignedDataObject, 
                                  rangeMin=35, 
                                  rangeMax=50)
print(table(qwidth(resultFiltering)))
print(resultFiltering)
                          
## Reads with atypic characteristics for chromatin alignments
# Mate unmapped flag set, or mates aligned on same strand
resultFiltering <- cleanNonSimplePairs(alignedDataObject, quiet=FALSE)
# Reads with a strand value different from expected '+' or '-'
resultFiltering <- dropUndefinedStrand(resultFiltering, quiet=FALSE)

Run the code above in your browser using DataLab

Description

Arguments

Objects from the Class

Details

See Also

Examples