preparePairs(bam, param, file, dedup=TRUE, minq=NA, ichim=TRUE, chim.dist=NA, output.dir=NULL)
pairParam
object containing read extraction parametersfile
using the HDF5 format.
Each object corresponds to a pair of chromosomes, designated as the anchor1 (later) or anchor2 (earlier) chromosome based on the order of their names.
Each row of the dataframe contains information for a read pair, with one read mapped to each chromosome.
The dataframe contains several integer fields:
anchor1.id
, anchor2.id
:anchor1.pos
, anchor2.pos
:anchor1.len
, anchor2.len
:getPairData
for more details on length
, orientation
and gap
.An integer vector is also returned from the function, containing various diagnostics:
pairs
:total
, the total number of read pairs; marked
, read pairs with at least one marked 5' end; filtered
, read pairs where the MAPQ score for either 5' end is below minq
; mapped
, read pairs considered as successfully mapped (i.e., not filtered, and also not marked if dedup=TRUE
)same.id
:dangling
, the number of read pairs that are dangling ends; and self.circles
, the number of read pairs forming self-circlessingles
:chimeras
:total
, the total number of read pairs with one chimeric read; mapped
, chimeric read pairs with both 5' ends mapped; multi
, mapped chimeric pairs with at least one successfully mapped 3' segment; and invalid
, read pairs where the 3' location of one read disagrees with the 5' location of the mateparam$fragments
.
Needless to say, the boundary coordinates in param$fragments
must correspond to the reference genome being used.
In most cases, these can be generated using the cutGenome
function from any given BSgenome
object.
If, for any reason, a modified genome is used for alignment, then the coordinates of the restriction fragments on the modified genome are required. Each read pair subsequently becomes associated with a pair of restriction fragments.
The anchor1 fragment is that with the higher genomic coordinate, i.e., the larger index in param$fragments
.
The anchor2 fragment is that with the smaller coordinate/index.
This definition avoids the need to consider both permutations of indices in a pair during downstream processing.ichim=TRUE
is recommended to avoid excessive filtering of reads when alignment of short chimeric segments is inaccurate. By default, invalid chimeras are defined as all pairs where the 3' segment and the mate do not map onto the same restriction fragment in an inward-facing orientation.
Alternatively, a distance-based threshold can be used by setting chim.dist
to some reasonable value, e.g., 1000 bp.
Chimeras are only considered invalid if the distance between the segment and mate is greater than chim.dist
(or if the alignments are not inward-facing).
This may be more relevant in situations involving inefficient cleavage, where the mapping locations are broadly consistent but do not fall in the same restriction fragment. Self-circles are outward-facing read pairs mapped to the same restriction fragment.
These are formed from inefficient cross-linking and are generally uninformative.
Dangling ends are inward-facing read pairs mapped to the same fragment, and are generated from incomplete ligation of blunt ends.
Both constructs are detected and discarded within the function.
Note that this does not consider dangling ends or self-circles formed from incompletely digested fragments, which must be removed with prunePairs
. In all cases, if the 5' end of either read is unavailable (e.g. unmapped, mapping quality score below minq
, marked as a duplicate), the read pair is recorded as unpaired and discarded.
By default, no MAPQ filtering is performed when minq
is set to NA
.
Any duplicate read must be marked in the bit field of the BAM file using a tool like Picard's MarkDuplicates
if it is to be removed with dedup=TRUE
.
For chimeric reads, the recommended approach is to designate the 5' end as the primary alignment.
This ensures that the duplicate computations are performed on the most relevant alignments for each read pair.output.dir
is not specified, a directory name is constructed using the tempfile
command.
If it is specified, users should make sure that no file already exists with that name.
Otherwise, an error will be raised.
This directory is used to store intermediate files that will be eventually processed into the HDF5 output file. Users should note that the use of a pairParam
object for input is strictly for convenience.
Any non-empty values of param$discard
and param$restrict
will be ignored here.
Reads will not be discarded if they lie outside the specified chromosomes, or if they lie within blacklisted regions.Belton, RP et al. (2012). Hi-C: a comprehensive technique to capture the conformation of genomes. Methods 58, 268-276.
cutGenome
,
prunePairs
,
mergePairs
,
getPairData
hic.file <- system.file("exdata", "hic_sort.bam", package="diffHic")
cuts <- readRDS(system.file("exdata", "cuts.rds", package="diffHic"))
param <- pairParam(cuts)
tmpf <- "gunk.h5"
preparePairs(hic.file, param, tmpf)
preparePairs(hic.file, param, tmpf, minq=50)
preparePairs(hic.file, param, tmpf, ichim=TRUE)
preparePairs(hic.file, param, tmpf, dedup=FALSE)
Run the code above in your browser using DataLab