preparePairs: Prepare Hi-C pairs

Description

Identifies the interacting pair of restriction fragments corresponding to each read pair in a Hi-C library.

Usage

preparePairs(bam, param, file, dedup=TRUE, minq=NA,  ichim=TRUE, chim.dist=NA, output.dir=NULL)

Arguments

bam

a character string containing the path to a name-sorted BAM file

param

a pairParam object containing read extraction parameters

file

a character string specifying the path to an output index file

dedup

a logical scalar indicating whether marked duplicate reads should be removed

minq

an integer scalar specifying the minimum mapping quality for each read

ichim

a logical scalar indicating whether invalid chimeras should be counted

chim.dist

an integer scalar specifying the maximum distance between segments for a valid chimeric read pair

output.dir

a character string specifying a directory for temporary files

Value

Multiple dataframe objects are stored within the specified file using the HDF5 format. Each object corresponds to a pair of chromosomes, designated as the anchor1 (later) or anchor2 (earlier) chromosome based on the order of their names. Each row of the dataframe contains information for a read pair, with one read mapped to each chromosome. The dataframe contains several integer fields:

anchor1.id, anchor2.id:: index of the anchor1 or anchor2 restriction fragment
anchor1.pos, anchor2.pos:: mapping coordinate of the read (or for chimeras, the 5' segment thereof) on the anchor1 or anchor2 fragment
anchor1.len, anchor2.len:: length of the alignment on the anchor1 or anchor2 fragment, set to a negative value if the alignment is on the reverse strand

See getPairData for more details on length, orientation and gap.An integer vector is also returned from the function, containing various diagnostics:

pairs:: an integer vector containing total, the total number of read pairs; marked, read pairs with at least one marked 5' end; filtered, read pairs where the MAPQ score for either 5' end is below minq; mapped, read pairs considered as successfully mapped (i.e., not filtered, and also not marked if dedup=TRUE)
same.id:: an integer vector containing dangling, the number of read pairs that are dangling ends; and self.circles, the number of read pairs forming self-circles
singles:: an integer scalar specifying the number of reads without a mate
chimeras:: an integer vector containing total, the total number of read pairs with one chimeric read; mapped, chimeric read pairs with both 5' ends mapped; multi, mapped chimeric pairs with at least one successfully mapped 3' segment; and invalid, read pairs where the 3' location of one read disagrees with the 5' location of the mate

Converting to restriction fragment indices

The resolution of a Hi-C experiment is defined by the distribution of restriction sites across the genome. Thus, it makes sense to describe interactions in terms of restriction fragments. This function identifies the interacting fragments corresponding to each pair of reads in a Hi-C library. To save space, it stores the indices of the interacting fragments for each read pair, rather than the fragments themselves. Indexing is performed by matching up the mapping coordinates for each read with the restriction fragment boundaries in param$fragments. Needless to say, the boundary coordinates in param$fragments must correspond to the reference genome being used. In most cases, these can be generated using the cutGenome function from any given BSgenome object. If, for any reason, a modified genome is used for alignment, then the coordinates of the restriction fragments on the modified genome are required. Each read pair subsequently becomes associated with a pair of restriction fragments. The anchor1 fragment is that with the higher genomic coordinate, i.e., the larger index in param$fragments. The anchor2 fragment is that with the smaller coordinate/index. This definition avoids the need to consider both permutations of indices in a pair during downstream processing.

Handling of read pairs

For pairs with chimeric reads, the alignment of the 5' end of the read is used to identify the interacting fragments. Invalid chimeras arise when the position of the 3' segment of a chimeric read is not consistent with that of the mate read. Invalid chimeric pairs are generally indicative of mapping errors but can also form due to non-specific ligation events. While they can be explicitly removed, setting ichim=TRUE is recommended to avoid excessive filtering of reads when alignment of short chimeric segments is inaccurate. By default, invalid chimeras are defined as all pairs where the 3' segment and the mate do not map onto the same restriction fragment in an inward-facing orientation. Alternatively, a distance-based threshold can be used by setting chim.dist to some reasonable value, e.g., 1000 bp. Chimeras are only considered invalid if the distance between the segment and mate is greater than chim.dist (or if the alignments are not inward-facing). This may be more relevant in situations involving inefficient cleavage, where the mapping locations are broadly consistent but do not fall in the same restriction fragment. Self-circles are outward-facing read pairs mapped to the same restriction fragment. These are formed from inefficient cross-linking and are generally uninformative. Dangling ends are inward-facing read pairs mapped to the same fragment, and are generated from incomplete ligation of blunt ends. Both constructs are detected and discarded within the function. Note that this does not consider dangling ends or self-circles formed from incompletely digested fragments, which must be removed with prunePairs. In all cases, if the 5' end of either read is unavailable (e.g. unmapped, mapping quality score below minq, marked as a duplicate), the read pair is recorded as unpaired and discarded. By default, no MAPQ filtering is performed when minq is set to NA. Any duplicate read must be marked in the bit field of the BAM file using a tool like Picard's MarkDuplicates if it is to be removed with dedup=TRUE. For chimeric reads, the recommended approach is to designate the 5' end as the primary alignment. This ensures that the duplicate computations are performed on the most relevant alignments for each read pair.

Miscellaneous information

If output.dir is not specified, a directory name is constructed using the tempfile command. If it is specified, users should make sure that no file already exists with that name. Otherwise, an error will be raised. This directory is used to store intermediate files that will be eventually processed into the HDF5 output file. Users should note that the use of a pairParam object for input is strictly for convenience. Any non-empty values of param$discard and param$restrict will be ignored here. Reads will not be discarded if they lie outside the specified chromosomes, or if they lie within blacklisted regions.

References

Imakaev M et al. (2012). Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat. Methods 9, 999-1003.

Belton, RP et al. (2012). Hi-C: a comprehensive technique to capture the conformation of genomes. Methods 58, 268-276.

Examples

Run this code

hic.file <- system.file("exdata", "hic_sort.bam", package="diffHic")
cuts <- readRDS(system.file("exdata", "cuts.rds", package="diffHic"))
param <- pairParam(cuts) 

tmpf <- "gunk.h5"
preparePairs(hic.file, param, tmpf)
preparePairs(hic.file, param, tmpf, minq=50)
preparePairs(hic.file, param, tmpf, ichim=TRUE)
preparePairs(hic.file, param, tmpf, dedup=FALSE)

Run the code above in your browser using DataLab