diffHic (version 1.4.2)

DNaseHiC: Methods for processing DNase Hi-C data

Description

Processing of BAM files for DNase Hi-C into index files

Usage

segmentGenome(bs, size=500)
prepPseudoPairs(bam, param, file, dedup=TRUE, ichim=TRUE, chim.span=1000, minq=NA, output.dir=NULL)

Arguments

bs
a BSgenome object, or a character string pointing to a FASTA file, or a named integer vector of chromosome lengths
size
an integer scalar indicating the size of the pseudo-fragments
bam
a character string containing the path to a name-sorted BAM file
param
a pairParam object containing read extraction parameters
file
a character string specifying the path to an output index file
dedup
a logical scalar indicating whether marked duplicate reads should be removed
ichim
a logical scalar indicating whether invalid chimeras should be counted
chim.span
an integer scalar specifying the maximum span between a chimeric 3' end and a mate read
minq
an integer scalar specifying the minimum mapping quality for each read
output.dir
a character string specifying a directory for temporary files

Value

For segmentGenome, a GRanges object is produced containing the coordinates of the pseudo-fragments in the specified genome.

For prepPseudoPairs, an HDF5-formatted index file is produced at the specified location. A list of diagnostic vectors is also returned in the same format as that from preparePairs, without the same.id entry.

Details

DNase Hi-C involves random fragmentation with DNase instead of restriction enzymes. This is accommodated in diffHic by partitioning the genome into small pseudo-fragments, using segmentGenome. Reads are then assigned into these pseudo-fragments using prepPseudoPairs. The rest of the analysis pipeline can then be used in the same manner as that for standard Hi-C.
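The partitioning idea can be illustrated in base R. This is a minimal sketch, not the code used by segmentGenome: it assumes consecutive, non-overlapping segments of width size with a possibly shorter final segment, and the helper names segment_lengths and segment_index and the chromosome lengths are hypothetical. The boundaries produced by segmentGenome itself may differ.

```r
# Hypothetical helper, not part of diffHic: split each chromosome into
# consecutive segments of width `size` (the last one may be shorter).
segment_lengths <- function(chr.lengths, size = 500L) {
    out <- lapply(names(chr.lengths), function(chr) {
        len <- chr.lengths[[chr]]
        starts <- seq(1L, len, by = size)
        ends <- pmin(starts + size - 1L, len)
        data.frame(chr = chr, start = starts, end = ends)
    })
    do.call(rbind, out)
}

# Hypothetical helper: index, within its chromosome, of the segment
# that contains the 1-based position `pos`.
segment_index <- function(pos, size = 500L) {
    (pos - 1L) %/% size + 1L
}

chr.lengths <- c(chrA = 1200L, chrB = 500L)  # made-up chromosome lengths
segs <- segment_lengths(chr.lengths, size = 500L)
print(segs)

# Positions 1 and 500 share the first segment; 501 starts the second.
segment_index(c(1L, 500L, 501L, 1200L), size = 500L)
```

In the real workflow the segments come from segmentGenome and reads are assigned by prepPseudoPairs; the sketch only shows why a fixed size is enough to define both steps.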

The behavior of prepPseudoPairs is almost identical to that for preparePairs, if the latter were asked to assign reads into pseudo-fragments. However, for prepPseudoPairs, no reporting or removal of self-circles or dangling ends is performed, as these have no meaning for artificial fragments. Also, invalidity of chimeras is determined by checking whether the 3' end is more than chim.span away from the mate read, rather than checking for localization in different fragments.
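The span-based chimera check can be sketched as follows. This is an illustration under stated assumptions, not the logic inside prepPseudoPairs: the helper is_invalid_chimera is hypothetical, distance is taken as the absolute difference between two single coordinates, and strand is ignored. The package may measure the span differently.

```r
# Hypothetical helper, not part of diffHic. A chimeric read is flagged as
# invalid when its 3' segment maps to a different chromosome from the mate,
# or more than `chim.span` bases away from it (assumed distance definition).
is_invalid_chimera <- function(chr3, pos3, chr.mate, pos.mate, chim.span = 1000L) {
    chr3 != chr.mate | abs(pos3 - pos.mate) > chim.span
}

# Made-up coordinates for illustration.
near  <- is_invalid_chimera("chrA", 10500L, "chrA", 10100L)  # 400 bp apart
far   <- is_invalid_chimera("chrA", 10500L, "chrA", 15000L)  # 4500 bp apart
other <- is_invalid_chimera("chrA", 10500L, "chrB", 10100L)  # different chromosome
tight <- is_invalid_chimera("chrA", 10500L, "chrA", 10100L, chim.span = 20L)

c(near = near, far = far, other = other, tight = tight)
```

Lowering chim.span makes the check stricter: the same pair that passes at the default of 1000 is flagged when chim.span is 20.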

The size of the pseudo-fragments is set by the size argument of segmentGenome. Smaller sizes provide better resolution but increase computational work. The param$fragments field should contain the output from segmentGenome, rather than from cutGenome. See the cutGenome documentation for a warning about chromosome names.

Some loss of spatial resolution is inevitable when reads are summarized into pseudo-fragments. This is largely irrelevant, though, as counting across the interaction space will ultimately use much larger bins (usually at least 2 kbp).
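A small arithmetic check makes this concrete. The sketch assumes 500 bp pseudo-fragments and 2 kbp bins whose boundaries coincide with pseudo-fragment boundaries; the positions are made up, and the way diffHic actually defines bin boundaries may differ.

```r
size  <- 500L    # pseudo-fragment width (assumed)
width <- 2000L   # bin width, a multiple of `size` (assumed)

pos <- c(1L, 499L, 1999L, 2000L, 2001L, 7350L)  # made-up read positions

# Pseudo-fragment index of each position.
frag <- (pos - 1L) %/% size + 1L

# Bin index computed from the exact position ...
bin.from.pos <- (pos - 1L) %/% width + 1L

# ... and from the pseudo-fragment index alone.
bin.from.frag <- (frag - 1L) %/% (width %/% size) + 1L

# Under these assumptions the two agree, so nothing is lost at the bin level.
identical(bin.from.pos, bin.from.frag)
```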

See Also

preparePairs, cutGenome

Examples

library(diffHic)  # provides segmentGenome, prepPseudoPairs and pairParam
library(BSgenome.Ecoli.NCBI.20080805)
segmentGenome(BSgenome.Ecoli.NCBI.20080805)
segmentGenome(BSgenome.Ecoli.NCBI.20080805, size=1000)

# Pretend that this example is DNase Hi-C.
hic.file <- system.file("exdata", "hic_sort.bam", package="diffHic")
cuts <- readRDS(system.file("exdata", "cuts.rds", package="diffHic"))
pseudo <- segmentGenome(seqlengths(cuts), size=50) 
param <- pairParam(pseudo) 

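# Write the index to tmpf, varying duplicate removal, mapping quality
# and the chimera span in turn.
tmpf <- "gunk.h5"
prepPseudoPairs(hic.file, param, tmpf)
prepPseudoPairs(hic.file, param, tmpf, dedup=FALSE)
prepPseudoPairs(hic.file, param, tmpf, minq=50)
prepPseudoPairs(hic.file, param, tmpf, chim.span=20)
unlink(tmpf)  # remove the example index file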

