Count reads in binding site intervals

Counts reads in binding site intervals. Files must be one of bam, bed and gzip-compressed bed. File suffixes must be ".bam", ".bed", or ".bed.gz" respectively.

dba.count(DBA, peaks, minOverlap=2, score=DBA_SCORE_TMM_MINUS_FULL, bLog=FALSE, fragmentSize=DBA$config$fragmentSize, summits, filter=0, bRemoveDuplicates=FALSE, bScaleControl=TRUE, mapQCth=DBA$config$mapQCth, filterFun=max, bCorPlot=DBA$config$bCorPlot, bUseSummarizeOverlaps=FALSE, readFormat=DBA_READS_DEFAULT, bParallel=DBA$config$RunParallel)
DBA object
If GRanges, RangedData, dataframe, or matrix, this parameter contains the intervals to use for counting. If character string, it specifies a file containing the intervals to use (with the first three columns specifying chromosome, startpos, endpos).If missing or a mask, generates a consensus peakset using minOverlap parameter (after applying the mask if present). If NULL, the score, filter, and summits parameters are honored, updating the global binding matrix without re-counting in the cases of score and filter, and only counting after re-centering in the case of summits.
only include peaks in at least this many peaksets when generating consensus peakset (i.e. when peaks parameter is missing). If minOverlap is between zero and one, peak will be included from at least this proportion of peaksets.
which score to use in the binding affinity matrix. Note that all raw read counts are maintained for use by dba.analyze, regardless of how this is set. One of:
raw read count for interval using only reads from ChIP
raw read count for interval from ChIP divided by read count for interval from control
raw read count for interval from ChIP minus read count for interval from control
RPKM for interval using only reads from ChIP
RPKM for interval from ChIP divided by RPKM for interval from control
TMM normalized (using edgeR), using ChIP read counts and Full Library size
TMM normalized (using edgeR), using ChIP read counts and Effective Library size
TMM normalized (using edgeR), using ChIP read counts minus Control read counts and Full Library size
TMM normalized (using edgeR), using ChIP read counts minus Control read counts and Effective Library size
same as DBA_SCORE_TMM_READS_FULL, but reporrted in counts-per-million.
same as DBA_SCORE_TMM_READS_EFFECTIVE, but reporrted in counts-per-million.
same as DBA_SCORE_TMM_MINUS_FULL, but reporrted in counts-per-million.
Tsame as DBA_SCORE_TMM_MINUS_EFFECTIVE, but reporrted in counts-per-million.
summit height (maximum read pileup value)
summit height (maximum read pileup value), normalized to relative library size
summit position (location of maximum read pileup)
logical indicating whether log2 of score should be used (only applies to DBA_SCORE_RPKM_FOLD and DBA_SCORE_READS_FOLD).
This value will be used as the length of the reads. Each read will be extended from its endpoint along the appropriate strand by this many bases. If set to zero, the read size indicated in the BAM/BED file will be used. fragmentSize may also be a vector of values, one for each ChIP sample plus one for each unique Control library.
if present, summit heights (read pileup) and locations will be calculated for each peak. The values can retrieved using dba.peakset. The summits can also be used as a read score in the global binding matrix (see score).

If the value of summits is TRUE (or 0), the summits will be calculated but the peaksets will be unaffected. If the value is greater than zero, all consensus peaks will be re-centered around a consensus summit, with the value of summits indicating how many base pairs to include upstream and downstream of the summit (so all consensus peaks will be of the same width, namely 2 * summits).

Note that if summits is greater than zero, the counting procedure will take twice as long, and bUseSummarizeOverlaps must be FALSE.

value to use for filtering intervals with low read counts. Only intervals where at least one sample has a score at least this high will be included. If peaks is NULL, will remove sites from existing DBA object without recounting. If filter is a vector of values, dba.count will return a vector of the same length, indicating how many intervals will be retained for each specified filter level.
logical indicating if duplicate reads (ones that map to exactly the same genomic position) should be removed. If TRUE, any location where multiple reads map will be counted as a single read. Note that if bLowMem is set, duplicates needs to have been already marked in all of the BAM files. The built-in counting code may not correctly handle certain cases when the bRemoveDuplicates paramter is set to TRUE. These cases include paried-end data and datasets where the read length may differ within a single BAM file. In these cases, see the bUseSummarizeOverlaps parameter.
logical indicating if the Control reads should be scaled based on relative library sizes. If TRUE, and there are more reads in the Control library than in the ChIP library, the number of Control reads for each peak will be multiplied by a scaling factor determined by dividing the total number of reads in the ChIP library by the total number of reads in the Control library. If this value is not an integer, the number of Control reads for each peak will be the next highest integer.
for filtering by mapping quality (mapqc). Only alignments with mappig scores of at least this value will be included. Only applicable for bam files when bUseSummarizeOverlaps=FALSE (setting DBA$config$scanbamparam appropriately to filter on quality scores when using summarizeOverlaps.)
function that will be invoked for each interval with a vector of scores for each sample. Returns a score that will be evaluated against the filter value (only intervals with a score at least as high as filter will be kept). Default is max, indicating that at least one sample should have a score of at least filter; other useful values include sum (indicating that all the scores added together should be at least filter) and mean (setting a minimum mean normalized count level). Users can supply their own function as well.
logical indicating whether to plot a correlation heatmap for the counted data
logical indicating that summarizeOverlaps should be used for counting instead of the built-in counting code. This option is slower but uses the more standard counting function. If TRUE, all read files must be BAM (.bam extension), with associated index files (.bam.bai extension). The insertLength parameter must absent.

See notes for when the bRemoveDuplicates parameter is set to TRUE, where the built-in counting code may not correctly handle certain cases and bUseSummarizeOverlaps should be set to TRUE.

Five additional parameters for summarizeOverlaps may be specified in DBA$config:

yieldSize indicating how many reads to process at one time; default is 5000000. The lower this value, the less memory will be used, but the more time it will take to complete the count operation.
mode indicating which overlap algorithm to use; default is "IntersectionNotEmpty"
logical indicating if reads are single end; default is TRUE
logical indicating how unmatched reads are counted; default is FALSE
ScanBamParam object to pass to summarizeOverlaps. If present, bRemoveDuplicates is ignored.

Specify the file type of the read files, over-riding the file extension. Possible values:
use file extension (.bam, .bed, .bed.gz) to determine file type
assume the file type is BAM, regardless of the file extension
Note that if readFormat is anything other than DBA_READS_DEFAULT, all the read files must be of the same file type.
if TRUE, use multicore to get counts for each read file in parallel

DBA object with binding affinity matrix based on read count scores.

See Also


  • dba.count
# These won't run unless you have the reads available in a BAM or BED file
## Not run: tamoxifen <- dba.count(tamoxifen)

# Count using a peakset made up of only peaks in all responsive MCF7 replicates
mcf7Common <- dba.overlap(tamoxifen,tamoxifen$masks$MCF7&tamoxifen$masks$Responsive)
## Not run: tamoxifen <- dba.count(tamoxifen,peaks=mcf7Common$inAll)

#First make consensus peaksets from each set of replicates, 
#then derive master consensus set for counting from those
tamoxifen <- dba.peakset(tamoxifen,consensus = -DBA_REPLICATE)
## Not run: tamoxifen <- dba.count(tamoxifen, peaks=tamoxifen$masks$Consensus)

# Change binding affinity scores
tamoxifen <- dba.count(tamoxifen,peaks=NULL,score=DBA_SCORE_READS)
dba.peakset(tamoxifen, bRetrieve=TRUE)
tamoxifen <- dba.count(tamoxifen,peaks=NULL,score=DBA_SCORE_RPKM_FOLD)
dba.peakset(tamoxifen, bRetrieve=TRUE)
tamoxifen <- dba.count(tamoxifen,peaks=NULL,score=DBA_SCORE_TMM_MINUS_FULL)
dba.peakset(tamoxifen, bRetrieve=TRUE)

# Plot effect of a range of filter values and then apply filter 
rate.max <- dba.count(tamoxifen, peaks=NULL, filter=0:250)
rate.sum <- dba.count(tamoxifen, peaks=NULL, filter=0:250,filterFun=sum)
plot(0:250,rate.max/rate.max[1],type='l',xlab="Filter Value",ylab="Proportion Retained Sites")
tamoxifen <- dba.count(tamoxifen,peaks=NULL,filter=125,filterFun=sum)
Documentation reproduced from package DiffBind, version 2.0.2, License: Artistic-2.0

Community examples

Looks like there are no examples yet.