DiffBind (version 2.0.2)

dba.count: Count reads in binding site intervals

Description

Counts reads in binding site intervals. Files must be one of bam, bed and gzip-compressed bed. File suffixes must be ".bam", ".bed", or ".bed.gz" respectively.

Usage

dba.count(DBA, peaks, minOverlap=2, score=DBA_SCORE_TMM_MINUS_FULL, bLog=FALSE, fragmentSize=DBA$config$fragmentSize, summits, filter=0, bRemoveDuplicates=FALSE, bScaleControl=TRUE, mapQCth=DBA$config$mapQCth, filterFun=max, bCorPlot=DBA$config$bCorPlot, bUseSummarizeOverlaps=FALSE, readFormat=DBA_READS_DEFAULT, bParallel=DBA$config$RunParallel)

Arguments

DBA
DBA object
peaks
If GRanges, RangedData, dataframe, or matrix, this parameter contains the intervals to use for counting. If character string, it specifies a file containing the intervals to use (with the first three columns specifying chromosome, startpos, endpos).If missing or a mask, generates a consensus peakset using minOverlap parameter (after applying the mask if present). If NULL, the score, filter, and summits parameters are honored, updating the global binding matrix without re-counting in the cases of score and filter, and only counting after re-centering in the case of summits.
minOverlap
only include peaks in at least this many peaksets when generating consensus peakset (i.e. when peaks parameter is missing). If minOverlap is between zero and one, peak will be included from at least this proportion of peaksets.
score
which score to use in the binding affinity matrix. Note that all raw read counts are maintained for use by dba.analyze, regardless of how this is set. One of:
DBA_SCORE_READS
raw read count for interval using only reads from ChIP
DBA_SCORE_READS_FOLD
raw read count for interval from ChIP divided by read count for interval from control
DBA_SCORE_READS_MINUS
raw read count for interval from ChIP minus read count for interval from control
DBA_SCORE_RPKM
RPKM for interval using only reads from ChIP
DBA_SCORE_RPKM_FOLD
RPKM for interval from ChIP divided by RPKM for interval from control
DBA_SCORE_TMM_READS_FULL
TMM normalized (using edgeR), using ChIP read counts and Full Library size
DBA_SCORE_TMM_READS_EFFECTIVE
TMM normalized (using edgeR), using ChIP read counts and Effective Library size
DBA_SCORE_TMM_MINUS_FULL
TMM normalized (using edgeR), using ChIP read counts minus Control read counts and Full Library size
DBA_SCORE_TMM_MINUS_EFFECTIVE
TMM normalized (using edgeR), using ChIP read counts minus Control read counts and Effective Library size
DBA_SCORE_TMM_READS_FULL_CPM
same as DBA_SCORE_TMM_READS_FULL, but reporrted in counts-per-million.
DBA_SCORE_TMM_READS_EFFECTIVE_CPM
same as DBA_SCORE_TMM_READS_EFFECTIVE, but reporrted in counts-per-million.
DBA_SCORE_TMM_MINUS_FULL_CPM
same as DBA_SCORE_TMM_MINUS_FULL, but reporrted in counts-per-million.
DBA_SCORE_TMM_MINUS_EFFECTIVE_CPM
Tsame as DBA_SCORE_TMM_MINUS_EFFECTIVE, but reporrted in counts-per-million.
DBA_SCORE_SUMMIT
summit height (maximum read pileup value)
DBA_SCORE_SUMMIT_ADJ
summit height (maximum read pileup value), normalized to relative library size
DBA_SCORE_SUMMIT_POS
summit position (location of maximum read pileup)
bLog
logical indicating whether log2 of score should be used (only applies to DBA_SCORE_RPKM_FOLD and DBA_SCORE_READS_FOLD).
fragmentSize
This value will be used as the length of the reads. Each read will be extended from its endpoint along the appropriate strand by this many bases. If set to zero, the read size indicated in the BAM/BED file will be used. fragmentSize may also be a vector of values, one for each ChIP sample plus one for each unique Control library.
summits
if present, summit heights (read pileup) and locations will be calculated for each peak. The values can retrieved using dba.peakset. The summits can also be used as a read score in the global binding matrix (see score).

If the value of summits is TRUE (or 0), the summits will be calculated but the peaksets will be unaffected. If the value is greater than zero, all consensus peaks will be re-centered around a consensus summit, with the value of summits indicating how many base pairs to include upstream and downstream of the summit (so all consensus peaks will be of the same width, namely 2 * summits).

Note that if summits is greater than zero, the counting procedure will take twice as long, and bUseSummarizeOverlaps must be FALSE.

filter
value to use for filtering intervals with low read counts. Only intervals where at least one sample has a score at least this high will be included. If peaks is NULL, will remove sites from existing DBA object without recounting. If filter is a vector of values, dba.count will return a vector of the same length, indicating how many intervals will be retained for each specified filter level.
bRemoveDuplicates
logical indicating if duplicate reads (ones that map to exactly the same genomic position) should be removed. If TRUE, any location where multiple reads map will be counted as a single read. Note that if bLowMem is set, duplicates needs to have been already marked in all of the BAM files. The built-in counting code may not correctly handle certain cases when the bRemoveDuplicates paramter is set to TRUE. These cases include paried-end data and datasets where the read length may differ within a single BAM file. In these cases, see the bUseSummarizeOverlaps parameter.
bScaleControl
logical indicating if the Control reads should be scaled based on relative library sizes. If TRUE, and there are more reads in the Control library than in the ChIP library, the number of Control reads for each peak will be multiplied by a scaling factor determined by dividing the total number of reads in the ChIP library by the total number of reads in the Control library. If this value is not an integer, the number of Control reads for each peak will be the next highest integer.
mapQCth
for filtering by mapping quality (mapqc). Only alignments with mappig scores of at least this value will be included. Only applicable for bam files when bUseSummarizeOverlaps=FALSE (setting DBA$config$scanbamparam appropriately to filter on quality scores when using summarizeOverlaps.)
filterFun
function that will be invoked for each interval with a vector of scores for each sample. Returns a score that will be evaluated against the filter value (only intervals with a score at least as high as filter will be kept). Default is max, indicating that at least one sample should have a score of at least filter; other useful values include sum (indicating that all the scores added together should be at least filter) and mean (setting a minimum mean normalized count level). Users can supply their own function as well.
bCorPlot
logical indicating whether to plot a correlation heatmap for the counted data
bUseSummarizeOverlaps
logical indicating that summarizeOverlaps should be used for counting instead of the built-in counting code. This option is slower but uses the more standard counting function. If TRUE, all read files must be BAM (.bam extension), with associated index files (.bam.bai extension). The insertLength parameter must absent.

See notes for when the bRemoveDuplicates parameter is set to TRUE, where the built-in counting code may not correctly handle certain cases and bUseSummarizeOverlaps should be set to TRUE.

Five additional parameters for summarizeOverlaps may be specified in DBA$config:

DBA$config$yieldSize
yieldSize indicating how many reads to process at one time; default is 5000000. The lower this value, the less memory will be used, but the more time it will take to complete the count operation.
DBA$config$intersectMode
mode indicating which overlap algorithm to use; default is "IntersectionNotEmpty"
DBA$config$singleEnd
logical indicating if reads are single end; default is TRUE
DBA$config$fragments
logical indicating how unmatched reads are counted; default is FALSE
DBA$config$scanbamparam
ScanBamParam object to pass to summarizeOverlaps. If present, bRemoveDuplicates is ignored.

readFormat
Specify the file type of the read files, over-riding the file extension. Possible values:
DBA_READS_DEFAULT
use file extension (.bam, .bed, .bed.gz) to determine file type
DBA_READS_BAM
assume the file type is BAM, regardless of the file extension
Note that if readFormat is anything other than DBA_READS_DEFAULT, all the read files must be of the same file type.
bParallel
if TRUE, use multicore to get counts for each read file in parallel

Value

DBA object with binding affinity matrix based on read count scores.

See Also

dba.analyze

Examples

Run this code
# These won't run unless you have the reads available in a BAM or BED file
data(tamoxifen_peaks)
## Not run: tamoxifen <- dba.count(tamoxifen)


# Count using a peakset made up of only peaks in all responsive MCF7 replicates
data(tamoxifen_peaks)
mcf7Common <- dba.overlap(tamoxifen,tamoxifen$masks$MCF7&tamoxifen$masks$Responsive)
## Not run: tamoxifen <- dba.count(tamoxifen,peaks=mcf7Common$inAll)
tamoxifen

#First make consensus peaksets from each set of replicates, 
#then derive master consensus set for counting from those
data(tamoxifen_peaks)
tamoxifen <- dba.peakset(tamoxifen,consensus = -DBA_REPLICATE)
## Not run: tamoxifen <- dba.count(tamoxifen, peaks=tamoxifen$masks$Consensus)
tamoxifen

# Change binding affinity scores
data(tamoxifen_counts)
tamoxifen <- dba.count(tamoxifen,peaks=NULL,score=DBA_SCORE_READS)
dba.peakset(tamoxifen, bRetrieve=TRUE)
tamoxifen <- dba.count(tamoxifen,peaks=NULL,score=DBA_SCORE_RPKM_FOLD)
dba.peakset(tamoxifen, bRetrieve=TRUE)
tamoxifen <- dba.count(tamoxifen,peaks=NULL,score=DBA_SCORE_TMM_MINUS_FULL)
dba.peakset(tamoxifen, bRetrieve=TRUE)

# Plot effect of a range of filter values and then apply filter 
data(tamoxifen_counts)
rate.max <- dba.count(tamoxifen, peaks=NULL, filter=0:250)
rate.sum <- dba.count(tamoxifen, peaks=NULL, filter=0:250,filterFun=sum)
plot(0:250,rate.max/rate.max[1],type='l',xlab="Filter Value",ylab="Proportion Retained Sites")
lines(0:250,rate.sum/rate.sum[1],col=2)
tamoxifen <- dba.count(tamoxifen,peaks=NULL,filter=125,filterFun=sum)
tamoxifen

Run the code above in your browser using DataCamp Workspace