Count reads in binding site intervals
Counts reads in binding site intervals. Files must be one of bam, bed and gzip-compressed bed. File suffixes must be ".bam", ".bed", or ".bed.gz" respectively.
dba.count(DBA, peaks, minOverlap=2, score=DBA_SCORE_TMM_MINUS_FULL, bLog=FALSE, fragmentSize=DBA$config$fragmentSize, summits, filter=0, bRemoveDuplicates=FALSE, bScaleControl=TRUE, mapQCth=DBA$config$mapQCth, filterFun=max, bCorPlot=DBA$config$bCorPlot, bUseSummarizeOverlaps=FALSE, readFormat=DBA_READS_DEFAULT, bParallel=DBA$config$RunParallel)
- DBA object
If GRanges, RangedData, dataframe, or matrix, this parameter contains the intervals to use for counting. If character string, it specifies a file containing the intervals to use (with the first three columns specifying chromosome, startpos, endpos).If missing or a mask, generates a consensus peakset using minOverlap parameter (after applying the mask if present). If NULL, the
summitsparameters are honored, updating the global binding matrix without re-counting in the cases of
filter, and only counting after re-centering in the case of
- only include peaks in at least this many peaksets when generating consensus peakset (i.e. when peaks parameter is missing). If minOverlap is between zero and one, peak will be included from at least this proportion of peaksets.
which score to use in the binding affinity matrix. Note that all raw read counts are maintained for use by
dba.analyze, regardless of how this is set. One of:
DBA_SCORE_READS raw read count for interval using only reads from ChIP DBA_SCORE_READS_FOLD raw read count for interval from ChIP divided by read count for interval from control DBA_SCORE_READS_MINUS raw read count for interval from ChIP minus read count for interval from control DBA_SCORE_RPKM RPKM for interval using only reads from ChIP DBA_SCORE_RPKM_FOLD RPKM for interval from ChIP divided by RPKM for interval from control DBA_SCORE_TMM_READS_FULL TMM normalized (using edgeR), using ChIP read counts and Full Library size DBA_SCORE_TMM_READS_EFFECTIVE TMM normalized (using edgeR), using ChIP read counts and Effective Library size DBA_SCORE_TMM_MINUS_FULL TMM normalized (using edgeR), using ChIP read counts minus Control read counts and Full Library size DBA_SCORE_TMM_MINUS_EFFECTIVE TMM normalized (using edgeR), using ChIP read counts minus Control read counts and Effective Library size DBA_SCORE_TMM_READS_FULL_CPM same as
DBA_SCORE_TMM_READS_FULL, but reporrted in counts-per-million.
DBA_SCORE_TMM_READS_EFFECTIVE_CPM same as
DBA_SCORE_TMM_READS_EFFECTIVE, but reporrted in counts-per-million.
DBA_SCORE_TMM_MINUS_FULL_CPM same as
DBA_SCORE_TMM_MINUS_FULL, but reporrted in counts-per-million.
DBA_SCORE_TMM_MINUS_EFFECTIVE_CPM Tsame as
DBA_SCORE_TMM_MINUS_EFFECTIVE, but reporrted in counts-per-million.
DBA_SCORE_SUMMIT summit height (maximum read pileup value) DBA_SCORE_SUMMIT_ADJ summit height (maximum read pileup value), normalized to relative library size DBA_SCORE_SUMMIT_POS summit position (location of maximum read pileup)
- logical indicating whether log2 of score should be used (only applies to DBA_SCORE_RPKM_FOLD and DBA_SCORE_READS_FOLD).
This value will be used as the length of the reads.
Each read will be extended from its endpoint along the appropriate strand by this many bases.
If set to zero, the read size indicated in the BAM/BED file will be used.
fragmentSizemay also be a vector of values, one for each ChIP sample plus one for each unique Control library.
if present, summit heights (read pileup) and locations will be calculated for each peak. The values can retrieved using
dba.peakset. The summits can also be used as a read score in the global binding matrix (see
If the value of
0), the summits will be calculated but the peaksets will be unaffected. If the value is greater than zero, all consensus peaks will be re-centered around a consensus summit, with the value of
summitsindicating how many base pairs to include upstream and downstream of the summit (so all consensus peaks will be of the same width, namely
2 * summits).
Note that if
summitsis greater than zero, the counting procedure will take twice as long, and
value to use for filtering intervals with low read counts.
Only intervals where at least one sample has a
scoreat least this high will be included. If
peaksis NULL, will remove sites from existing DBA object without recounting. If filter is a vector of values,
dba.countwill return a vector of the same length, indicating how many intervals will be retained for each specified
logical indicating if duplicate reads (ones that map to exactly the same genomic position) should be removed. If TRUE, any location where multiple reads map will be counted as a single read. Note that if
bLowMemis set, duplicates needs to have been already marked in all of the BAM files. The built-in counting code may not correctly handle certain cases when the
bRemoveDuplicatesparamter is set to
TRUE. These cases include paried-end data and datasets where the read length may differ within a single BAM file. In these cases, see the
- logical indicating if the Control reads should be scaled based on relative library sizes. If TRUE, and there are more reads in the Control library than in the ChIP library, the number of Control reads for each peak will be multiplied by a scaling factor determined by dividing the total number of reads in the ChIP library by the total number of reads in the Control library. If this value is not an integer, the number of Control reads for each peak will be the next highest integer.
for filtering by mapping quality (mapqc). Only alignments with mappig scores of at least this value will be included. Only applicable for bam files when
DBA$config$scanbamparamappropriately to filter on quality scores when using
function that will be invoked for each interval with a vector of scores for each sample. Returns a score that will be evaluated against the
filtervalue (only intervals with a score at least as high as
filterwill be kept). Default is
max, indicating that at least one sample should have a score of at least
filter; other useful values include
sum(indicating that all the scores added together should be at least
mean(setting a minimum mean normalized count level). Users can supply their own function as well.
- logical indicating whether to plot a correlation heatmap for the counted data
logical indicating that
summarizeOverlapsshould be used for counting instead of the built-in counting code. This option is slower but uses the more standard counting function. If
TRUE, all read files must be BAM (.bam extension), with associated index files (.bam.bai extension). The
insertLengthparameter must absent.
See notes for when the
bRemoveDuplicatesparameter is set to
TRUE, where the built-in counting code may not correctly handle certain cases and
bUseSummarizeOverlapsshould be set to
Five additional parameters for
summarizeOverlapsmay be specified in
DBA$config$yieldSize yieldSize indicating how many reads to process at one time; default is 5000000. The lower this value, the less memory will be used, but the more time it will take to complete the count operation. DBA$config$intersectMode mode indicating which overlap algorithm to use; default is
DBA$config$singleEnd logical indicating if reads are single end; default is
DBA$config$fragments logical indicating how unmatched reads are counted; default is
ScanBamParamobject to pass to
summarizeOverlaps. If present,
Specify the file type of the read files, over-riding the file extension. Possible values:
Note that if
DBA_READS_DEFAULT use file extension (.bam, .bed, .bed.gz) to determine file type DBA_READS_BAM assume the file type is BAM, regardless of the file extension
readFormatis anything other than
DBA_READS_DEFAULT, all the read files must be of the same file type.
- if TRUE, use multicore to get counts for each read file in parallel
DBA object with binding affinity matrix based on read count scores.
# These won't run unless you have the reads available in a BAM or BED file data(tamoxifen_peaks) ## Not run: tamoxifen <- dba.count(tamoxifen) # Count using a peakset made up of only peaks in all responsive MCF7 replicates data(tamoxifen_peaks) mcf7Common <- dba.overlap(tamoxifen,tamoxifen$masks$MCF7&tamoxifen$masks$Responsive) ## Not run: tamoxifen <- dba.count(tamoxifen,peaks=mcf7Common$inAll) tamoxifen #First make consensus peaksets from each set of replicates, #then derive master consensus set for counting from those data(tamoxifen_peaks) tamoxifen <- dba.peakset(tamoxifen,consensus = -DBA_REPLICATE) ## Not run: tamoxifen <- dba.count(tamoxifen, peaks=tamoxifen$masks$Consensus) tamoxifen # Change binding affinity scores data(tamoxifen_counts) tamoxifen <- dba.count(tamoxifen,peaks=NULL,score=DBA_SCORE_READS) dba.peakset(tamoxifen, bRetrieve=TRUE) tamoxifen <- dba.count(tamoxifen,peaks=NULL,score=DBA_SCORE_RPKM_FOLD) dba.peakset(tamoxifen, bRetrieve=TRUE) tamoxifen <- dba.count(tamoxifen,peaks=NULL,score=DBA_SCORE_TMM_MINUS_FULL) dba.peakset(tamoxifen, bRetrieve=TRUE) # Plot effect of a range of filter values and then apply filter data(tamoxifen_counts) rate.max <- dba.count(tamoxifen, peaks=NULL, filter=0:250) rate.sum <- dba.count(tamoxifen, peaks=NULL, filter=0:250,filterFun=sum) plot(0:250,rate.max/rate.max,type='l',xlab="Filter Value",ylab="Proportion Retained Sites") lines(0:250,rate.sum/rate.sum,col=2) tamoxifen <- dba.count(tamoxifen,peaks=NULL,filter=125,filterFun=sum) tamoxifen