selectBinSize: Select optimal bin size based on Shimazaki formula

Description

The function iteratively estimates the cost of increasing bin size within a defined range and finally selects the bin size with minimum cost.

Usage

selectBinSize(alignGR, minBinSize, maxBinSize = 1000, 
	increment = 5, getFullResults = FALSE)

Arguments

alignGR

GRanges object of alignments on a single chromosome

minBinSize

Minimum bin size to start with (Default: 5 * read length)

maxBinSize

Maximum bin size to end with (Default: 1000).

increment

Number of bases to increment the bin size in searching for the optimal bin size within the defined range (Default: 5).

getFullResults

Binary indicator. If TRUE, the optimal bin size (with the minimum cost), minimum cost, and all of the bin sizes considered and their costs are returned. If FALSE, only the optimal bin size is returned.

Value

When getFullResults is TRUE, return a list containing:
bestBinSizethe optimal bin size (with the minimum cost)
minCostscost of the optimal bin size
binSizesall of the bin sizes considered
costsall of the costs
When getFullResults is FALSE, only the optimal bin size (bestBinSize) is returned.

Details

Based on the preprocessed alignments for a chromosome, RIPSeeker divides the chromosome into bins of equal size $b$ and compute the number of reads that $b$ needs to be determined either empirically (e.g., based on the gel-selected length of the RNA fragment) or computationally. If the bin size is too small, the read counts fluctuates greatly, making it difficult to discern the underlying read count distribution. Additionally, input size to HMM increases as bin size decreases. A very small bin size results in a very long Markov chain of read counts to model, making the computation inefficient. On the other hand, if a bin size is too large, resolution becomes poor. Consequently, one cannot detect the local RIP region with subtle but intrinsic difference from the background, and the RIP regions tend to be too wide for designing specific primer for validation.

Intuitively, selecting an appropriate bin size for each chromosome is metaphorically equivalent to choosing an optimal intervals for building a histogram (Song, 2011). Here we implement the algorithm developed by Shimazaki and Shinomoto (2007), which is based on the goodness of the fit of the time histogram to estimate the rate of neural response of an animal to certain stimuli in a spike-in experiment. This approach has been successfully applied in a recently developed ChIP-seq program (Song and Smith, 2011). Algorithm 1 describes the pseudocode adapted from Shimazaki and Shinomoto (2007) that iteratively estimates the cost $C$ of increasing bin size $b$ within a defined range is outlined as follows.

For b = minBinSize to maxBinSize; do

1. Divide chromosome sequence into $N$ bins of width $b$. 2. Count number of read counts $x_i$ that enter the $i$'th bin. 3. Compute: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N}x_{i}$ and $v = \frac{1}{N}\sum_{i=1}^{N}(x_{i} - \bar{x})^{2}$. 4. Compute: $C(b) = \frac{2\bar{x}-v}{b^{2}}$ End For Choose $b$ that minimize $C(b)$

References

Hideaki Shimazaki and Shigeru Shinomoto. A method for selecting the bin size of a time histogram. Neural computation, 19(6):1503-1527, June 2007. Qiang Song and Andrew D. Smith. Identifying dispersed epigenomic domains from ChIP-Seq data. Bioinformatics (Oxford, England), 27(6):870-871, March 2011.

Examples

Run this code

# Retrieve system files
extdata.dir <- system.file("extdata", package="RIPSeeker") 

bamFiles <- list.files(extdata.dir, ".bam$", recursive=TRUE, full.names=TRUE)

bamFiles <- grep("PRC2", bamFiles, value=TRUE)

alignGal <- getAlignGal(bamFiles[1], reverseComplement=TRUE, genomeBuild="mm9")

alignGR <- as(alignGal, "GRanges")

alignGRList <- GRangesList(as.list(split(alignGR, seqnames(alignGR))))

minBinSize <- 200

maxBinSize <- 1200

gr <- alignGRList$chrX

b <- selectBinSize(gr, minBinSize, maxBinSize, increment=100, getFullResults=TRUE)

plot(b$binSizes, b$costs)

chrname <- as.character(runValue(seqnames(gr)))

chrlen <- seqlengths(gr)[chrname]

legend("topright", box.lty=0,
		
		sprintf("%s: 1-%d;
Total mapped reads: %d;
Optimal bin size = %d bp",
		
		chrname, chrlen, length(gr), b$bestBinSize))

Run the code above in your browser using DataLab