
csaw (version 1.6.1)

normOffsets: Normalize counts between libraries

Description

Calculate normalization factors or offsets using count data from multiple libraries.

Usage

"normOffsets"(object, lib.sizes=NULL, type=c("scaling", "loess"), weighted=FALSE, ...)

Arguments

object
a matrix of integer counts with one column per library
lib.sizes
a numeric vector specifying the total number of reads per library
type
a character string indicating what type of normalization is to be performed
weighted
a logical scalar indicating whether precision weights should be used for TMM normalization
...
other arguments to be passed to calcNormFactors for type="scaling", or loessFit for type="loess"

Value

For type="scaling", a numeric vector containing the relative normalization factors for each library.

For type="loess", a numeric matrix of the same dimensions as counts, containing the log-based offsets for use in GLM fitting.

Details

If type="scaling", this function provides a convenience wrapper for the calcNormFactors function in the edgeR package. Specifically, it uses the trimmed mean of M-values (TMM) method to perform normalization. Precision weighting is turned off by default so as to avoid upweighting high-abundance regions. These are more likely to be bound and thus more likely to be differentially bound. Assigning excessive weight to such regions will defeat the purpose of trimming when normalizing the coverage of background regions.
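
As a rough illustration of this wrapper relationship (a sketch for comparison only, not a verbatim reproduction of csaw's internal call), the default scaling factors should agree closely with calling edgeR directly with precision weighting disabled:

library(edgeR)
counts <- matrix(rnbinom(4000, mu=10, size=20), ncol=4)
normOffsets(counts)   # TMM factors via the csaw wrapper
calcNormFactors(counts, method="TMM", doWeighting=FALSE)   # direct edgeR call for comparison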

If type="loess", this function performs non-linear normalization similar to the fast loess algorithm in normalizeCyclicLoess. For each sample, a lowess curve is fitted to the log-counts against the log-average count. The fitted value for each bin pair is used as the generalized linear model offset for that sample. The use of the average count provides more stability than the average log-count when low counts are present for differentially bound regions.
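
For illustration only, the offset matrix can be passed to edgeR's GLM fitting functions; in this sketch the intercept-only design matrix and the dispersion value are arbitrary placeholders, not recommendations:

library(edgeR)
counts <- matrix(rnbinom(2000, mu=10, size=20), ncol=2)
offsets <- normOffsets(counts, type="loess")
design <- matrix(1, ncol(counts), 1)   # intercept-only design, purely for illustration
fit <- glmFit(DGEList(counts), design=design, offset=offsets, dispersion=0.05)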

If lib.sizes is not specified, a warning is issued and the column sums of counts are used instead. Note that the same lib.sizes should be used throughout the analysis if normOffsets is called multiple times on the same libraries, e.g., with different bin or window sizes or after different filtering steps. This ensures that the normalization factors or offsets are comparable between calls.
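
For example, a small sketch in which simulated matrices stand in for window- and bin-level counts from the same libraries, with one set of library sizes reused in both calls:

win.counts <- matrix(rnbinom(2000, mu=5, size=20), ncol=2)   # stand-in for window-level counts
bin.counts <- matrix(rnbinom(200, mu=50, size=20), ncol=2)   # stand-in for bin-level counts
lib.sizes <- c(1e6, 1.2e6)                                   # same totals supplied to every call
normOffsets(win.counts, lib.sizes=lib.sizes)
normOffsets(bin.counts, lib.sizes=lib.sizes)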

References

Robinson MD, Oshlack A (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology 11, R25.

Ballman KV, Grill DE, Oberg AL, Therneau TM (2004). Faster cyclic loess: normalizing RNA arrays via linear models. Bioinformatics 20, 2778-86.

See Also

calcNormFactors, loessFit, normalizeCyclicLoess

Examples

# A trivial example
counts <- matrix(rnbinom(400, mu=10, size=20), ncol=4)
normOffsets(counts)
normOffsets(counts, lib.sizes=rep(400, 4))

# Adding undersampling
n <- 1000L
mu1 <- rep(10, n)
mu2 <- mu1
mu2[1:100] <- 100
mu2 <- mu2/sum(mu2)*sum(mu1)
counts <- cbind(rnbinom(n, mu=mu1, size=20), rnbinom(n, mu=mu2, size=20))
actual.lib.size <- rep(sum(mu1), 2)
normOffsets(counts, lib.sizes=actual.lib.size)
normOffsets(counts, logratioTrim=0.4, lib.sizes=actual.lib.size)
normOffsets(counts, sumTrim=0.3, lib.sizes=actual.lib.size)

# With and without weighting, for high-abundance spike-ins.
n <- 100000
blah <- matrix(rnbinom(2*n, mu=10, size=20), ncol=2)
tospike <- 10000
blah[1:tospike,1] <- rnbinom(tospike, mu=1000, size=20)
blah[1:tospike,2] <- rnbinom(tospike, mu=2000, size=20)
full.lib.size <- colSums(blah)

normOffsets(blah, weighted=TRUE, lib.sizes=full.lib.size)
normOffsets(blah, lib.sizes=full.lib.size)
# Reference scaling factors, based on the non-spiked rows only.
true.value <- colSums(blah[(tospike+1):n,])/colSums(blah)
true.value <- true.value/exp(mean(log(true.value)))
true.value

# Using loess-based normalization, instead.
offsets <- normOffsets(counts, type="loess", lib.sizes=actual.lib.size)
head(offsets)
offsets <- normOffsets(counts, type="loess", span=0.4, lib.sizes=actual.lib.size)
offsets <- normOffsets(counts, type="loess", iterations=1, lib.sizes=actual.lib.size)
