normalize: Normalization of ChIP-seq and other count data

Description

This function implements several methods for between-sample normalization of count data. Although these methods were developed for RNA-seq data, they are also useful for normalizing ChIP-seq data after reads have been counted within regions or bins. Some methods may also be applied to count data that have already undergone within-sample normalization (e.g. TPM or RPKM values).

Usage

## S3 method for class 'ChIPseqSet':
normalize(object, method, isLogScale = FALSE, trim = 0.3, totalCounts)
## S3 method for class 'ExpressionSet':
normalize(object, method, isLogScale = FALSE, trim = 0.3, totalCounts)

Arguments

object

An object of class ChIPseqSet or ExpressionSet that contains the raw data.

method

Normalization method, one of "scale", "scaleMedianRegion", "quantile" or "tmm".

isLogScale

Indicates whether the raw data in object are already log-transformed. Default value is FALSE. Log-transformed data are returned on the log scale; data that are not log-transformed remain on their original scale.

trim

Only used if method is "tmm". Indicates the total fraction of data points that is trimmed before calculating the mean, split equally between the smallest and the largest values. Default value is 0.3.

totalCounts

Only used if method is "scale". A vector giving the total number of reads for each sample. The vector's length must equal the number of samples in object. By default, the totals are the sums over all features for each sample (i.e. the column sums of object).

Value

An object of the same class as the input object with the normalized data.

Details

The following normalization methods are implemented:

scale

Samples are scaled by a factor such that all samples have the same number $N$ of reads after normalization, where $N$ is the median number of reads observed across all samples. If the argument totalCounts is missing, the total numbers of reads are calculated from the given data. Otherwise, the values in totalCounts are used.

scaleMedianRegion

The scaling factor $s_j$ for the $j$-th sample is defined as $$s_j = \operatorname{median}_i \frac{k_{ij}}{\left(\prod_{v=1}^m k_{iv}\right)^{1/m}},$$ where $k_{ij}$ is the value of region $i$ in sample $j$ and $m$ is the number of samples, so that the denominator is the geometric mean of region $i$ across all samples. See Anders and Huber (2010) for details.

quantile

Quantile normalization is applied to the ChIP-seq values such that each sample has the same empirical cumulative distribution function (CDF) after normalization.

tmm

The trimmed mean of M-values (TMM) normalization was proposed by Robinson and Oshlack (2010). Here, the logarithm of the scaling factor for sample $j$ is calculated as the trimmed mean over all regions $i$ of $$\log(k_{ij}/g_{i}),$$ where $g_{i}$ denotes the geometric mean of region $i$ across all samples. Argument trim is set to 0.3 by default, so that the smallest 15% and the largest 15% of the log ratios are trimmed before calculating the mean.
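The scaling factors described for "scale", "scaleMedianRegion" and "tmm" can be sketched in base R as follows. This is only an illustration of the formulas in this section, not the package's implementation: the toy matrix k and all variable names are made up, all values are assumed to be strictly positive, and dividing each sample by its factor for the last two methods is an assumption about how the factors are applied.

```r
# Toy count matrix (made-up values): rows are regions, columns are samples.
k <- matrix(c(10, 20, 30, 40,  50,  60,  70,  80,
              25, 38, 64, 75, 110, 115, 150, 163),
            nrow = 8,
            dimnames = list(paste0("region", 1:8), c("sample1", "sample2")))

# "scale": rescale every sample to N, the median of the per-sample totals.
totals <- colSums(k)
N <- median(totals)
scaled <- sweep(k, 2, N / totals, `*`)

# Geometric mean of each region across samples (requires positive values).
geoMean <- exp(rowMeans(log(k)))

# "scaleMedianRegion": median over regions of the ratio to the geometric mean.
# k / geoMean divides row i by geoMean[i] because R recycles column-wise.
s <- apply(k / geoMean, 2, median)
medianScaled <- sweep(k, 2, s, `/`)  # assumption: counts are divided by s

# "tmm" as described in this section: the log scaling factor is the trimmed
# mean of the log ratios. trim = 0.3 is the total fraction, whereas base R's
# mean() expects the fraction per tail, hence trim / 2.
trim <- 0.3
logFactor <- apply(log(k / geoMean), 2, mean, trim = trim / 2)
tmmScaled <- sweep(k, 2, exp(logFactor), `/`)  # assumption: divide by factor
```

After these steps, every column of scaled sums to N, the per-sample median ratio of medianScaled to geoMean is 1, and the trimmed mean of the log ratios of tmmScaled is 0.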

References

Anders and Huber. Differential expression analysis for sequence count data. Genome Biol. 2010;11(10):R106.

Robinson and Oshlack. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25.

Examples

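# Assumes the package that provides ChIPseqSet(), chipVals() and normalize()
# is attached, together with GenomicRanges for GRanges() and IRanges().
set.seed(1234)

# Simulate counts for 20 features in two samples with different depths.
chip <- matrix(c(rpois(20, lambda=10), rpois(20, lambda=20)), nrow=20,
               dimnames=list(paste("feature", 1:20, sep=""), c("sample1", "sample2")))
rowRanges <- GRanges(IRanges(start=1:20, end=1:20),
                     seqnames=rep("1", 20))
names(rowRanges) <- rownames(chip)
cSet <- ChIPseqSet(chipVals=chip, rowRanges=rowRanges)

# TMM: compare the trimmed means of the log values of both samples.
tmmSet <- normalize(cSet, method="tmm", trim=0.3)
mean(log(chipVals(tmmSet))[, 1], trim=0.3) -
    mean(log(chipVals(tmmSet))[, 2], trim=0.3) < 0.01

# Quantile: both samples should have identical quantiles afterwards.
quantSet <- normalize(cSet, method="quantile")
all(quantile(chipVals(quantSet)[, 1]) == quantile(chipVals(quantSet)[, 2]))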
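
To see why the quantiles of both samples coincide after quantile normalization, the idea can be sketched in base R without the package. The helper quantileNormalize() below is hypothetical and written only for this illustration; it breaks ties by order of appearance, which may differ from what normalize() does internally.

```r
# Hypothetical helper: minimal quantile normalization of a numeric matrix.
# Each column is replaced by the row-wise means of the column-sorted data,
# assigned back according to the rank of each value within its column.
quantileNormalize <- function(x) {
  sortedMeans <- rowMeans(apply(x, 2, sort))
  normalized <- apply(x, 2, function(col) {
    sortedMeans[rank(col, ties.method = "first")]
  })
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Simulated counts for 20 features in two samples (made-up data).
set.seed(1)
x <- matrix(rpois(40, lambda = 15), nrow = 20,
            dimnames = list(NULL, c("sample1", "sample2")))
qn <- quantileNormalize(x)

# Both columns now hold the same set of values, only in a different order,
# so every quantile agrees between the two samples.
sameValues <- isTRUE(all.equal(sort(qn[, 1]), sort(qn[, 2])))
sameQuantiles <- all(quantile(qn[, 1]) == quantile(qn[, 2]))
```

Because every sample receives exactly the same values, the empirical distributions are identical by construction, which is the property the quantile check in the example tests.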
