RUVs-methods: Remove Unwanted Variation Using Replicate/Negative Control Samples

Description

This function implements the RUVs method of Risso et al. (2014).

Usage

RUVs(x, cIdx, k, scIdx, round=TRUE, epsilon=1, tolerance=1e-8, isLog=FALSE)

Arguments

x

Either a genes-by-samples numeric matrix or a SeqExpressionSet object containing the read counts.

cIdx

A character, logical, or numeric vector indicating the subset of genes to be used as negative controls in the estimation of the factors of unwanted variation.

k

The number of factors of unwanted variation to be estimated from the data.

scIdx

A numeric matrix specifying the replicate samples for which to compute the count differences used to estimate the factors of unwanted variation (see details).

round

If TRUE, the normalized measures are rounded to form pseudo-counts.

epsilon

A small constant (usually no larger than one) to be added to the counts prior to the log transformation to avoid problems with log(0).

tolerance

Tolerance in the selection of the number of positive singular values, i.e., a singular value must be larger than tolerance to be considered positive.

isLog

Set to TRUE if the input matrix is already log-transformed. Ignored if x is a SeqExpressionSet.

Methods

signature(x = "matrix", cIdx = "ANY", k = "numeric", scIdx = "matrix")

It returns a list with

A samples-by-factors matrix with the estimated factors of unwanted variation (W).
The genes-by-samples matrix of normalized expression measures (possibly rounded) obtained by removing the factors of unwanted variation from the original read counts (normalizedCounts).

signature(x = "SeqExpressionSet", cIdx = "character", k="numeric", scIdx = "matrix")

It returns a SeqExpressionSet with

The normalized counts in the normalizedCounts slot.
The estimated factors of unwanted variation as additional columns of the phenoData slot.

Details

The RUVs procedure performs factor analysis on a matrix of count differences for replicate/negative control samples, for which the biological covariates of interest are constant. Each row of scIdx should correspond to a set of replicate samples. The number of columns is the size of the largest set of replicates; rows for smaller sets are padded with -1 values.

For example, if the sets of replicate samples are (1,11,21),(2,3),(4,5),(6,7,8), then scIdx should be

    1 11 21
    2  3 -1
    4  5 -1
    6  7  8
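A minimal base-R sketch of how such a padded matrix could be assembled from a list of replicate sets. The helper name pad_replicate_sets is hypothetical and is not part of RUVSeq; it only illustrates the padding rule described above.

```r
# Hypothetical helper (not part of RUVSeq): pad each replicate set with -1
# so that every row has the length of the largest set.
pad_replicate_sets <- function(sets) {
  width <- max(lengths(sets))
  padded <- lapply(sets, function(s) c(s, rep(-1, width - length(s))))
  matrix(unlist(padded), nrow = length(sets), byrow = TRUE)
}

# The replicate sets from the example above, given as sample (column) indices.
sets <- list(c(1, 11, 21), c(2, 3), c(4, 5), c(6, 7, 8))
scIdx <- pad_replicate_sets(sets)
print(scIdx)
```

The resulting matrix has one row per replicate set and can be passed as the scIdx argument.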

References

D. Risso, J. Ngai, T. P. Speed, and S. Dudoit. Normalization of RNA-seq data using factor analysis of control genes or samples. Nature Biotechnology, 2014. (In press).

D. Risso, J. Ngai, T. P. Speed, and S. Dudoit. The role of spike-in standards in the normalization of RNA-Seq. In D. Nettleton and S. Datta, editors, Statistical Analysis of Next Generation Sequence Data. Springer, 2014. (In press).

Examples


library(RUVSeq)
library(zebrafishRNASeq)
data(zfGenes)

## run on a subset of genes for time reasons
## (real analyses should be performed on all genes)
genes <- rownames(zfGenes)[grep("^ENS", rownames(zfGenes))]
spikes <- rownames(zfGenes)[grep("^ERCC", rownames(zfGenes))]
set.seed(123)
idx <- c(sample(genes, 1000), spikes)
seq <- newSeqExpressionSet(as.matrix(zfGenes[idx,]))

# RUVs normalization
controls <- rownames(seq)
differences <- matrix(data=c(1:3, 4:6), byrow=TRUE, nrow=2)
seqRUVs <- RUVs(seq, controls, k=1, differences)

pData(seqRUVs)
head(normCounts(seqRUVs))
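The Details section describes RUVs as factor analysis on count differences between replicate samples. The following is a simplified base-R sketch of that idea on simulated data, not the package's implementation. It assumes that all genes serve as negative controls, that every replicate set has the same size, and it skips the tolerance check and the final rounding. All object names (counts, diffs, alpha, corrected) are illustrative.

```r
set.seed(1)

# Simulated counts: 50 genes x 6 samples; samples 1-3 and 4-6 are replicate sets.
counts <- matrix(rpois(50 * 6, lambda = 100), nrow = 50)
scIdx <- matrix(1:6, nrow = 2, byrow = TRUE)
k <- 1

# Work on the log scale with samples in rows (epsilon = 1 avoids log(0)).
Y <- t(log(counts + 1))

# Within each replicate set, subtract the set mean from every member.
# The biology of interest is constant within a set, so it cancels out.
diffs <- do.call(rbind, lapply(seq_len(nrow(scIdx)), function(i) {
  members <- scIdx[i, scIdx[i, ] > 0]
  sweep(Y[members, , drop = FALSE], 2, colMeans(Y[members, , drop = FALSE]))
}))

# Factor analysis via SVD of the differences gives gene-level loadings (k x genes).
sv <- svd(diffs, nu = 0, nv = k)
alpha <- diag(sv$d[seq_len(k)], nrow = k) %*% t(sv$v[, seq_len(k), drop = FALSE])

# Regress the log counts on the loadings to estimate W (samples x k),
# then remove the estimated unwanted variation.
W <- Y %*% t(alpha) %*% solve(alpha %*% t(alpha))
corrected <- Y - W %*% alpha
```

Here W plays the role of the samples-by-factors matrix of unwanted variation, and exp(t(corrected)) would give genes-by-samples normalized measures on the count scale, up to the epsilon offset.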
