normalizeGeneLength: Normalize for gene length

Description

Normalize for gene length using the output of transcript abundance estimators

Usage

normalizeGeneLength(object, files, level = c("tx", "gene"),
  geneIdCol = "gene_id", lengthCol = "length", abundanceCol = "FPKM",
  dropGenes = FALSE, importer, ...)

Arguments

object

the DESeqDataSet, before calling DESeq

files

a character vector specifying the filenames of output files containing either transcript abundance estimates with transcript length, or average transcript length information per gene.

level

either "tx" or "gene"

geneIdCol

the name of the column of the files specifying the gene id. This should line up with the rownames(object). The information in the files will be re-ordered to line up with the rownames of the object. See dropGenes for more details.

lengthCol

the name of the column of files specifying the length of the feature, either transcript for level="tx" or the gene for level="gene".

abundanceCol

only needed if level="tx", the name of the column specifying the abundance estimate of the transcript.

dropGenes

whether to drop genes from the object, as labelled by rownames(object), which are not present in the geneIdCol of the files. Defaults to FALSE and gives an error upon finding rownames of the object not present in the geneIdCol of the files. The function will reorder the matching rows of the files to match the rownames of the object.

importer

a function to read the files. fread from the data.table package is quite fast, but other options include read.table and read.csv. One can pre-test with importer(files[1]).

...

further arguments passed to importer

Value

a DESeqDataSet with normalizationFactors accounting for average transcript length and library size

Details

This is a prototype function for importing information about changes in the average transcript length for each gene. The use of this function is only for testing purposes.

The function simply imports or calculates average transcript length for each gene and each sample from external files, and provides this matrix to the normMatrix argument of estimateSizeFactors. Here, average transcript length means a weighted average of the lengths of a gene's transcripts, with the transcript abundances as weights. RSEM includes such a column in its genes.results files, but an estimate of average transcript length can be obtained from any software which outputs a file with a row for each transcript, specifying: transcript length, estimate of transcript abundance, and the gene to which the transcript belongs.
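The weighted average described above can be sketched in base R. This is only an illustration, not the package's internal code: the table tx, the helper avg_tx_length and all values are hypothetical, and the column names simply mirror the defaults shown under Usage.

```r
# Hypothetical transcript-level table for one sample:
# three transcripts for geneA and two for geneB.
tx <- data.frame(
  gene_id = c("geneA", "geneA", "geneA", "geneB", "geneB"),
  length  = c(1000, 2000, 3000, 500, 1500),
  FPKM    = c(1, 1, 2, 3, 1)
)

# Abundance-weighted average transcript length per gene:
# sum(length * abundance) / sum(abundance) within each gene.
avg_tx_length <- function(tab, geneIdCol = "gene_id", lengthCol = "length",
                          abundanceCol = "FPKM") {
  groups <- split(tab, tab[[geneIdCol]])
  vapply(groups, function(g) {
    weighted.mean(g[[lengthCol]], w = g[[abundanceCol]])
  }, numeric(1))
}

avg_len <- avg_tx_length(tx)
print(avg_len)
```

For geneA this gives (1000*1 + 2000*1 + 3000*2) / 4 = 2250, and for geneB (500*3 + 1500*1) / 4 = 750. The sketch does not handle genes whose transcripts all have zero abundance, for which the weighted mean is undefined.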

Normalization factors accounting for both average transcript length and library size of each sample are generated and then stored within the data object. The analysis can then continue with DESeq; the stored normalization factors will be used instead of size factors in the analysis.
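One plausible way to combine a length matrix with library size is sketched below in base R. It is an assumption-laden illustration rather than the package's implementation: the matrices counts and avg_len and the helper median_ratio_sf are hypothetical, and the sketch assumes all counts are positive so that logarithms are finite.

```r
# Hypothetical counts and average transcript lengths (genes x samples).
counts <- matrix(c(100, 200,
                   300, 150,
                    50, 100),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:3), c("s1", "s2")))
avg_len <- matrix(c(1000, 2000,
                    1500, 1500,
                     800,  800),
                  nrow = 3, byrow = TRUE, dimnames = dimnames(counts))

# Median-of-ratios size factors: for each sample, the median ratio of its
# values to the per-gene geometric mean across samples.
median_ratio_sf <- function(m) {
  log_geo_means <- rowMeans(log(m))
  apply(m, 2, function(col) exp(median(log(col) - log_geo_means)))
}

# Estimate library size on counts already corrected for length.
sf <- median_ratio_sf(counts / avg_len)

# Multiply each sample's lengths by its size factor, then scale each gene
# (row) so that its factors have geometric mean 1.
nf <- t(t(avg_len) * sf)
nf <- nf / exp(rowMeans(log(nf)))
print(nf)
```

Scaling each row to geometric mean 1 keeps the normalized counts on roughly the same scale as the raw counts, so the factors capture only between-sample differences for each gene.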

For RSEM genes.results files, specify level="gene", geneIdCol="gene_id", and lengthCol="effective_length".

For Cufflinks isoforms.fpkm_tracking files, specify level="tx", geneIdCol="gene_id", lengthCol="length", and abundanceCol="FPKM".

For Sailfish output files, one can write an importer function which attaches a column gene_id based on Transcript ID, and then specify level="tx", geneIdCol="gene_id", lengthCol="Length" and abundanceCol="RPKM".
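An importer of the kind described for Sailfish might look like the following sketch. Everything here is hypothetical: the lookup vector tx2gene, the function name sailfish_importer, the transcript ID column name "Transcript", and the small demonstration file, which only imitates a tab-delimited table with Length and RPKM columns. A real output file may need extra arguments (for example to skip comment lines), which can be passed through ... to read.delim.

```r
# Hypothetical transcript-to-gene lookup (names are transcript IDs).
tx2gene <- c(tx1 = "geneA", tx2 = "geneA", tx3 = "geneB")

# Hypothetical importer: reads a tab-delimited table and attaches a gene_id
# column based on the transcript ID column. The default "Transcript" is an
# assumption about the file's header, not a documented column name.
sailfish_importer <- function(file, txIdCol = "Transcript", ...) {
  tab <- read.delim(file, stringsAsFactors = FALSE, ...)
  tab$gene_id <- unname(tx2gene[as.character(tab[[txIdCol]])])
  tab
}

# Write a small made-up file so the importer can be pre-tested.
demo_file <- tempfile(fileext = ".txt")
write.table(
  data.frame(Transcript = c("tx1", "tx2", "tx3"),
             Length = c(1200, 800, 1500),
             RPKM = c(5.0, 2.5, 10.0)),
  demo_file, sep = "\t", quote = FALSE, row.names = FALSE
)

res <- sailfish_importer(demo_file)
print(res)
```

Such a function could then be supplied as the importer argument together with level="tx", geneIdCol="gene_id", lengthCol="Length" and abundanceCol="RPKM".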

Along with the normalization matrix, which is stored in normalizationFactors(dds), the resulting gene length matrix is stored in assays(dds)[["avgTxLength"]], where dds is the returned DESeqDataSet. This matrix will take precedence in calls to fpkm.

Examples

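library(DESeq2)

# Simulate transcript-level tables for two samples:
# 10 genes with 3 transcripts each
n <- 10
files <- c("sample1","sample2")
gene_id <- rep(paste0("gene",seq_len(n)),each=3)
set.seed(1)
sample1 <- data.frame(gene_id=gene_id,length=rpois(3*n,2000),FPKM=round(rnorm(3*n,10,1),2))
sample2 <- data.frame(gene_id=gene_id,length=rpois(3*n,2000),FPKM=round(rnorm(3*n,10,1),2))

# The "files" here are names of in-memory data frames,
# so get() stands in for a file-reading importer
importer <- get
dds <- makeExampleDESeqDataSet(n=n, m=2)

# Compute average transcript lengths and store normalization factors
dds <- normalizeGeneLength(dds, files=files, level="tx",
  geneIdCol="gene_id", lengthCol="length", abundanceCol="FPKM",
  dropGenes=TRUE, importer=importer)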