computeRPKM: Compute RPKM based on gene annotations

Description

Given a list of single-end or paired-end read alignment files in BAM/SAM/BED format, compute the read counts and normalized read counts as expression of annotated transcript in the unit of "reads per kilobase of exon per million mapped reads" (RPKM).

Usage

computeRPKM(bamFiles, RIPSeekerRead = TRUE, paired = FALSE, 
	countMode = "IntersectionNotEmpty", featureGRanges, 
	idType = "ensembl_transcript_id", featureType = "exon", 
	ignore.strand = FALSE, txDbName = "biomart", 
	moreGeneInfo = FALSE, saveData, justRPKM = TRUE, ...)

Arguments

bamFiles

A list of one or more BAM/SAM/BED alignment files.

RIPSeekerRead

Binary flag. If TRUE, then import and process the alignment files using the built-in function combineAlignGals from RIPSeeker package; if FALSE, then import the files by directly calling the required functions. The flag makes using the function outside of RIPSeeker package become possible.

paired

Binary to indicate whether the alignments files are paired-end. The alignments file must be either paired-end or single-end but not both.

countMode

An argument used to set the mode argument in the underlying function summarizeOverlaps employed to compute the read counts for each feature. The possible mode includes "Union", "IntersectionStrict", and "IntersectionNotEmpty". All three modes avoid double counting the reads by either discarding reads that completely fall into multiple features or counting the read only once for the feature that uniquely and completely includes it. Please refer to summarizeOverlaps for details.

featureGRanges

GRanges of features as an optional argument for function to compute RPKM/FPKM just for those features without retrieving online annotations.

idType

A character string that specifies the type of the annotations, which can "ensembl_transcript_id", "ensembl_gene_id", "ucsc", etc. Refer to listFilters for more information.

featureType

Features that will be groupped by genes/transcripts in a GRangesList. The available options are "exon" (Default), "intron", "fiveUTR", "threeUTR", and "CDS" corresponding to the functions exonsBy, cdsBy, intronsByTranscript, fiveUTRsByTranscript, threeUTRsByTranscript, and cdsBy, respectively.

ignore.strand

Whether to ignore strand when counting the reads (Default: FALSE).

txDbName

Name of the transcript database to use to retreive the annotation. The available options are "biomart" (Default) or "UCSC" corresponding to the functions makeTxDbFromBiomart and makeTxDbFromUCSC, respectively.

moreGeneInfo

Binary indicator to indicate whether to download more information for each genes/transcripts rather than having only the gene/transcript IDs (Default: FALSE).

saveData

Path of output file.

justRPKM

Binary for whether to return only the RangedSummarizedExperiment.

...

Extra arguments passed to functions makeTxDbFromBiomart, makeTxDbFromUCSC, useMart, combineAlignGals.

Value

rpkmSEobjectA RangedSummarizedExperiment object with assays slot saved for counts, rowRanges holds the features, metadata for RPKM/FPKM (normalized) gene expression.
rpkmDFData frame with or without the detailed gene information columns depending on whether moreGeneInfo is TRUE or FALSE. rpkmDF is only returned within in a list when justRPKM is FALSE.
featureGRangesThe features in GRanges object that are used to compute the gene expression. featureGRanges is only returned within in a list when justRPKM is FALSE.

Details

The function is a wrapper function making use of several external functions from several well maintained and freely available Bioconductor packages including GenomicFeatures, GenomicRanges, biomaRt and Rsamtools packages. The paired-end alignments are converted into single-end using function galp2gal and then subject to read count computation by summarizeOverlaps, which does not yet directly support paired-end alignments.

References

M. Carlson, H. Pages, P. Aboyoun, S. Falcon, M. Morgan, D. Sarkar and M. Lawrence. GenomicFeatures: Tools for making and manipulating transcript centric annotations. R package version 1.8.2.

P. Aboyoun, H. Pages and M. Lawrence (). GenomicRanges: Representation and manipulation of genomic intervals. R package version 1.8.9.

Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Steffen Durinck, Paul T. Spellman, Ewan Birney and Wolfgang Huber, Nature Protocols 4, 1184-1191 (2009).

BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Steffen Durinck, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma and Wolfgang Huber, Bioinformatics 21, 3439-3440 (2005).

Martin Morgan and Herv'e Pag`es (). Rsamtools: Binary alignment (BAM), variant call (BCF), or tabix file import. R package version 1.8.5. http://bioconductor.org/packages/release/bioc/html/Rsamtools.html

Examples

Run this code

if(interactive()) { # need internet connection
	
# Retrieve system files
extdata.dir <- system.file("extdata", package="RIPSeeker") 

bamFiles <- list.files(extdata.dir, ".bam$", recursive=TRUE, full.names=TRUE)

bamFiles <- grep("PRC2", bamFiles, value=TRUE)

# use biomart
txDbName <- "biomart"
biomart <- "ENSEMBL_MART_ENSEMBL"		# use archive to get ensembl 65
dataset <- "mmusculus_gene_ensembl"		
host <- "dec2011.archive.ensembl.org" 	# use ensembl 65 for annotation

resultlist <- computeRPKM(bamFiles=grep(pattern="SRR039214", 
        bamFiles, value=TRUE, invert=TRUE), #featureGRanges=featureGRanges, 
				dataset=dataset, moreGeneInfo=TRUE, justRPKM=FALSE,
				idType="ensembl_transcript_id", txDbName=txDbName, 
				biomart=biomart, host=host, by="tx")
		
}

Run the code above in your browser using DataLab