rulebaseRIPSeek: Compute the RPKM and foldchange between two conditions for the annotated genes

Description

The function takes alignments in two conditions (with replicates) as input and computes the gene expression in the two conditions in the unit of RPKM (reads per kilobase of exon per million mapped reads) or FPKM for paired-end alignments (where "F" stands for the fragment the mate-pairs are derived from), and then the foldchange ratio of the RPKM of each gene in RIP or treatment condition in general over control condition. The control files (i.e. the denominator in the foldchange ratio) is specified by user in the "cNAME" argument.

Usage

rulebaseRIPSeek(bamFiles, cNAME, featureGRanges, rpkmCutoff = 0.4, 
	fcCutoff = 3, moreRIPGeneInfo = TRUE, idType = "ensembl_transcript_id", 
	myMin = .Machine$double.xmin, saveRData, ...)

Arguments

bamFiles

A list of one or more BAM/SAM/BED alignment files.

cNAME

An identifer pattern found in the control alignment files. Once specified, these files will be used as control as the denomenator of the foldchange ratio and the remaining files as RIP, the numarator of the foldchange ratio.

featureGRanges

GRanges of features as an optional argument for function to compute RPKM/FPKM just for those features without retrieving online annotations.

rpkmCutoff

Cutoff for RPKM in RIP above which the genes will be reported if the fcCutoff is also satisfied (Default: 0.4).

fcCutoff

Cutoff for foldchange in RIP relative to the control above which the genes will be reported if the rpkmCutoff is also satisfied (Default: 3).

moreRIPGeneInfo

Binary indicator to indicate whether to download more information for each genes/transcripts rather than having only the gene/transcript IDs (Default: TRUE).

idType

A character string that specifies the type of the annotations, which can "ensembl_transcript_id" (Default), "ensembl_gene_id", "ucsc", etc. Refer to listFilters for more information.

myMin

Add a small value to both the numerator and denomenator as "pseudocount" to prevent the case where the denomenator is zero and the ratio becomes infinity regardless the value of the numerator (Default: .Machine$double.xmin).

saveRData

Path of output RData and tab-delim results.

...

Extra arguments passed to computeRPKM and/oruseMart.

Value

A list containing the following items:
nRPKMRPKM of genes in RIP or treatment condition ('n' stands for numerator in the foldchange ratio).
dRPKMRPKM of genes in control condition ('d' stands for denomenator in the foldchange ratio)
rpkmDFData frame containing read count, RPKM for the RIP (or treatment) and control, foldchange, and optional gene information including "chromosome_name", "start_position", "end_position", "strand", "external_gene_id", "ensembl_transcript_id", "ensembl_gene_id", "ucsc", "description"
rpkmCutoffCutoff used for RPKM as book keeping value.
fcCutoffCutoff used for foldchange as book keeping value.
featureGRangesGRanges object of the features for which the RPKM and foldchange are computed.

Details

The function uses computeRPKM to download annotation and compute RPKM/FPKM of the annotated genes in the list of files. The alignments file are separated into control as identified by the "cNAME" and the RIP (or any treatment) that do not have the cNAME in their file names. The alignments in either group are pooled together. If moreRIPGeneInfo is specified, the function witll query the Ensembl database. The chromosome ID in the numerical format used in Ensembl is prefixed with "chr" and the strand 1 and -1 converted to + and - for convenience.

References

Zhao, J., Ohsumi, T. K., Kung, J. T., Ogawa, Y., Grau, D. J., Sarma, K., Song, J. J., et al. (2010). Genome-wide Identification of Polycomb-Associated RNAs by RIP-seq. Molecular Cell, 40(6), 939D953. doi:10.1016/j.molcel.2010.12.011

M. Carlson, H. Pages, P. Aboyoun, S. Falcon, M. Morgan, D. Sarkar and M. Lawrence. GenomicFeatures: Tools for making and manipulating transcript centric annotations. R package version 1.8.2.

P. Aboyoun, H. Pages and M. Lawrence (). GenomicRanges: Representation and manipulation of genomic intervals. R package version 1.8.9.

Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Steffen Durinck, Paul T. Spellman, Ewan Birney and Wolfgang Huber, Nature Protocols 4, 1184-1191 (2009).

BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Steffen Durinck, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma and Wolfgang Huber, Bioinformatics 21, 3439-3440 (2005).

Martin Morgan and Herv'e Pag`es (). Rsamtools: Binary alignment (BAM), variant call (BCF), or tabix file import. R package version 1.8.5. http://bioconductor.org/packages/release/bioc/html/Rsamtools.html

Examples

Run this code

if(interactive()) {
	
# Retrieve system files
extdata.dir <- system.file("extdata", package="RIPSeeker") 

bamFiles <- list.files(extdata.dir, ".bam$", recursive=TRUE, full.names=TRUE)

bamFiles <- grep("PRC2", bamFiles, value=TRUE)

cNAME <- "SRR039214" 						# specify control name

# output file directory
outDir <- paste(getwd(), "ripSeek_example")

# use biomart
txDbName <- "biomart"
biomart <- "ENSEMBL_MART_ENSEMBL"		# use archive to get ensembl 65
dataset <- "mmusculus_gene_ensembl"		
host <- "dec2011.archive.ensembl.org" 	# use ensembl 65 for annotation

resultlist <- rulebaseRIPSeek(bamFiles, "SRR039214", dataset=dataset, 
		txDbName=txDbName, biomart=biomart, host=host, by="tx")

}

Run the code above in your browser using DataLab