wrapKnown: Run all necessary steps to get expression estimates from multiple bam files with the casper pipeline.

Description

Function to analyze bam files to generate an ExpressionSet with expression estimates for all samples, read start and fragment length distributions, path counts and optinally processed reads.

Usage

wrapKnown(bamFile, verbose=FALSE, seed=1, mc.cores.int=1,
mc.cores=1, genomeDB, readLength, rpkm=TRUE, priorq=2, priorqGeneExpr=2,
citype='none', niter=10^3, burnin=100, keep.pbam=FALSE,
keep.multihits=TRUE, chroms=NULL)

Arguments

bamFile

Names of bam files with the sample to analyze. These must sorted and indexed, and the index must be in the same directory.

verbose

Set to TRUE to display progress information.

seed

Set seed of random number generator.

mc.cores.int

Number of cores to use when loading bam files. This is a memory intensive step, therefore number of cores must be chosen according to available RAM memory.

mc.cores

Number of cores to use in expression estimation.

genomeDB

annotatedGenome object containing annotated genome, as returned by the procGenome function.

readLength

Read length in bp, e.g. in a paired-end experiment where 75bp are sequenced on each end one would set readLength=75.

rpkm

Set to TRUE to return reads per kilobase per million (RPKM). Set to FALSE to return relative expression levels. Important, relative expression adds up to 1 within gene island, NOT within gene. To get relative expressions within gene run relexprByGene afterwards. See details.

priorq

Parameter of the prior distribution on the proportion of reads coming from each variant. The prior is Dirichlet with prior sample size for each variant equal to priorq. We recommend priorq=2 for estimation, as it pools the estimated expression away from 0 and 1 and returned lower estimation errors than priorq=1 in our simulated experiments.

priorqGeneExpr

Parameter for prior distribution on overall gene expression. Defaults to 2, which ensures non-zero estimates for all genes

citype

Set to "none" to return no credibility intervals. Set to "asymp" to return approximate 95% CIs (obtained via the delta method). Set to "exact" to obtain exact CIs via Monte Carlo simulation. Options "asymp" and especially "exact" can increase the computation time substantially.

niter

Number of Monte Carlo iterations. Only used when citype=="exact".

burnin

Number of burnin Monte Carlo iterations. Only used when citype=="exact".

keep.pbam

Set to TRUE to save processed bam object, as returned by procBam. This object can require substantial memory during execution and disk storage upon saving and is not needed for a default analysis.

keep.multihits

Set to FALSE to discard reads aligned to multiple positions.

chroms

Manually set chromosomes to be processed. By default only main chromosomes are considered (except 'chrM')

Value

distr: Object of class readDistrs
pbam: List of objects of class procBam with one element per chromosome
pc: Object of class pathCounts
exp: Object of class ExpressionSet

Details

The function executes the functions procBam, getDistrs and pathCounts in parallel for each chromosome, but is much more efficient in cpu speed and memory usage than running these functions separately. Data from multiple samples are then combined using mergeExp. Note that further normalization (e.g. quantileNorm) may be needed preliminary to actual data analysis.

When rpkm is false the function returns the estimated proportion of reads arising from each isoform within a gene island. casper groups two or more genes into a gene island whenever these genes share an exon (or part of an exon). Because exons are shared, isoform quantification must be done simultaneously for all those genes.

That is, the output from wrapKnown when rpkm is FALSE are proportions that add up to 1 within each island. If you would like to re-normalize these expressions so that they add up to 1 within each gene, see the help for function relexprByGene.

One last remark: casper returns the estimated proportion of reads generated by each isoform, which is not the same as relative isoform expressions. Longer isoforms tend to produce more reads than shorter isoforms. This is easily accounted for by dividing relative expressions by isoform length, see relexprByGene.

References

Rossell D, Stephan-Otto Attolini C, Kroiss M, Stocker A. Quantifying Alternative Splicing from Paired-End RNA-sequencing data. Annals of Applied Statistics, 8(1):309-330.

Examples

Run this code

## genDB<-makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
## hg19DB <- procGenome(genDB, "hg19")
##  bamFile="/path_to_bam/sorted.bam"
## ans <- wrapKnown(bamFile=bamFile, mc.cores.int=4, mc.cores=3, genomeDB=hg19DB, readLength=101)
##  names(ans)
##  head(exprs(ans\$exp))

Run the code above in your browser using DataLab