simMAE: Simulate Mean Absolute Error (MAE) in estimating isoform expression under various experimental settings.

Description

Simulate several future RNA-seq data under various experimental settings (sequencing depth, read length, insert sizes), estimate isoform expression and assess the MAE incurred in the estimation process. The function is a wrapper combining functions simReads and calcExp.

Usage

simMAE(nsim, islandid, nreads, readLength, fragLength, burnin=1000, pc, distr, readLength.pilot=readLength, eset.pilot, usePilot=FALSE, retTxsError=FALSE, genomeDB, mc.cores=1, mc.cores.int=1, verbose=FALSE, writeBam=FALSE, bamFile=NULL)

Arguments

nsim

Number of RNA-seq datasets to generate (often as little as nsim=10 suffice)

islandid

When specified this argument indicates to run the simulations only for gene islands with identifiers in islandid. When not specified genome-wide simulations are performed.

nreads

Vector indicating the target number of read pairs for each experimental setting. The actual number of reads differs from nreads to account for non-mappability and random read yield (see details)

readLength

Vector indicating the read length in each experimental setting

fragLength

Vector indicating the mean insert size in each experimental setting

burnin

Number of MCMC burn-in samples (passed on to calcExp)

Observed path counts in pilot data. When not specified, these are simulated from eset.pilot

distr

Estimated read start and insert size distributions in pilot data

readLength.pilot

Read length in pilot data

eset.pilot

ExpressionSet with pilot data expression in log2-RPKM, used to simulate pc when not specified by the user. See details

usePilot

By default casper assumes that the pilot data is from a related experiment rather than the current tissue of interest (usePilot=FALSE). Hence, the pilot data is used to simulate new RNA-seq data but not to estimate its expression. However, in some cases we may be interested in re-sequencing the pilot sample at deeper length, in which case one would want to combine the pilot data with the new data to obtain more precise estimates. This can be achieved by setting usePilot=TRUE

retTxsError

If retTxsError=TRUE, simMAE returns posterior expected MAE for each individual isoform. This option is not available when eset.pilot is specified instead of pc. Else the output is a data.frame with overall MAE across all isoforms

genomeDB

annotatedGenome object, as returned by procGenome

mc.cores

Number of cores to use in the expression estimation step, passed on to calcExp

mc.cores.int

Number of cores to simulate RNA-seq datasets in parallel

verbose

Set verbose=TRUE to print progress information

writeBam

Set to TRUE to write simulated reads to a .bam file

bamFile

Name of the .bam file

Value

If retTxsError==TRUE, simMAE returns posterior expected MAE for each individual isoform. Else the output is a data.frame with overall MAE across all isoforms

Details

simMAE simulates nsim datasets under each experimental setting defined by nreads, readLength, fragLength. For each dataset the following steps are performed:

1. The number of reads is nreads * readYield * pmapped, where readYield= runif(1,0.8,1.2) accounts for deviations in read yield and pmapped= runif(1,0.6,0.9)*pmappable is the proportion of mapped reads (60%-90% of the mappable reads according to the piecewise-linear power law of Li et al (2014))

2. True expression levels pi are generated from their posterior distribution given the pilot data.

3. Conditional on pi, RNA-seq data are generated and expression estimates pihat are obtained using calcExp

4. The mean absolute estimation error sum(abs(pihat-pi)) across all isoforms is computed

Ideally simMAE should use pilot data from a relevant related experiment to simulate what future data may look like for the current experiment of interest. The recommened way to do this is to download a .bam file from such a related experiment and processing it in casper with function wrapKnown, as then both gene and isoform expression can be estimated accurately. The object output by wrapKnown is a list with elements named 'pc', 'distr' which can be given as input to simMAE.

As an alternative to specifying pc, simMAE allows setting eset.pilot as pilot data. Gene and isoform expression are then simulated as follows:

1. The number of reads per gene is generated from a Multinomial distribution with success probabilities proportional to 2^exprs{eset.pilot}.

2. Relative isoform expression within each gene are generated from a symmetric Dirichlet distribution with parameter 1/Ig, where Ig is the number of isoforms in gene g.

We emphasize that relative isoform expressions are not trained from the pilot data, and that while the distribution of gene expression levels resembles that in eset.pilot, no attempt is made to match gene identifiers and hence the results for individual genes should not be trusted (hence this option is only available when retTxsError==FALSE.

References

Stephan-Otto Attolini C., Pena V., Rossell D. Bayesian designs for personalized alternative splicing RNA-seq studies (2014)

Li, W. and Freudenberg, J. and Miramontes, P. Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome. BMC Bioinformatics, 15, 2 (2014)

Examples

Run this code

## maybe str(simMAE) ; plot(simMAE) ...

Run the code above in your browser using DataLab