simMultSamples: Simulate paired end reads for multiple future samples based on pilot data, and obtain their expression estimates via casper

Description

Simulate true expression levels and observed data (casper expression estimates) for future samples within each group. These simulations serve as the basis for sample size calculation: if one were to sequence nsamples new RNA-seq samples, what data would we expect to see? The simulation is posterior predictive, i.e. based on the current available data x.

Usage

simMultSamples(nsim, nsamples, nreads, readLength, fragLength, x,
groups='group', distrs, genomeDB, model='LNNMV', verbose=TRUE, mc.cores=1)

Arguments

nsim

Number of simulations to obtain

nsamples

Vector indicating number of future samples per group, e.g. nsamples=c(5,5) to simulate 5 new samples for 2 groups.

nreads

Desired number of paired-end reads per sample. The actual number of aligned reads for any given sample differs from this amount, see details.

readLength

Read length, i.e. in an experiment with paired reads at 100bp each, readLength=100.

fragLength

Desired average insert size (size of RNA molecules after fragmentation). If missing, insert sizes are as obtained from distrs

ExpressionSet containing pilot data. x[[group]] indicates groups to be compared

groups

Name of column in pData(x) indicating the groups

distrs

Fragment start and length distributions. It can be either an object or a list of objects of class readDistrs. In the latter case, an element is chosen at random for each individual sample to consider uncertainty in these distributions. If not specified, it defaults to data(distrsGSE37704).

genomeDB

annotatedGenome object

model

Set to 'LNNMV' to simulate from log-normal normal with modified variance model (Yuan and Kendsiorski, 2006), or to 'GaGa' to simulate from the GaGa model (Rossell, 2009). See details.

verbose

Set to TRUE to print progress

mc.cores

Number of cores to use in function. mc.cores>1 requires package parallel

Value

simTruth: data.frame indicating the mean and standard deviation of the Normal distribution used to generate data from each group
simExpr: ExpressionSet with Casper expression estimates, as returned by calcExp. pData(simExpr) indicates group information, and fData(simExpr) the number of simulated reads for each sample (in columns 'explCnts') and across all samples (in column 'readCount')

Details

The posterior predictive simulations is based on four steps: (1) simulate true expression for each group (mean and SD), (2) simulate true expression for future samples, (3) simulate paired reads for each future sample, (4) estimate expression from the reads via Casper. Below are some more details.

1. Simulate true mean expression in each group and residual variance for each gene. If model=='LNNMV' this is based on the log-normal normal with modified variance model in package EBarrays (Yuan & Kendziorski 2006), if model=='GaGa' this is based on the GaGa model (Rossell, 2009). adapted to take into account that the expression estimates in the pilot data x are noisy (which is why simMultSamples requires the SE / posterior SD associated to exprs(x)). The simulated values are returned in component "simTruth" of the simMultSamples output.

2. Simulate true isoform expression for each of the future samples. These are independent Normal draws with mean and variance generated in step 1. True gene expression is derived from the isoform expressions.

3. Determine the number of reads to be simulated for each gene based on its true expression (generated in step 2) and a Multinomial sampling model. For each sample:

- The number of reads yielded by the experiment is Unif(.8*nreads,1.2*nreads) - A proportion of non-mappable reads is discarded using the power law in Li et al (2014) - Amongst remaining reads, we assume that a proportion Unif(0.6,0.9) were aligned (consistenly with reports from ENCODE project) The final number of simulated reads is reported in component "simExpr" of the simMultSamples output. 4. Obtain expression estimates from the path counts produced in step 3 via calcExp. These are reported in component "simExpr" of the simMultSamples output.

References

Rossell D. (2009) GaGa: a Parsimonious and Flexible Model for Differential Expression Analysis. Annals of Applied Statistics, 3, 1035-1051.

Stephan-Otto Attolini C., Pena V., Rossell D. Bayesian designs for personalized alternative splicing RNA-seq studies (2015)

Yuan, M. and Kendziorski, C. (2006). A unified approach for simultaneous gene clustering and differential expression identification. Biometrics, 62, 1089-1098.

Examples

Run this code

#Run casperDesign() to see full manual with examples

Run the code above in your browser using DataLab