nsamples
new RNA-seq samples, what data
would we expect to see? The simulation is posterior predictive,
i.e. based on the current available data x
.
simMultSamples(nsim, nsamples, nreads, readLength, fragLength, x,
groups='group', distrs, genomeDB, model='LNNMV', verbose=TRUE, mc.cores=1)
nsamples=c(5,5)
to simulate 5 new samples for 2 groups.readLength=100
.distrs
ExpressionSet
containing pilot data. x[[group]]
indicates groups to be comparedpData(x)
indicating the groups'LNNMV'
to simulate from log-normal normal
with modified variance model (Yuan and Kendsiorski, 2006),
or to 'GaGa'
to simulate from the
GaGa model (Rossell, 2009). See details.TRUE
to print progressmc.cores>1
requires package parallel
simulatedSamples
, which extends a list
of length nsim
. See the class documentation for some helpful
methods (e.g. coef, exprs, mergeBatches).
Each element is itself a list containing an individual simulation.data.frame
indicating the mean and
standard deviation of the Normal distribution used to generate data
from each groupExpressionSet
with Casper expression estimates,
as returned by calcExp
. pData(simExpr)
indicates group
information, and fData(simExpr)
the number of simulated reads
for each sample (in columns 'explCnts'
) and across all samples (in column 'readCount'
) 1. Simulate true mean expression in each group and residual variance
for each gene. If model=='LNNMV' this is based on the log-normal
normal with modified variance model in package EBarrays
(Yuan & Kendziorski 2006), if model=='GaGa' this is based on the GaGa
model (Rossell, 2009).
adapted to take into account that the expression estimates in the
pilot data x
are noisy (which is why simMultSamples
requires the
SE / posterior SD associated to exprs(x)
).
The simulated values are returned in component "simTruth"
of
the simMultSamples
output.
2. Simulate true isoform expression for each of the future samples. These are independent Normal draws with mean and variance generated in step 1. True gene expression is derived from the isoform expressions.
3. Determine the number of reads to be simulated for each gene based on its true expression (generated in step 2) and a Multinomial sampling model. For each sample:
- The number of reads yielded by the experiment is Unif(.8*nreads,1.2*nreads)
- A proportion of non-mappable reads is discarded using the power law in Li
et al (2014)
- Amongst remaining reads, we assume that a proportion Unif(0.6,0.9) were aligned
(consistenly with reports from ENCODE project)
The final number of simulated reads is reported in component "simExpr"
of the simMultSamples
output.
4. Obtain expression estimates from the path counts produced in step 3
via calcExp
. These are reported in component "simExpr"
of the simMultSamples
output.
Stephan-Otto Attolini C., Pena V., Rossell D. Bayesian designs for personalized alternative splicing RNA-seq studies (2015)
Yuan, M. and Kendziorski, C. (2006). A unified approach for simultaneous gene clustering and differential expression identification. Biometrics, 62, 1089-1098.
#Run casperDesign() to see full manual with examples
Run the code above in your browser using DataLab