simMAE(nsim, islandid, nreads, readLength, fragLength, burnin=1000, pc, distr, readLength.pilot=readLength, eset.pilot, usePilot=FALSE, retTxsError=FALSE, genomeDB, mc.cores=1, mc.cores.int=1, verbose=FALSE, writeBam=FALSE, bamFile=NULL)nsim=10 suffice)islandid. When not specified genome-wide simulations are
performed.nreads to account for non-mappability and random read yield
(see details)calcExp)eset.pilotpc when not specified by the
user. See detailscasper assumes that the pilot data is from a related experiment rather than the current tissue of interest (usePilot=FALSE). Hence, the pilot data is used to simulate new RNA-seq data but not to estimate its expression. However, in some cases we may be interested in re-sequencing the pilot sample at deeper length, in which case one would want to combine the pilot data with the new data to obtain more precise estimates. This can be achieved by setting usePilot=TRUEretTxsError=TRUE, simMAE returns posterior expected MAE
for each individual isoform. This option is not available when
eset.pilot is specified instead of pc.
Else the output is a data.frame with overall MAE across all isoformsannotatedGenome object, as returned by
procGenomecalcExpverbose=TRUE to print progress informationTRUE to write simulated reads to a .bam fileretTxsError==TRUE, simMAE returns posterior expected MAE
for each individual isoform.
Else the output is a data.frame with overall MAE across all isoforms
simMAE simulates nsim datasets under each experimental
setting defined by nreads, readLength,
fragLength. For each dataset the following steps are performed:1. The number of reads is nreads * readYield * pmapped, where readYield= runif(1,0.8,1.2) accounts for deviations in read yield and pmapped= runif(1,0.6,0.9)*pmappable is the proportion of mapped reads (60%-90% of the mappable reads according to the piecewise-linear power law of Li et al (2014))
2. True expression levels pi are generated from their posterior
distribution given the pilot data.
3. Conditional on pi, RNA-seq data are generated and expression
estimates pihat are obtained using calcExp
4. The mean absolute estimation error sum(abs(pihat-pi)) across
all isoforms is computed
Ideally simMAE should use pilot data from a relevant related experiment
to simulate what future data may look like for the current experiment of
interest.
The recommened way to do this is to download a .bam file from such a
related experiment and processing it in casper with function
wrapKnown, as then both gene and isoform expression can be
estimated accurately.
The object output by wrapKnown is a list with elements
named 'pc', 'distr' which can be given as input to
simMAE.
As an alternative to specifying pc,
simMAE allows setting eset.pilot as
pilot data. Gene and isoform expression are then simulated as follows:
1. The number of reads per gene is generated from a Multinomial
distribution with success probabilities proportional to
2^exprs{eset.pilot}.
2. Relative isoform expression within each gene are generated from a symmetric Dirichlet distribution with parameter 1/Ig, where Ig is the number of isoforms in gene g.
We emphasize that relative isoform expressions are not trained from the
pilot data, and that while the distribution of gene expression levels
resembles that in eset.pilot, no attempt is made to match gene
identifiers and hence the results for individual genes should not be
trusted
(hence this option is only available when retTxsError==FALSE.
Li, W. and Freudenberg, J. and Miramontes, P. Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome. BMC Bioinformatics, 15, 2 (2014)
## maybe str(simMAE) ; plot(simMAE) ...
Run the code above in your browser using DataLab