simMAE(nsim, islandid, nreads, readLength, fragLength, burnin=1000, pc, distr, readLength.pilot=readLength, eset.pilot, usePilot=FALSE, retTxsError=FALSE, genomeDB, mc.cores=1, mc.cores.int=1, verbose=FALSE, writeBam=FALSE, bamFile=NULL)
nsim=10
suffice)islandid
. When not specified genome-wide simulations are
performed.nreads
to account for non-mappability and random read yield
(see details)calcExp
)eset.pilot
pc
when not specified by the
user. See detailscasper
assumes that the pilot data is from a related experiment rather than the current tissue of interest (usePilot=FALSE
). Hence, the pilot data is used to simulate new RNA-seq data but not to estimate its expression. However, in some cases we may be interested in re-sequencing the pilot sample at deeper length, in which case one would want to combine the pilot data with the new data to obtain more precise estimates. This can be achieved by setting usePilot=TRUE
retTxsError=TRUE
, simMAE
returns posterior expected MAE
for each individual isoform. This option is not available when
eset.pilot
is specified instead of pc
.
Else the output is a data.frame
with overall MAE across all isoformsannotatedGenome
object, as returned by
procGenome
calcExp
verbose=TRUE
to print progress informationTRUE
to write simulated reads to a .bam fileretTxsError==TRUE
, simMAE
returns posterior expected MAE
for each individual isoform.
Else the output is a data.frame
with overall MAE across all isoforms
simMAE
simulates nsim
datasets under each experimental
setting defined by nreads
, readLength
,
fragLength
. For each dataset the following steps are performed:1. The number of reads is nreads * readYield * pmapped, where readYield= runif(1,0.8,1.2) accounts for deviations in read yield and pmapped= runif(1,0.6,0.9)*pmappable is the proportion of mapped reads (60%-90% of the mappable reads according to the piecewise-linear power law of Li et al (2014))
2. True expression levels pi
are generated from their posterior
distribution given the pilot data.
3. Conditional on pi
, RNA-seq data are generated and expression
estimates pihat
are obtained using calcExp
4. The mean absolute estimation error sum(abs(pihat-pi))
across
all isoforms is computed
Ideally simMAE
should use pilot data from a relevant related experiment
to simulate what future data may look like for the current experiment of
interest.
The recommened way to do this is to download a .bam file from such a
related experiment and processing it in casper with function
wrapKnown
, as then both gene and isoform expression can be
estimated accurately.
The object output by wrapKnown
is a list with elements
named 'pc'
, 'distr'
which can be given as input to
simMAE
.
As an alternative to specifying pc
,
simMAE
allows setting eset.pilot
as
pilot data. Gene and isoform expression are then simulated as follows:
1. The number of reads per gene is generated from a Multinomial
distribution with success probabilities proportional to
2^exprs{eset.pilot}
.
2. Relative isoform expression within each gene are generated from a symmetric Dirichlet distribution with parameter 1/Ig, where Ig is the number of isoforms in gene g.
We emphasize that relative isoform expressions are not trained from the
pilot data, and that while the distribution of gene expression levels
resembles that in eset.pilot
, no attempt is made to match gene
identifiers and hence the results for individual genes should not be
trusted
(hence this option is only available when retTxsError==FALSE
.
Li, W. and Freudenberg, J. and Miramontes, P. Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome. BMC Bioinformatics, 15, 2 (2014)
## maybe str(simMAE) ; plot(simMAE) ...
Run the code above in your browser using DataLab