calcDenovo
estimates expression of gene splicing variants,
considering both known variants and variants that have not been
previously described.
calcDenovo(distrs, targetGenomeDB, knownGenomeDB=targetGenomeDB, pc,
readLength, islandid, priorq=3, mprior, minpp=0.001, selectBest=FALSE,
searchMethod="submodels", niter, exactMarginal=TRUE,
integrateMethod="plugin", verbose=TRUE, mc.cores=1)
getDistrs
functionannotatedGenome
object with isoforms we wish to
quantify. By default these are the same as in knownGenomeDB
,
but more typically targetGenomeDB
is imported from a .gtf file
produced by some isoform prediction software.annotatedGenome
object with known isoforms,
e.g. from UCSC or GENCODE annotations. Used to set the prior
probability that any given isoform is expressed. knownGenome
should be the same genome annotations used to create argument
mprior
(when provided)pathCounts
readLength=75
.priorq=3
as this defines a non-local
prior that penalizes falsely predicted isoforms that show low expression.modelPrior
, used to favor isoforms
consistent with knownGenomeDB
. If left
missing it is estimated from knownGenomeDB
.
See details.minpp
are not reported. This argument can help reduce substantially the amount of
required memory to store the results.TRUE
only the model with highest
posterior probability is reported. While this can save memory, we do
not recommend this option as it may ignore a substantial amount of
uncertainty."allmodels"
enumerates all possible models (warning:
this is not feasible for genes with >5 exons). "rwmcmc"
uses a
random-walk MCMC scheme to focus on models with high posterior
probability.
"submodels"
considers that some isoforms in targetGenomeDB
may not be expressed, but does not search for new variants.
"auto"
uses "allmodels"
for genes with up to 5
exons and "rwmcmc"
for longer genes. See details.FALSE
to estimate posterior model
probabilities as the proportion of MCMC visits. Set to TRUE
to
use the integrated likelihoods (default). See details.'plugin'
) evaluates likelihood*prior at the posterior mode and
is the faster option. Set 'Laplace'
for Laplace approximations
and 'IS'
for Importance Sampling. The latter increases
computation cost very substantially.TRUE
to display progress information.multicore
is
available for your system.
Warning: using multiple processors substantially increases the memory
requirements, so set this value carefully.calcDenovo
explores which subset of the isoforms indicated in
targetGenomeDB
are truly expressed.
It also adds new isoforms when some reads follow an exon path that
is not possible under any of the isoforms in targetGenomeDB
.
calcDenovo
the posterior probability of each model
(i.e. configuration of expressed variants) via Bayes theorem.P(model|y) "proportional to" m(y|model) P(model)
where m(y|model) is the integrated likelihood and P(model) is the
prior probability of the model.
For example, a gene with 20 predicted isoforms in targetGenome
gives rise 2^20 - 1 possible models (configurations of expressed isoforms).
Importantly, P(model) can be set by analyzing available genome
annotations in knownGenomeDB
.
For instance, genes with 20 exons have isoforms that tend
to use most of the 20 exons. They also tend to express more
isoforms than genes with 5 exons. The function modelPrior
analyzes knownGenomeDB
to set reasonable values for P(model).
An exhaustive enumeration of all possible models is
not feasible unless the gene is very short (e.g. around 5 exons).
For longer genes we use computational strategies to search a subset of
"interesting" models. This is controlled by the argument searchMethod
(see above).
In order to compute P(model|y) one can either use the computed
m(y|model) P(model) (option exactMarginal==TRUE
) or the
proportion of MCMC visits (option exactMarginal==FALSE
). Unless
niter
is large the former option typically provides more
precise estimates.
denovoExpr
to obtain expression estimates from the
calcDenovo
output.
plotExpr
to produce a plot with splicing variants and estimated expression.