selex.mm(sample, order=NA, crossValidationSample=NULL, Kmax= NULL, seqfilter=NULL, mmMethod="DIVISION", mmWithLeftFlank=FALSE)
NA
, builds Markov models of all orders from 0
to Kmax-1
, selecting the one with highest R^2.NULL
, no R^2 value will be provided.NA
, automatically finds kmax with a minimum count set to 100. See selex.kmax
and `Details'."DIVISION"
(default) or "TRANSITION"
. See `Details'.selex.mm
returns a Markov model handle.
selex.mm
builds an N-th order Markov model by training the model on the sample
dataset and tests its accuracy by attempting to predict the K-mer counts of the cross-validation dataset. This K-mer length is determined either by setting Kmax
or by an internal call to selex.kmax
. If a cross-validation dataset does not exist, selex.split
can be used to partition a dataset into testing and training datasets.
The Markov model uses conditional probabilities to predict the probability of observing a given K-mer. For example, let us consider the following 6-mer sequence:
AGGCTA
If a fourth-order Markov model were to predict the prior observation probability of the above sequence, the following would have to be evaluated:
$$p(AGGCTA) = p(AGG)~p(C|AGG)~p(T|GGC)~p(A|GCT)$$
These conditional probabilities can be evaluated using one of two algorithms. The TRANSITION
algorithm relies on K-mer counts of length equal to the order of the Markov model. For the example above,
$$p(C|AGG) = \frac{count(AGGC)}{count(AGGA)+count(AGGC)+count(AGGG)+count(AGGT)}$$
The DIVISION
algorithm relies on K-mer frequencies for lengths equal to Markov model order and order-1. For the example above,
$$p(C|AGG) = \frac{P(AGGC)}{P(AGG)} = \frac{\frac{count(AGGC)}{total 4-mer counts}}{\frac{(count(AGG)}{total 3-mer counts}} $$
The the flanking sequences in the left flank of the variable region can be taken into consideration when predicting prior observation probabilities by setting mmWithLeftFlank
to TRUE
. If the left barcode of the sequence in the example above is TGG, P(AGG) in the prior probability becomes
$$p(AGG) = p(A|TGG)~p(G|GGA)~p(G|GAG).$$
After computation, the Markov model is stored in the working directory. When a new seqfilter
object is provided, the Markov model is reconstructed. See selex.seqfilter
for more details. See `References' for more details regarding the model construction process.
selex.mmSummary
can be used to view the R^2 values for all orders that have been computed for the Markov model. If crossValidationSample
is NULL
, the resulting Markov model will not be displayed by selex.mmSummary
.selex.counts
, selex.infogain
, selex.kmax
, selex.mmSummary
, selex.run
, selex.split
mm = selex.mm(sample=r0.split$train, order=NA, crossValidationSample=r0.split$test)
# View Markov model R^2 values
selex.mmSummary()
Run the code above in your browser using DataLab