selex.mm: Build or retrieve a Markov model

Description

A function used to compute and store a Markov model built by training and cross-validating on specified files. It returns a Markov model handle to conveniently reference the calculated model in other SELEX functions.

Usage

selex.mm(sample, order=NA, crossValidationSample=NULL, Kmax= NULL,  seqfilter=NULL, mmMethod="DIVISION", mmWithLeftFlank=FALSE)

Arguments

sample

A sample handle to the training dataset.

order

The order of the Markov model to be built. If NA, builds Markov models of all orders from 0 to Kmax-1, selecting the one with highest R^2.

crossValidationSample

A sample handle to the cross-validation dataset. If NULL, no R^2 value will be provided.

Kmax

The K-mer length to determine model fit on. If NA, automatically finds kmax with a minimum count set to 100. See selex.kmax and `Details'.

seqfilter

A sequence filter object to include/exclude sequences that are read in from the FASTQ file.

mmMethod

A character string indicating the algorithm used to evaluate the Markov model conditional probabilities. Can be either "DIVISION" (default) or "TRANSITION". See `Details'.

mmWithLeftFlank

Predict expected counts by considering the sequences in the left flank of the variable region. See `Details'.

Value

selex.mm returns a Markov model handle.

Details

selex.mm builds an N-th order Markov model by training the model on the sample dataset and tests its accuracy by attempting to predict the K-mer counts of the cross-validation dataset. This K-mer length is determined either by setting Kmax or by an internal call to selex.kmax. If a cross-validation dataset does not exist, selex.split can be used to partition a dataset into testing and training datasets. The Markov model uses conditional probabilities to predict the probability of observing a given K-mer. For example, let us consider the following 6-mer sequence: AGGCTA If a fourth-order Markov model were to predict the prior observation probability of the above sequence, the following would have to be evaluated: $$p(AGGCTA) = p(AGG)~p(C|AGG)~p(T|GGC)~p(A|GCT)$$ These conditional probabilities can be evaluated using one of two algorithms. The TRANSITION algorithm relies on K-mer counts of length equal to the order of the Markov model. For the example above, $$p(C|AGG) = \frac{count(AGGC)}{count(AGGA)+count(AGGC)+count(AGGG)+count(AGGT)}$$ The DIVISION algorithm relies on K-mer frequencies for lengths equal to Markov model order and order-1. For the example above, $$p(C|AGG) = \frac{P(AGGC)}{P(AGG)} = \frac{\frac{count(AGGC)}{total 4-mer counts}}{\frac{(count(AGG)}{total 3-mer counts}} $$ The the flanking sequences in the left flank of the variable region can be taken into consideration when predicting prior observation probabilities by setting mmWithLeftFlank to TRUE. If the left barcode of the sequence in the example above is TGG, P(AGG) in the prior probability becomes $$p(AGG) = p(A|TGG)~p(G|GGA)~p(G|GAG).$$ After computation, the Markov model is stored in the working directory. When a new seqfilter object is provided, the Markov model is reconstructed. See selex.seqfilter for more details. See `References' for more details regarding the model construction process. selex.mmSummary can be used to view the R^2 values for all orders that have been computed for the Markov model. If crossValidationSample is NULL, the resulting Markov model will not be displayed by selex.mmSummary.

References

Slattery, M., Riley, T.R., Liu, P., Abe, N., Gomez-Alcala, P., Dror, I., Zhou, T., Rohs, R., Honig, B., Bussemaker, H.J.,and Mann, R.S. (2011) Cofactor binding evokes latent differences in DNA binding specificity between Hox proteins. Cell 147:1270--1282. Riley, T.R., Slattery, M., Abe, N., Rastogi, C., Liu, D., Mann, R.S., and Bussemaker, H.J. (2014) SELEX-seq: a method for characterizing the complete repertoire of binding site preferences for transcription factor complexes. Methods Mol. Biol. 1196:255--278.

Examples

Run this code


mm = selex.mm(sample=r0.split$train, order=NA, crossValidationSample=r0.split$test)

# View Markov model R^2 values
selex.mmSummary()

Run the code above in your browser using DataLab