motifbreakR(snpList, pwmList, threshold = 0.85, filterp = FALSE,
method = "default", show.neutral = FALSE, verbose = FALSE, bkg = c(A =
0.25, C = 0.25, G = 0.25, T = 0.25), BPPARAM = bpparam())
snps.from.rsid
or snps.from.file
MotifList
containing the motifs that
you wish to interrogatedefault
, log
, ic
, or notrans
; see
details.log
method=ic
register
and see getClass("BiocParallelParam")
for additional parameter
classes. Try BiocParallel::registered()
to see what's availible and
for example BiocParallel::bpparam("SerialParam")
would allow serial
evaluation.filterp = FALSE
,
this is the value that is thresholded.filterp = FALSE
,
this is the value that is thresholded.NA
. see calculatePvalue
for more informationNA
. see calculatePvalue
for more informationmotifPos
in the motifmotifPos
in the motifplotMB
method
argument.Suppose we have a frequency matrix $M$ of width $n$ (i.e. a PPM as described above). Furthermore, we have a sequence $s$ also of length $n$, such that $s_{i} \in { A,T,C,G }, i = 1,\ldots n$. Each column of $M$ contains the frequencies of each letter in each position.
Commonly in the literature sequences are scored as the sum of log probabilities:
Equation 1
$$F( s,M ) = \sum_{i = 1}^{n}{\log( \frac{M_{s_{i},i}}{b_{s_{i}}} )}$$
where $b_{s_{i}}$ is the background frequency of letter $s_{i}$ in
the genome of interest. This method can be specified by the user as
method='log'
.
As an alternative to this method, we introduced a scoring method to
directly weight the score by the importance of the position within the
match sequence. This method of weighting is accessed by specifying
method='ic'
(information content). A general representation
of this scoring method is given by:
Equation 2
$$F( s,M ) = p_{s} \cdot \omega_{M}$$
where $p_{s}$ is the scoring vector derived from sequence $s$ and matrix $M$, and $w_{M}$ is a weight vector derived from $M$. First, we compute the scoring vector of position scores $p$
Equation 3
$$p_{s} = ( M_{s_{i},i} ) \textrm{\ \ \ where\ \ \ } \frac{i = 1,\ldots n}{s_{i} \in { A,C,G,T }}$$
and second, for each $M$ a constant vector of weights $\omega_{M} = ( \omega_{1},\omega_{2},\ldots,\omega_{n} )$.
There are two methods for producing $\omega_{M}$. The first, which we call weighted sum, is the difference in the probabilities for the two letters of the polymorphism (or variant), i.e. $\Delta p_{s_{i}}$, or the difference of the maximum and minimum values for each column of $M$:
Equation 4.1
$$\omega_{i} = \max { M_{i} } - \min { M_{i} }\textrm{\ \ \ \ where\ \ \ \ \ \ }i = 1,\ldots n$$
The second variation of this theme is to weight by relative entropy. Thus the relative entropy weight for each column $i$ of the matrix is given by:
Equation 4.2
$$\omega_{i} = \sum_{j \in { A,C,G,T }}^{}{M_{j,i}\log_2( \frac{M_{j,i}}{b_{i}} )}\textrm{\ \ \ \ \ where\ \ \ \ \ }i = 1,\ldots n$$
where $b_{i}$ is again the background frequency of the letter $i$.
Thus, there are 3 possible algorithms to apply via the method
argument. The first is the standard summation of log probabilities
(method='log'
). The second and third are the weighted sum and
information content methods (method='default'
and method='ic'
) specified by
equations 4.1 and 4.2, respectively.
Additionally, now, with the use of TFMPvalue-package
, we may filter by p-value of the match.
This is unfortunately a two step process. First, by invoking filterp=TRUE
and setting a threshold at
a desired p-value e.g 1e-4, we perform a rough filter on the results by rounding all values in the PWM to two
decimal place, and calculating a scoring threshold based upon that. The second step is to use the function calculatePvalue()
on a selection of results which will change the Refpvalue
and Altpvalue
columns in the output from NA
to the p-value
calculated by TFMsc2pv
. This can be (although not always) a very memory and time intensive process if the algorithm doesn't converge rapidly.
snps.from.rsid
and snps.from.file
for
information about how to generate the input to this function and
plotMB
for information on how to visualize it's outputlibrary(BSgenome.Hsapiens.UCSC.hg19)
# prepare variants
load(system.file("extdata",
"pca.enhancer.snps.rda",
package = "motifbreakR")) # loads snps.mb
pca.enhancer.snps <- sample(snps.mb, 20)
# Get motifs to interrogate
data(hocomoco)
motifs <- sample(hocomoco, 50)
# run motifbreakR
results <- motifbreakR(pca.enhancer.snps,
motifs, threshold = 0.85,
method = "ic",
BPPARAM=BiocParallel::SerialParam())
Run the code above in your browser using DataLab