motifbreakR(snpList, pwmList, threshold = 0.85, filterp = FALSE, method = "default", show.neutral = FALSE, verbose = FALSE, bkg = c(A = 0.25, C = 0.25, G = 0.25, T = 0.25), BPPARAM = bpparam())
snps.from.rsid
or snps.from.file
MotifList
containing the motifs that
you wish to interrogatedefault
, log
, ic
, or notrans
; see
details.log
method=ic
register
and see getClass("BiocParallelParam")
for additional parameter
classes. Try BiocParallel::registered()
to see what's availible and
for example BiocParallel::bpparam("SerialParam")
would allow serial
evaluation.filterp = FALSE
,
this is the value that is thresholded.filterp = FALSE
,
this is the value that is thresholded.NA
. see calculatePvalue
for more informationNA
. see calculatePvalue
for more informationmotifPos
in the motifmotifPos
in the motifplotMB
method
argument.Suppose we have a frequency matrix $M$ of width $n$ (i.e. a PPM as described above). Furthermore, we have a sequence $s$ also of length $n$, such that $s_i in {A,T,G,C}, i = 1 \ldots n$. Each column of $M$ contains the frequencies of each letter in each position.
Commonly in the literature sequences are scored as the sum of log probabilities:
Equation 1
$$F( s,M ) = \sum_{i = 1}^{n}{\log( \frac{M_{s_{i},i}}{b_{s_{i}}} )}$$
where $b_s_i$ is the background frequency of letter $s_i$ in
the genome of interest. This method can be specified by the user as
method='log'
.
As an alternative to this method, we introduced a scoring method to
directly weight the score by the importance of the position within the
match sequence. This method of weighting is accessed by specifying
method='ic'
(information content). A general representation
of this scoring method is given by:
Equation 2
$$F( s,M ) = p_{s} \cdot \omega_{M}$$
where $p_s$ is the scoring vector derived from sequence $s$ and matrix $M$, and $w_M$ is a weight vector derived from $M$. First, we compute the scoring vector of position scores $p$
Equation 3
$$p_{s} = ( M_{s_{i},i} ) \textrm{\ \ \ where\ \ \ } \frac{i = 1,\ldots n}{s_{i} \in \{ A,C,G,T \}}$$
and second, for each $M$ a constant vector of weights $\omega_M = ( \omega_1, \omega_2, \ldots, \omega_n)$.
There are two methods for producing $\omega_M$. The first, which we call weighted sum, is the difference in the probabilities for the two letters of the polymorphism (or variant), i.e. $\Delta p_s_i$, or the difference of the maximum and minimum values for each column of $M$:
Equation 4.1
$$\omega_{i} = \max \{ M_{i} \} - \min \{ M_{i} \}\textrm{\ \ \ \ where\ \ \ \ \ \ }i = 1,\ldots n$$
The second variation of this theme is to weight by relative entropy. Thus the relative entropy weight for each column $i$ of the matrix is given by:
Equation 4.2
$$\omega_{i} = \sum_{j \in \{ A,C,G,T \}}^{}{M_{j,i}\log_2( \frac{M_{j,i}}{b_{i}} )}\textrm{\ \ \ \ \ where\ \ \ \ \ }i = 1,\ldots n$$
where $b_i$ is again the background frequency of the letter $i$.
Thus, there are 3 possible algorithms to apply via the method
argument. The first is the standard summation of log probabilities
(method='log'
). The second and third are the weighted sum and
information content methods (method='default'
and method='ic'
) specified by
equations 4.1 and 4.2, respectively. motifbreakR assumes a
uniform background nucleotide distribution ($b$) in equations 1 and
4.2 unless otherwise specified by the user. Since we are primarily
interested in the difference between alleles, background frequency is
not a major factor, although it can change the results. Additionally,
inclusion of background frequency introduces potential bias when
collections of motifs are employed, since motifs are themselves
unbalanced with respect to nucleotide composition. With these cautions
in mind, users may override the uniform distribution if so desired. For
all three methods, motifbreakR scores and reports the reference
and alternate alleles of the sequence
($F( s_ref,M )$ and
$F( s_alt,M )$), and provides the matrix scores
$p_s_ref$ and $p_s_alt$ of the SNP (or
variant). The scores are scaled as a fraction of scoring range 0-1 of
the motif matrix, $M$. If either of
$F( s_ref,M )$ and
$F( s_alt,M )$ is greater than a user-specified
threshold (default value of 0.85) the SNP is reported. By default
motifbreakR does not display neutral effects,
($\Delta p_i < 0.4$) but this behaviour can be
overridden.
Additionally, now, with the use of TFMPvalue-package
, we may filter by p-value of the match.
This is unfortunately a two step process. First, by invoking filterp=TRUE
and setting a threshold at
a desired p-value e.g 1e-4, we perform a rough filter on the results by rounding all values in the PWM to two
decimal place, and calculating a scoring threshold based upon that. The second step is to use the function calculatePvalue()
on a selection of results which will change the Refpvalue
and Altpvalue
columns in the output from NA
to the p-value
calculated by TFMsc2pv
. This can be (although not always) a very memory and time intensive process if the algorithm doesn't converge rapidly.
snps.from.rsid
and snps.from.file
for
information about how to generate the input to this function and
plotMB
for information on how to visualize it's output
library(BSgenome.Hsapiens.UCSC.hg19)
# prepare variants
load(system.file("extdata",
"pca.enhancer.snps.rda",
package = "motifbreakR")) # loads snps.mb
pca.enhancer.snps <- sample(snps.mb, 20)
# Get motifs to interrogate
data(hocomoco)
motifs <- sample(hocomoco, 50)
# run motifbreakR
results <- motifbreakR(pca.enhancer.snps,
motifs, threshold = 0.85,
method = "ic",
BPPARAM=BiocParallel::SerialParam())
Run the code above in your browser using DataLab