pairwiseAlignment(pattern, subject, ...)
"pairwiseAlignment"(pattern, subject, patternQuality=PhredQuality(22L), subjectQuality=PhredQuality(22L), type="global", substitutionMatrix=NULL, fuzzyMatrix=NULL, gapOpening=10, gapExtension=4, scoreOnly=FALSE)
"pairwiseAlignment"(pattern, subject, type="global", substitutionMatrix=NULL, fuzzyMatrix=NULL, gapOpening=10, gapExtension=4, scoreOnly=FALSE)
XString
, or
an XStringSet
object.XString
, or
an XStringSet
object of length 1.XStringQuality
representing the respective quality scores for
pattern
and subject
that are used in a quality-based method
for generating a substitution matrix. These two arguments are ignored if
!is.null(substitutionMatrix)
or if its respective string set
(pattern
, subject
) is of class
QualityScaledXStringSet
."global"
, "local"
,
"overlap"
, "global-local"
, and "local-global"
where
"global"
= align whole strings with end gap penalties,
"local"
= align string fragments,
"overlap"
= align whole strings without end gap penalties,
"global-local"
= align whole strings in pattern
with
consecutive subsequence of subject
,
"local-global"
= align consecutive subsequence of pattern
with whole strings in subject
.patternQuality
and subjectQuality
arguments.scoreOnly == FALSE
, an instance of class
PairwiseAlignments
or
PairwiseAlignmentsSingleSubject
is returned.
If scoreOnly == TRUE
, a numeric vector containing the scores for the
optimal pairwise alignments is returned.
"Phred"
quality measures
$Q$ in $[0, 99]$, these error probabilities are given by
$\epsilon_i = 10^{-Q/10}$. For "Solexa"
quality measures $Q$ in
$[-5, 99]$, they are given by $\epsilon_i = 1 - 1/(1 + 10^{-Q/10})$.
Assuming independence within and between base reads, the combined error
probability of a mismatch when the underlying bases do match is
$\epsilon_c = \epsilon_1 + \epsilon_2 - (n/(n-1)) * \epsilon_1 * \epsilon_2$,
where $n$ is the number of letters in the underlying alphabet (i.e.
$n = 4$ for DNA input, $n = 20$ for amino acid input, otherwise
$n$ is the number of distinct letters in the input).
Using $\epsilon_c$, the substitution score is given by
$b * \log_2(\gamma_{x,y} * (1 - \epsilon_c) * n + (1 - \gamma_{x,y}) * \epsilon_c * (n/(n-1)))$,
where $b$ is the bit-scaling for the scoring and $\gamma_{x,y}$ is the
probability that characters $x$ and $y$ represents the same underlying
information (e.g. using IUPAC, $\gamma_{A,A} = 1$ and $\gamma_{A,N} = 1/4$.
In the arguments listed above fuzzyMatch
represents $\gamma_{x,y}$
and patternQuality
and subjectQuality
represents $\epsilon_1$
and $\epsilon_2$ respectively.If scoreOnly == FALSE
, a pairwise alignment with the maximum alignment
score is returned. If more than one pairwise alignment produces the maximum
alignment score, then the alignment with the smallest initial deletion whose
mismatches occur before its insertions and deletions is chosen. For example,
if pattern = "AGTA"
and subject = "AACTAACTA"
, then the alignment
pattern: [1] AG-TA; subject: [1] AACTA
is chosen over
pattern: [1] A-GTA; subject: [1] AACTA
or
pattern: [1] AG-TA; subject: [5] AACTA
if they all achieve the maximum
alignment score.
B. Haubold, T. Wiehe, Introduction to Computational Biology, Birkhauser Verlag 2006, Chapter 2.
K. Malde, The effect of sequence quality on sequence alignment, Bioinformatics 2008 24(7):897-900.
writePairwiseAlignments
,
stringDist
,
PairwiseAlignments-class,
XStringQuality-class,
substitution.matrices,
matchPattern
## Nucleotide global, local, and overlap alignments s1 <- DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG") s2 <- DNAString("GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC") # First use a fixed substitution matrix mat <- nucleotideSubstitutionMatrix(match = 1, mismatch = -3, baseOnly = TRUE) globalAlign <- pairwiseAlignment(s1, s2, substitutionMatrix = mat, gapOpening = 5, gapExtension = 2) localAlign <- pairwiseAlignment(s1, s2, type = "local", substitutionMatrix = mat, gapOpening = 5, gapExtension = 2) overlapAlign <- pairwiseAlignment(s1, s2, type = "overlap", substitutionMatrix = mat, gapOpening = 5, gapExtension = 2) # Then use quality-based method for generating a substitution matrix pairwiseAlignment(s1, s2, patternQuality = SolexaQuality(rep(c(22L, 12L), times = c(36, 18))), subjectQuality = SolexaQuality(rep(c(22L, 12L), times = c(40, 20))), scoreOnly = TRUE) # Now assume can't distinguish between C/T and G/A pairwiseAlignment(s1, s2, patternQuality = SolexaQuality(rep(c(22L, 12L), times = c(36, 18))), subjectQuality = SolexaQuality(rep(c(22L, 12L), times = c(40, 20))), type = "local") mapping <- diag(4) dimnames(mapping) <- list(DNA_BASES, DNA_BASES) mapping["C", "T"] <- mapping["T", "C"] <- 1 mapping["G", "A"] <- mapping["A", "G"] <- 1 pairwiseAlignment(s1, s2, patternQuality = SolexaQuality(rep(c(22L, 12L), times = c(36, 18))), subjectQuality = SolexaQuality(rep(c(22L, 12L), times = c(40, 20))), fuzzyMatrix = mapping, type = "local") ## Amino acid global alignment pairwiseAlignment(AAString("PAWHEAE"), AAString("HEAGAWGHEE"), substitutionMatrix = "BLOSUM50", gapOpening = 0, gapExtension = 8)