pairwiseAlignment
Optimal Pairwise Alignment
Solves (NeedlemanWunsch) global alignment, (SmithWaterman) local alignment, and (endsfree) overlap alignment problems.
Usage
pairwiseAlignment(pattern, subject, ...)
"pairwiseAlignment"(pattern, subject, patternQuality=PhredQuality(22L), subjectQuality=PhredQuality(22L), type="global", substitutionMatrix=NULL, fuzzyMatrix=NULL, gapOpening=10, gapExtension=4, scoreOnly=FALSE)
"pairwiseAlignment"(pattern, subject, type="global", substitutionMatrix=NULL, fuzzyMatrix=NULL, gapOpening=10, gapExtension=4, scoreOnly=FALSE)
Arguments
 pattern
 a character vector of any length, an
XString
, or anXStringSet
object.  subject
 a character vector of length 1, an
XString
, or anXStringSet
object of length 1.  patternQuality, subjectQuality
 objects of class
XStringQuality
representing the respective quality scores forpattern
andsubject
that are used in a qualitybased method for generating a substitution matrix. These two arguments are ignored if!is.null(substitutionMatrix)
or if its respective string set (pattern
,subject
) is of classQualityScaledXStringSet
.  type
 type of alignment. One of
"global"
,"local"
,"overlap"
,"globallocal"
, and"localglobal"
where"global"
= align whole strings with end gap penalties,"local"
= align string fragments,"overlap"
= align whole strings without end gap penalties,"globallocal"
= align whole strings inpattern
with consecutive subsequence ofsubject
,"localglobal"
= align consecutive subsequence ofpattern
with whole strings insubject
.  substitutionMatrix
 substitution matrix representing the fixed
substitution scores for an alignment. It cannot be used in conjunction with
patternQuality
andsubjectQuality
arguments.  fuzzyMatrix
 fuzzy match matrix for qualitybased alignments. It takes values between 0 and 1; where 0 is an unambiguous mismatch, 1 is an unambiguous match, and values in between represent a fraction of "matchiness". (See details section below.)
 gapOpening
 the cost for opening a gap in the alignment.
 gapExtension
 the incremental cost incurred along the length of the gap in the alignment.
 scoreOnly
 logical to denote whether or not to return just the scores of the optimal pairwise alignment.
 ...
 optional arguments to generic function to support additional methods.
Details
Qualitybased alignments are based on the paper the Bioinformatics article by
Ketil Malde listed in the Reference section below. Let $\epsilon_i$ be the
probability of an error in the base read. For "Phred"
quality measures
$Q$ in $[0, 99]$, these error probabilities are given by
$\epsilon_i = 10^{Q/10}$. For "Solexa"
quality measures $Q$ in
$[5, 99]$, they are given by $\epsilon_i = 1  1/(1 + 10^{Q/10})$.
Assuming independence within and between base reads, the combined error
probability of a mismatch when the underlying bases do match is
$\epsilon_c = \epsilon_1 + \epsilon_2  (n/(n1)) * \epsilon_1 * \epsilon_2$,
where $n$ is the number of letters in the underlying alphabet (i.e.
$n = 4$ for DNA input, $n = 20$ for amino acid input, otherwise
$n$ is the number of distinct letters in the input).
Using $\epsilon_c$, the substitution score is given by
$b * \log_2(\gamma_{x,y} * (1  \epsilon_c) * n + (1  \gamma_{x,y}) * \epsilon_c * (n/(n1)))$,
where $b$ is the bitscaling for the scoring and $\gamma_{x,y}$ is the
probability that characters $x$ and $y$ represents the same underlying
information (e.g. using IUPAC, $\gamma_{A,A} = 1$ and $\gamma_{A,N} = 1/4$.
In the arguments listed above fuzzyMatch
represents $\gamma_{x,y}$
and patternQuality
and subjectQuality
represents $\epsilon_1$
and $\epsilon_2$ respectively.
If scoreOnly == FALSE
, a pairwise alignment with the maximum alignment
score is returned. If more than one pairwise alignment produces the maximum
alignment score, then the alignment with the smallest initial deletion whose
mismatches occur before its insertions and deletions is chosen. For example,
if pattern = "AGTA"
and subject = "AACTAACTA"
, then the alignment
pattern: [1] AGTA; subject: [1] AACTA
is chosen over
pattern: [1] AGTA; subject: [1] AACTA
or
pattern: [1] AGTA; subject: [5] AACTA
if they all achieve the maximum
alignment score.
Value

If
scoreOnly == FALSE
, an instance of class
PairwiseAlignments
or
PairwiseAlignmentsSingleSubject
is returned.
If scoreOnly == TRUE
, a numeric vector containing the scores for the
optimal pairwise alignments is returned.
Note
Use matchPattern
or vmatchPattern
if you need to
find all the occurrences (eventually with indels) of a given pattern in a
reference sequence or set of sequences.
Use matchPDict
if you need to match a (big) set of patterns
against a reference sequence.
References
R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis, Cambridge UP 1998, sec 2.3.
B. Haubold, T. Wiehe, Introduction to Computational Biology, Birkhauser Verlag 2006, Chapter 2.
K. Malde, The effect of sequence quality on sequence alignment, Bioinformatics 2008 24(7):897900.
See Also
writePairwiseAlignments
,
stringDist
,
PairwiseAlignmentsclass,
XStringQualityclass,
substitution.matrices,
matchPattern
Examples
## Nucleotide global, local, and overlap alignments
s1 <
DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG")
s2 <
DNAString("GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC")
# First use a fixed substitution matrix
mat < nucleotideSubstitutionMatrix(match = 1, mismatch = 3, baseOnly = TRUE)
globalAlign <
pairwiseAlignment(s1, s2, substitutionMatrix = mat,
gapOpening = 5, gapExtension = 2)
localAlign <
pairwiseAlignment(s1, s2, type = "local", substitutionMatrix = mat,
gapOpening = 5, gapExtension = 2)
overlapAlign <
pairwiseAlignment(s1, s2, type = "overlap", substitutionMatrix = mat,
gapOpening = 5, gapExtension = 2)
# Then use qualitybased method for generating a substitution matrix
pairwiseAlignment(s1, s2,
patternQuality = SolexaQuality(rep(c(22L, 12L), times = c(36, 18))),
subjectQuality = SolexaQuality(rep(c(22L, 12L), times = c(40, 20))),
scoreOnly = TRUE)
# Now assume can't distinguish between C/T and G/A
pairwiseAlignment(s1, s2,
patternQuality = SolexaQuality(rep(c(22L, 12L), times = c(36, 18))),
subjectQuality = SolexaQuality(rep(c(22L, 12L), times = c(40, 20))),
type = "local")
mapping < diag(4)
dimnames(mapping) < list(DNA_BASES, DNA_BASES)
mapping["C", "T"] < mapping["T", "C"] < 1
mapping["G", "A"] < mapping["A", "G"] < 1
pairwiseAlignment(s1, s2,
patternQuality = SolexaQuality(rep(c(22L, 12L), times = c(36, 18))),
subjectQuality = SolexaQuality(rep(c(22L, 12L), times = c(40, 20))),
fuzzyMatrix = mapping,
type = "local")
## Amino acid global alignment
pairwiseAlignment(AAString("PAWHEAE"), AAString("HEAGAWGHEE"),
substitutionMatrix = "BLOSUM50",
gapOpening = 0, gapExtension = 8)