pairwiseAlignment: Optimal Pairwise Alignment

Description

Solves (Needleman-Wunsch) global alignment, (Smith-Waterman) local alignment, and (ends-free) overlap alignment problems.

Usage

pairwiseAlignment(pattern, subject, ...)
"pairwiseAlignment"(pattern, subject, patternQuality=PhredQuality(22L), subjectQuality=PhredQuality(22L), type="global", substitutionMatrix=NULL, fuzzyMatrix=NULL, gapOpening=10, gapExtension=4, scoreOnly=FALSE)
"pairwiseAlignment"(pattern, subject, type="global", substitutionMatrix=NULL, fuzzyMatrix=NULL,  gapOpening=10, gapExtension=4, scoreOnly=FALSE)

Arguments

pattern

a character vector of any length, an XString, or an XStringSet object.

subject

a character vector of length 1, an XString, or an XStringSet object of length 1.

patternQuality, subjectQuality

objects of class XStringQuality representing the respective quality scores for pattern and subject that are used in a quality-based method for generating a substitution matrix. These two arguments are ignored if !is.null(substitutionMatrix) or if its respective string set (pattern, subject) is of class QualityScaledXStringSet.

type

type of alignment. One of "global", "local", "overlap", "global-local", and "local-global" where "global" = align whole strings with end gap penalties, "local" = align string fragments, "overlap" = align whole strings without end gap penalties, "global-local" = align whole strings in pattern with consecutive subsequence of subject, "local-global" = align consecutive subsequence of pattern with whole strings in subject.

substitutionMatrix

substitution matrix representing the fixed substitution scores for an alignment. It cannot be used in conjunction with patternQuality and subjectQuality arguments.

fuzzyMatrix

fuzzy match matrix for quality-based alignments. It takes values between 0 and 1; where 0 is an unambiguous mismatch, 1 is an unambiguous match, and values in between represent a fraction of "matchiness". (See details section below.)

gapOpening

the cost for opening a gap in the alignment.

gapExtension

the incremental cost incurred along the length of the gap in the alignment.

scoreOnly

logical to denote whether or not to return just the scores of the optimal pairwise alignment.

...

optional arguments to generic function to support additional methods.

Value

If scoreOnly == FALSE, an instance of class PairwiseAlignments or PairwiseAlignmentsSingleSubject is returned. If scoreOnly == TRUE, a numeric vector containing the scores for the optimal pairwise alignments is returned.

Details

Quality-based alignments are based on the paper the Bioinformatics article by Ketil Malde listed in the Reference section below. Let $\epsilon_i$ be the probability of an error in the base read. For "Phred" quality measures $Q$ in $[0, 99]$, these error probabilities are given by $\epsilon_i = 10^{-Q/10}$. For "Solexa" quality measures $Q$ in $[-5, 99]$, they are given by $\epsilon_i = 1 - 1/(1 + 10^{-Q/10})$. Assuming independence within and between base reads, the combined error probability of a mismatch when the underlying bases do match is $\epsilon_c = \epsilon_1 + \epsilon_2 - (n/(n-1)) * \epsilon_1 * \epsilon_2$, where $n$ is the number of letters in the underlying alphabet (i.e. $n = 4$ for DNA input, $n = 20$ for amino acid input, otherwise $n$ is the number of distinct letters in the input). Using $\epsilon_c$, the substitution score is given by $b * \log_2(\gamma_{x,y} * (1 - \epsilon_c) * n + (1 - \gamma_{x,y}) * \epsilon_c * (n/(n-1)))$, where $b$ is the bit-scaling for the scoring and $\gamma_{x,y}$ is the probability that characters $x$ and $y$ represents the same underlying information (e.g. using IUPAC, $\gamma_{A,A} = 1$ and $\gamma_{A,N} = 1/4$. In the arguments listed above fuzzyMatch represents $\gamma_{x,y}$ and patternQuality and subjectQuality represents $\epsilon_1$ and $\epsilon_2$ respectively.

If scoreOnly == FALSE, a pairwise alignment with the maximum alignment score is returned. If more than one pairwise alignment produces the maximum alignment score, then the alignment with the smallest initial deletion whose mismatches occur before its insertions and deletions is chosen. For example, if pattern = "AGTA" and subject = "AACTAACTA", then the alignment pattern: [1] AG-TA; subject: [1] AACTA is chosen over pattern: [1] A-GTA; subject: [1] AACTA or pattern: [1] AG-TA; subject: [5] AACTA if they all achieve the maximum alignment score.

References

R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis, Cambridge UP 1998, sec 2.3.

B. Haubold, T. Wiehe, Introduction to Computational Biology, Birkhauser Verlag 2006, Chapter 2.

K. Malde, The effect of sequence quality on sequence alignment, Bioinformatics 2008 24(7):897-900.

Examples

Run this code

  ## Nucleotide global, local, and overlap alignments
  s1 <- 
    DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG")
  s2 <-
    DNAString("GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC")

  # First use a fixed substitution matrix
  mat <- nucleotideSubstitutionMatrix(match = 1, mismatch = -3, baseOnly = TRUE)
  globalAlign <-
    pairwiseAlignment(s1, s2, substitutionMatrix = mat,
                      gapOpening = 5, gapExtension = 2)
  localAlign <-
    pairwiseAlignment(s1, s2, type = "local", substitutionMatrix = mat,
                      gapOpening = 5, gapExtension = 2)
  overlapAlign <-
    pairwiseAlignment(s1, s2, type = "overlap", substitutionMatrix = mat,
                      gapOpening = 5, gapExtension = 2)

  # Then use quality-based method for generating a substitution matrix
  pairwiseAlignment(s1, s2,
                    patternQuality = SolexaQuality(rep(c(22L, 12L), times = c(36, 18))),
                    subjectQuality = SolexaQuality(rep(c(22L, 12L), times = c(40, 20))),
                    scoreOnly = TRUE)

  # Now assume can't distinguish between C/T and G/A
  pairwiseAlignment(s1, s2,
                    patternQuality = SolexaQuality(rep(c(22L, 12L), times = c(36, 18))),
                    subjectQuality = SolexaQuality(rep(c(22L, 12L), times = c(40, 20))),
                    type = "local")
  mapping <- diag(4)
  dimnames(mapping) <- list(DNA_BASES, DNA_BASES)
  mapping["C", "T"] <- mapping["T", "C"] <- 1
  mapping["G", "A"] <- mapping["A", "G"] <- 1
  pairwiseAlignment(s1, s2,
                    patternQuality = SolexaQuality(rep(c(22L, 12L), times = c(36, 18))),
                    subjectQuality = SolexaQuality(rep(c(22L, 12L), times = c(40, 20))),
                    fuzzyMatrix = mapping,
                    type = "local")

  ## Amino acid global alignment
  pairwiseAlignment(AAString("PAWHEAE"), AAString("HEAGAWGHEE"),
                    substitutionMatrix = "BLOSUM50",
                    gapOpening = 0, gapExtension = 8)

Run the code above in your browser using DataLab