AlignProfiles: Align Two Sets of Aligned Sequences

Description

Aligns two sets of one or more aligned sequences by first generating representative profiles, then aligning the profiles with dynamic programming, and finally merging the two aligned sequence sets.

Usage

AlignProfiles(pattern, subject, p.weight = 1, s.weight = 1, perfectMatch = NULL, misMatch = NULL, gapOpening = NULL, gapExtension = NULL, terminalGap = -1, restrict = -1000, anchor = 0.7, substitutionMatrix = NULL, processors = NULL)

Arguments

pattern

An AAStringSet, DNAStringSet, or RNAStringSet object of aligned sequences to use as the pattern.

subject

A XStringSet object of aligned sequences to use as the subject. Must match the type of the pattern.

p.weight

A numeric vector of weights for each sequence in the pattern to use in generating a profile, or a single number implying equal weights.

s.weight

A numeric vector of weights for each sequence in the subject to use in generating a profile, or a single number implying equal weights.

perfectMatch

Numeric giving the reward for aligning two matching nucleotides in the alignment, or NULL to determine the value based on input type (DNA/RNA/AA).

misMatch

Numeric giving the cost for aligning two mismatched nucleotides in the alignment, or NULL to determine the value based on input type (DNA/RNA/AA).

gapOpening

Numeric giving the cost for opening a gap in the alignment, or NULL to determine the value based on input type (DNA/RNA/AA).

gapExtension

Numeric giving the cost for extending an open gap in the alignment, or NULL to determine the value based on input type (DNA/RNA/AA).

terminalGap

Numeric giving the cost for allowing leading and trailing gaps ("-" or "." characters) in the alignment. Either two numbers, the first for leading gaps and the second for trailing gaps, or a single number for both.

restrict

Numeric specifying the lowest relative score to consider when aligning. The default (-1000) will align most inputs that can reasonably be globally aligned without any loss in accuracy. Input sequences with high similarity could be more restricted (e.g., -500), whereas a pattern and subject with little overlap should be less restricted (e.g., -10000). (See details section below.)

anchor

Numeric giving the fraction of sequences with identical k-mers required to become an anchor point, or NA to not use anchors. (See details section below.)

substitutionMatrix

Either a substitution matrix representing the substitution scores for an alignment or the name of the amino acid substitution matrix to use in alignment. The latter may be one of the following: ``BLOSUM45'', ``BLOSUM50'', ``BLOSUM62'', ``BLOSUM80'', ``BLOSUM100'', ``PAM30'', ``PAM40'', ``PAM70'', ``PAM120'', ``PAM250''. The default (NULL) will use the perfectMatch and misMatch penalties for DNA/RNA or ``BLOSUM62'' for AA. (See examples section below.)

processors

The number of processors to use, or NULL (the default) for all available processors.

Value

An XStringSet of aligned sequences.

Details

Profiles are aligned using dynamic programming, a variation of the Needleman-Wunsch algorithm for global alignment. The dynamic programming method requires order N*M time and memory space where N and M are the width of the pattern and subject. This method works by filling in a matrix of the possible ``alignment space'' by considering all matches, insertions, and deletions between two sequence profiles. The highest scoring alignment is then used to add gaps to each of the input sequence sets.

Heuristics can be useful to improve performance on long input sequences. The restrict parameter can be used to dynamically constrain the possible ``alignment space'' to only paths that will likely include the final alignment, which in the best case can improve the speed from quadratic time to linear time. The degree of restriction is important, and if the sequences are not mostly overlapping then restrict should be relaxed (more negative than the default). For example, if aligning a pattern to a long subject then restrict should be set to -Inf.

The argument anchor can be used to split the global alignment into multiple sub-alignments. This can greatly decrease the memory requirement for long sequences when appropriate anchor points can be found. Anchors are 15-mer (for DNA/RNA) or 7-mer (for AA) subsequences that are shared between at least anchor fraction of pattern(s) and subject(s). Anchored ranges are extended along the length of each sequence in a manner designed to split the alignment into sub-alignments that can be separately solved. For most input sequences anchor has no effect on accuracy, but anchoring can be disabled by setting anchor=NA.

Any of the input scores (perfectMatch, misMatch, gapOpening, \& gapExtension) that are NULL (the default) will be set based on the sequence type. For DNA inputs, the scores are 6, -2, -11, \& -3, respectively. For RNA they are 8, -3, -9, -2, and for AA 4, 0, -5, code-3. These values were optimized for performance using structural benchmarks for RNA/AA and an evolutionary benchmark for DNA, where the optimal alignment is known. The value of terminalGap can be varied based on the input sequences to penalize insertion of gaps at the end of the sequences less or more. For sequences with highly variable lengths, a value of 0 may be preferred.

References

Needleman S., Wunsch, C. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), 443-453.

Examples

Run this code

# align two sets of sequences
db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
dna1 <- SearchDB(db, remove="common", limit=100) # the first 100 sequences
dna2 <- SearchDB(db, remove="common", limit="100,100") # the rest
alignedDNA <- AlignProfiles(dna1, dna2)
BrowseSequences(alignedDNA, highlight=1)

# specify a DNA substitution matrix
bases <- c("A", "C", "G", "T")
subMatrix <- matrix(-3, nrow=4, ncol=4, dimnames=list(bases, bases))
diag(subMatrix) <- 6 # perfectMatch
alignedDNA.defaultSubM <- AlignProfiles(dna1, dna2, substitutionMatrix=subMatrix)
all(alignedDNA.defaultSubM==alignedDNA)

Run the code above in your browser using DataLab