AlignProfiles(pattern, subject, p.weight = 1, s.weight = 1, p.struct = NULL, s.struct = NULL, perfectMatch = 6, misMatch = 0, gapOpening = -13, gapExtension = -1, gapPower = -1, terminalGap = -5, restrict = -1000, anchor = 0.7, normPower = 1, substitutionMatrix = NULL, structureMatrix = NULL, processors = NULL)
AAStringSet
, DNAStringSet
, or RNAStringSet
object of aligned sequences to use as the pattern.
XStringSet
object of aligned sequences to use as the subject. Must match the type of the pattern.
NULL
(the default), or a list of matrices with one list element per sequence in the pattern. (See details section below.)
NULL
(the default), or a list of matrices with one list element per sequence in the subject. (See details section below.)
DNAStringSet
or RNAStringSet
inputs.
DNAStringSet
or RNAStringSet
inputs.
-1000
) will align most inputs that can reasonably be globally aligned without any loss in accuracy. Input sequences with high similarity could be more restricted (e.g., -500
), whereas a pattern
and subject
with little overlap should be less restricted (e.g., -10000
). (See details section below.)
NA
to not use anchors. (See details section below.)
normPower
of 0
does not normalize the scores, which results in all columns of the profiles being weighted equally, and is the optimal value for aligning fragmentary sequences. A normPower
of 1
normalizes the score for aligning two positions by their column occupancy (1 - fraction of gaps). A normPower
greater than 1
more strongly discourages aligning with ``gappy'' regions of the alignment.
perfectMatch
and misMatch
penalties for DNA/RNA or ``MIQS'' for AA. (See examples section below.)
p.struct
and s.struct
are supplied, or NULL
otherwise. (See examples section below.)
NULL
(the default) for all available processors.
XStringSet
of aligned sequences.
N*M
time and memory space where N
and M
are the width of the pattern and subject. This method works by filling in a matrix of the possible ``alignment space'' by considering all matches, insertions, and deletions between two sequence profiles. The highest scoring alignment is then used to add gaps to each of the input sequence sets.Heuristics can be useful to improve performance on long input sequences. The restrict
parameter can be used to dynamically constrain the possible ``alignment space'' to only paths that will likely include the final alignment, which in the best case can improve the speed from quadratic time to linear time. The degree of restriction is important, and if the sequences are not mostly overlapping then restrict
should be relaxed (more negative than the default). For example, if aligning a pattern to a much longer subject then restrict should be set to -1e10
to prevent restriction.
The argument anchor
can be used to split the global alignment into multiple sub-alignments. This can greatly decrease the memory requirement for long sequences when appropriate anchor points can be found. Anchors are 15-mer (for DNA/RNA) or 7-mer (for AA) subsequences that are shared between at least anchor
fraction of pattern
(s) and subject
(s). Anchored ranges are extended along the length of each sequence in a manner designed to split the alignment into sub-alignments that can be separately solved. For most input sequences anchor
has no effect on accuracy, but anchoring can be disabled by setting anchor=NA
.
The arguments p.struct
and s.struct
may be used to provide secondary structure probabilities. Each list element must contain a matrix with dimensions q*w
, where q
is the number of possible secondary structure states, and w
is the width of the unaligned pattern sequence. Values in each matrix represent the probability of the given state at that position in the sequence. A structureMatrix
must be supplied along with the structures. The function PredictHEC
can be used to predict secondary structure probabilities in the format required by AlignProfiles
.
The gap cost function is based on the observation that gap lengths are best approximated by a Zipfian distribution (Chang & Benner, 2004). The cost of inserting a gap of length L
is equal to:
gapOpening + gapExtension*sum(seq_len(L - 1)^gapPower)
when L > 1
, and gapOpen
when L = 1
. This function effectively penalizes shorter gaps significantly more than longer gaps when gapPower < 0
, and is equivalent to the affine gap penalty when gapPower
is 0
.
Needleman S., Wunsch, C. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), 443-453.
AlignDB
, AlignSeqs
, AlignTranslation
, MIQS
# align two sets of sequences
db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
dna1 <- SearchDB(db, remove="common", limit=100) # the first 100 sequences
dna2 <- SearchDB(db, remove="common", limit="100,100") # the rest
alignedDNA <- AlignProfiles(dna1, dna2)
BrowseSeqs(alignedDNA, highlight=1)
# specify a DNA substitution matrix
bases <- c("A", "C", "G", "T")
subMatrix <- matrix(0, nrow=4, ncol=4, dimnames=list(bases, bases))
diag(subMatrix) <- 6 # perfectMatch
alignedDNA.defaultSubM <- AlignProfiles(dna1, dna2, substitutionMatrix=subMatrix)
all(alignedDNA.defaultSubM==alignedDNA)
Run the code above in your browser using DataLab