align performs a multiple alignment on a list of
sequences using profile hidden Markov models.
align(x, ...)
# S3 method for DNAbin
align(
x,
model = NULL,
progressive = FALSE,
seeds = NULL,
seqweights = "Henikoff",
refine = "Viterbi",
k = 5,
maxiter = 100,
maxsize = NULL,
inserts = "map",
lambda = 0,
threshold = 0.5,
deltaLL = 1e-07,
DI = FALSE,
ID = FALSE,
residues = NULL,
gap = "-",
pseudocounts = "background",
qa = NULL,
qe = NULL,
cores = 1,
quiet = FALSE,
...
)
# S3 method for AAbin
align(
x,
model = NULL,
progressive = FALSE,
seeds = NULL,
seqweights = "Henikoff",
refine = "Viterbi",
k = 5,
maxiter = 100,
maxsize = NULL,
inserts = "map",
lambda = 0,
threshold = 0.5,
deltaLL = 1e-07,
DI = FALSE,
ID = FALSE,
residues = NULL,
gap = "-",
pseudocounts = "background",
qa = NULL,
qe = NULL,
cores = 1,
quiet = FALSE,
...
)
# S3 method for list
align(
x,
model = NULL,
progressive = FALSE,
seeds = NULL,
seqweights = "Henikoff",
k = 5,
refine = "Viterbi",
maxiter = 100,
maxsize = NULL,
inserts = "map",
lambda = 0,
threshold = 0.5,
deltaLL = 1e-07,
DI = FALSE,
ID = FALSE,
residues = NULL,
gap = "-",
pseudocounts = "background",
qa = NULL,
qe = NULL,
cores = 1,
quiet = FALSE,
...
)
# S3 method for default
align(
x,
model,
pseudocounts = "background",
residues = NULL,
gap = "-",
maxsize = NULL,
quiet = FALSE,
...
)
a matrix of aligned sequences, with the same mode and class as the input sequence list.
a list of DNA, amino acid, or other character sequences
consisting of symbols emitted from the chosen residue alphabet.
The vectors can either be of mode "raw" (consistent with the "DNAbin"
or "AAbin" coding scheme set out in the ape package), or "character",
in which case the alphabet should be specified in the residues argument.
This argument can alternatively be a vector representing a single sequence.
In this case, and if the second argument is also a single sequence,
a standard pairwise alignment is returned.
additional arguments to be passed to "Viterbi" (if refine = "Viterbi")
or "forward" (if refine = "BaumWelch").
an optional profile hidden Markov model (a "PHMM" object) to align
the sequences to. If NULL, a PHMM will be derived from the list of
sequences, and each sequence will be aligned back to the model to
produce the multiple sequence alignment.
logical indicating whether the alignment used to derive the initial model parameters should be built progressively (assuming input is a list of unaligned sequences, ignored otherwise). Defaults to FALSE, in which case the longest sequence or sequences are used (faster, but possibly less accurate).
optional integer vector indicating which sequences should
be used as seeds for building the guide tree for the progressive
alignment (assuming input is a list of unaligned sequences
and progressive = TRUE; ignored otherwise).
Defaults to NULL, in which case a set of log(n, 2)^2 non-identical
sequences is chosen from the list of sequences by k-means clustering.
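To give a feel for how the default seed count grows with the size of the input, the sketch below evaluates log(n, 2)^2 for a few list sizes. The helper name seed_count is hypothetical, and rounding up and capping at n are assumptions made for this illustration; the text above does not state how non-integer values are handled.

```r
# Hypothetical helper: default number of seed sequences for n inputs.
# Rounding up and capping at n are assumptions for illustration only.
seed_count <- function(n) {
  min(n, ceiling(log(n, 2)^2))
}

# Seed counts for lists of 16, 100 and 1000 sequences.
sapply(c(16, 100, 1000), seed_count)
```

The count grows slowly relative to n, so for large inputs only a small fraction of the sequences would take part in building the guide tree.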
either NULL (all sequences are given weights of 1), a numeric vector
the same length as x representing the sequence weights used to derive
the model, or a character string giving the method to derive the
weights from the sequences (see weight).
the method used to iteratively refine the model parameters
following the initial progressive alignment and model derivation step.
Currently supported options are "Viterbi" (Viterbi training;
the default option), "BaumWelch" (a modified version of the
Expectation-Maximization algorithm), and "none" (skips the model
refinement step).
integer representing the k-mer size to be used in tree-based sequence weighting (if applicable). Defaults to 5. Note that higher values of k may be slow to compute and use excessive memory due to the large numbers of calculations required.
the maximum number of EM iterations or Viterbi training iterations to carry out before the cycling process is terminated and the partially trained model is returned. Defaults to 100.
integer giving the upper bound on the number of modules in the PHMM. If NULL no maximum size is enforced.
character string giving the model construction method
in which alignment columns are marked as either match or insert states.
Accepted methods include "threshold" (only columns with fewer than a
specified proportion of gaps form match states in the model), "map"
(default; match and insert columns are found using the maximum a
posteriori method outlined in Durbin et al (1998) chapter 5.7),
"inherited" (match and insert columns are inherited from the input
alignment), and "none" (all columns are assigned match states in the model).
Alternatively, insert columns can be specified manually by providing a
logical vector the same length as the number of columns in the alignment,
with TRUE for insert columns and FALSE for match states.
penalty parameter used to favour models with fewer match
states. Equivalent to the log of the prior probability of marking each
column (Durbin et al 1998, chapter 5.7). Only applicable when
inserts = "map".
the maximum proportion of gaps for an alignment column
to be considered for a match state in the PHMM (defaults to 0.5).
Only applicable when inserts = "threshold".
Note that the maximum a posteriori method works poorly for alignments
with few sequences, so the "threshold" method is automatically used
when the number of sequences is less than 5.
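The threshold rule described above can be sketched in base R on a toy alignment. This is an illustration only, not the package's implementation: the matrix aln is made up, and whether a column sitting exactly at the threshold counts as a match state is an assumption here.

```r
# Toy alignment: rows are sequences, columns are alignment positions.
aln <- rbind(
  c("A", "C", "-", "G", "T"),
  c("A", "-", "-", "G", "T"),
  c("A", "C", "-", "-", "T"),
  c("A", "C", "G", "G", "-")
)
gap <- "-"
threshold <- 0.5

# Proportion of gap characters in each column.
gap_prop <- colMeans(aln == gap)

# Columns whose gap proportion exceeds the threshold are marked as inserts;
# the remaining columns become match states.
# (Treatment of columns exactly at the threshold is an assumption.)
is_insert <- gap_prop > threshold
is_insert
```

A logical vector of this form, with TRUE for insert columns and FALSE for match states, is also what the inserts argument accepts when columns are specified manually.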
numeric, the change in log likelihood between EM iterations below
which the cycling procedure is terminated (signifying model
convergence). Defaults to 1E-07. Only applicable if
refine = "BaumWelch".
logical indicating whether delete-insert transitions should be allowed in the profile hidden Markov model (if applicable). Defaults to FALSE.
logical indicating whether insert-delete transitions should be allowed in the profile hidden Markov model (if applicable). Defaults to FALSE.
either NULL (default; emitted residues are automatically
detected from the sequences), a case sensitive character vector
specifying the residue alphabet, or one of the character strings
"RNA", "DNA", "AA", "AMINO". Note that the default option can be slow for
large lists of character vectors. Furthermore, the default setting
residues = NULL will not detect rare residues that are not present
in the sequences, and thus will not assign them emission probabilities
in the model. Specifying the residue alphabet is therefore
recommended unless x is a "DNAbin" or "AAbin" object.
the character used to represent gaps in the alignment matrix.
Ignored for "DNAbin" or "AAbin" objects. Defaults to "-" otherwise.
character string, either "background", "Laplace"
or "none". Used to account for the possible absence of certain
transition and/or emission types in the input sequences.
If pseudocounts = "background" (default), pseudocounts
are calculated from the background transition and emission
frequencies in the sequences.
If pseudocounts = "Laplace", one of each possible transition
and emission type is added to the transition and emission counts.
If pseudocounts = "none", no pseudocounts are added (not
generally recommended, since low frequency transition/emission types
may be excluded from the model).
Alternatively this argument can be a two-element list containing
a matrix of transition pseudocounts as its first element and a matrix
of emission pseudocounts as its second.
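The effect of Laplace pseudocounts can be seen on a single set of emission counts. The sketch below is base R arithmetic on made-up counts for one match state; it illustrates the add-one rule described above and is not taken from the package's code.

```r
# Toy emission counts for one match state over the DNA alphabet.
counts <- c(A = 6, C = 0, G = 3, T = 0)

# Without pseudocounts, residues that were never observed get probability zero.
p_none <- counts / sum(counts)

# Laplace pseudocounts: add one to every count before normalising.
p_laplace <- (counts + 1) / sum(counts + 1)

round(rbind(none = p_none, Laplace = p_laplace), 3)
```

With the add-one rule every residue keeps a non-zero emission probability, which is why omitting pseudocounts is not generally recommended.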
an optional named 9-element vector of background transition
probabilities with names(qa) = c("DD", "DM", "DI", "MD", "MM",
"MI", "ID", "IM", "II"), where M, I and D represent match, insert and
delete states, respectively. If NULL, background transition
probabilities are estimated from the sequences.
an optional named vector of background emission probabilities
the same length as the residue alphabet (i.e. 4 for nucleotides and 20
for amino acids) and with corresponding names (i.e. c("A", "T",
"G", "C") for DNA). If qe = NULL, background emission probabilities
are automatically derived from the sequences.
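For readers who want to supply qe themselves, the sketch below builds a named emission vector from simple relative frequencies in a toy list of DNA sequences. The data are made up, and plain relative frequency is an assumption for illustration; the package's automatic estimate may differ, for example through the use of pseudocounts.

```r
# Toy list of DNA sequences as character vectors.
seqs <- list(
  c("A", "C", "G", "T", "A"),
  c("A", "A", "G", "C"),
  c("T", "A", "G")
)
residues <- c("A", "T", "G", "C")

# Count each residue across all sequences, keeping zero counts for
# residues that do not occur.
counts <- table(factor(unlist(seqs), levels = residues))

# Named vector of relative frequencies, one entry per residue.
qe <- setNames(as.numeric(counts) / sum(counts), residues)
qe
```

The resulting vector has one named entry per residue and sums to one, matching the shape described for the qe argument.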
integer giving the number of CPUs to parallelize the operation
over. Defaults to 1, and reverts to 1 if x is not a list.
This argument may alternatively be a "cluster" object,
in which case it is the user's responsibility to close the socket
connection at the conclusion of the operation,
for example by running parallel::stopCluster(cores).
The string "autodetect" is also accepted, in which case the maximum
number of cores to use is one less than the total number of cores available.
Note that in this case there may be a tradeoff in terms of speed
depending on the number and size of sequences to be aligned,
due to the extra time required to initialize the cluster.
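The cluster lifecycle mentioned above can be sketched with the parallel package alone. The worker-count rule mirrors the "one less than available" description; the cap at two workers and the squaring task are placeholders chosen for this illustration, standing in for the real work that would receive the cluster object.

```r
# One fewer worker than the cores detected, but at least one.
# detectCores() can return NA on some platforms, hence the fallback.
n_detected <- parallel::detectCores()
n_workers <- if (is.na(n_detected)) 1L else max(1L, n_detected - 1L)

# Create the cluster (capped at 2 workers to keep this demo light).
cl <- parallel::makeCluster(min(n_workers, 2L))

# Use the cluster, making sure it is closed even if the work fails.
res <- tryCatch(
  parallel::parLapply(cl, 1:3, function(i) i^2),  # placeholder task
  finally = parallel::stopCluster(cl)
)
unlist(res)
```

Wrapping the work in tryCatch with a finally clause is one way to guarantee that the socket connections are closed, which matters because a user-supplied cluster is not closed automatically.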
logical indicating whether feedback should be printed to the console.
Shaun Wilkinson
This function builds a multiple sequence alignment using profile hidden Markov models. The default behaviour is to select the longest sequence in the set that has the lowest sequence weight, derive a profile HMM from that single sequence, and iteratively train the model using the entire sequence set. Training can be carried out using either the Baum Welch or Viterbi training algorithm, with the latter being significantly faster, particularly when multi-threading is used. Once the model parameters have converged (Baum Welch) or no variation is seen in the sequential alignments (Viterbi training), the sequences are aligned to the profile HMM to produce the alignment matrix. The preceding steps can be omitted if a pre-trained profile HMM is passed to the function via the "model" argument.
If progressive = TRUE, the function alternatively uses a
progressive alignment procedure similar to the Clustal Omega algorithm
(Sievers et al 2011). This involves an initial progressive multiple
sequence alignment via a guide tree,
followed by the derivation of a profile hidden Markov model
from the alignment, an iterative model refinement step,
and finally the alignment of the sequences back to the model as above.
If only two sequences are provided, a standard pairwise alignment is carried out using the Needleman-Wunsch or Smith-Waterman algorithm.
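As background for the pairwise case, the sketch below computes a global (Needleman-Wunsch) alignment score by dynamic programming in base R. The function nw_score is a hypothetical teaching example, not the package's implementation, and the match, mismatch and linear gap scores are illustrative assumptions rather than package defaults.

```r
# Global alignment score by dynamic programming (Needleman-Wunsch).
# Scoring values are illustrative assumptions, not package defaults.
nw_score <- function(x, y, match = 1, mismatch = -1, gap = -1) {
  n <- length(x)
  m <- length(y)
  # S[i + 1, j + 1] holds the best score for aligning x[1:i] with y[1:j].
  S <- matrix(0, nrow = n + 1, ncol = m + 1)
  S[, 1] <- gap * (0:n)  # x prefix aligned against gaps only
  S[1, ] <- gap * (0:m)  # y prefix aligned against gaps only
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      sub <- if (x[i] == y[j]) match else mismatch
      S[i + 1, j + 1] <- max(
        S[i, j] + sub,      # align x[i] with y[j]
        S[i, j + 1] + gap,  # x[i] aligned to a gap
        S[i + 1, j] + gap   # y[j] aligned to a gap
      )
    }
  }
  S[n + 1, m + 1]
}

x <- c("A", "C", "G", "T")
y <- c("A", "G", "T")
nw_score(x, y)
```

The Smith-Waterman (local) variant uses the same recursion but never lets a cell drop below zero and reports the maximum cell rather than the bottom-right one.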
Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.
Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Soding J, Thompson JD, Higgins DG (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology, 7, 539.
unalign
## Protein pairwise alignment example from Durbin et al (1998) chapter 2.
x <- c("H", "E", "A", "G", "A", "W", "G", "H", "E", "E")
y <- c("P", "A", "W", "H", "E", "A", "E")
sequences <- list(x = x, y = y)
glo <- align(sequences, type = "global")
sem <- align(sequences, type = "semiglobal")
loc <- align(sequences, type = "local")
glo
sem
loc
# \donttest{
## Deconstruct the woodmouse alignment and re-align
library(ape)
data(woodmouse)
tmp <- unalign(woodmouse)
x <- align(tmp, windowspace = "WilburLipman")
# }