Update model parameters using a list of training sequences, with either the Viterbi training or Baum-Welch algorithm.
train(x, y, ...)

# S3 method for PHMM
train(
  x,
  y,
  method = "Viterbi",
  seqweights = "Henikoff",
  wfactor = 1,
  k = 5,
  logspace = "autodetect",
  maxiter = 100,
  limit = 1,
  deltaLL = 1e-07,
  pseudocounts = "background",
  gap = "-",
  fixqa = FALSE,
  fixqe = FALSE,
  maxsize = NULL,
  inserts = "map",
  threshold = 0.5,
  lambda = 0,
  alignment = FALSE,
  cores = 1,
  quiet = FALSE,
  ...
)

# S3 method for HMM
train(
  x,
  y,
  method = "Viterbi",
  seqweights = NULL,
  wfactor = 1,
  maxiter = 100,
  deltaLL = 1e-07,
  logspace = "autodetect",
  quiet = FALSE,
  modelend = FALSE,
  pseudocounts = "Laplace",
  ...
)
Value

an object of class "HMM" or "PHMM", depending on the input model x.
Arguments

x: an object of class "HMM" or "PHMM" specifying the initial parameter values.
y: a list of training sequences whose hidden states are unknown. Accepted modes are "character" and "raw" (for "DNAbin" and "AAbin" objects).
...: additional arguments to be passed to "Viterbi" (if method = "Viterbi") or "forward" (if method = "BaumWelch").
method: a character string specifying the iterative model training method to use. Accepted methods are "Viterbi" (the default) and "BaumWelch".
seqweights: either NULL (all sequences are given weights of 1), a numeric vector the same length as y representing the sequence weights used to derive the model, or a character string giving the method to derive the weights from the sequences (see weight).
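For instance, a minimal sketch of passing a custom weights vector (the object names 'model', 'y' and 'trained' are hypothetical; an equal-weight vector behaves like seqweights = NULL):

wts <- rep(1, 10)          # one weight per sequence in a 10-sequence list 'y'
trained <- train(model, y, method = "Viterbi", seqweights = wts)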
wfactor: numeric. The factor to multiply the sequence weights by. Defaults to 1.
k: integer representing the k-mer size to be used in tree-based sequence weighting (if applicable). Defaults to 5. Note that higher values of k may be slow to compute and use excessive memory due to the large number of calculations required.
logspace: logical indicating whether the emission and transition probabilities of x are logged. If logspace = "autodetect" (the default), the function will automatically detect whether the probabilities are logged, returning an error if inconsistencies are found. Note that automatic detection increases the computational overhead; specifying TRUE or FALSE can therefore reduce the running time.
maxiter: the maximum number of EM iterations or Viterbi training iterations to carry out before the cycling process is terminated and the partially trained model is returned. Defaults to 100.
limit: the proportion of alignment rows that must be identical between subsequent iterations for the Viterbi training algorithm to terminate. Defaults to 1.
deltaLL: numeric, the maximum change in log likelihood between EM iterations before the cycling procedure is terminated (signifying model convergence). Defaults to 1E-07. Only applicable if method = "BaumWelch".
pseudocounts: character string, either "background", "Laplace" or "none". Used to account for the possible absence of certain transition and/or emission types in the input sequences. If pseudocounts = "background" (the default for PHMMs), pseudocounts are calculated from the background transition and emission frequencies in the training dataset. If pseudocounts = "Laplace" (the default for standard HMMs), one of each possible transition and emission type is added to the training dataset. If pseudocounts = "none", no pseudocounts are added (not usually recommended, since low-frequency transition/emission types may be excluded from the model). Alternatively, this argument can be a two-element list containing a matrix of transition pseudocounts as its first element and a matrix of emission pseudocounts as its second. If this option is selected, both matrices must have row and column names corresponding with the residues (column names of the emission matrix) and states (row and column names of the transition matrix and row names of the emission matrix). For standard HMMs the first row and column of the transition matrix should be named "Begin".
gap: the character used to represent gaps in the alignment matrix (if applicable). Ignored for "DNAbin" or "AAbin" objects. Defaults to "-" otherwise.
fixqa: logical. Should the background transition probabilities be fixed (TRUE), or allowed to vary between iterations (FALSE)? Defaults to FALSE. Only applicable if method = "Viterbi".
fixqe: logical. Should the background emission probabilities be fixed (TRUE), or allowed to vary between iterations (FALSE)? Defaults to FALSE. Only applicable if method = "Viterbi".
maxsize: integer giving the upper bound on the number of modules in the PHMM. If NULL (the default), no maximum size is enforced.
inserts: character string giving the model construction method by which alignment columns are marked as either match or insert states. Accepted methods include "threshold" (only columns with fewer than a specified proportion of gaps form match states in the model), "map" (the default; match and insert columns are found using the maximum a posteriori method outlined in Durbin et al (1998) chapter 5.7), "inherited" (match and insert columns are inherited from the input alignment), and "none" (all columns are assigned match states in the model). Alternatively, insert columns can be specified manually by providing a logical vector the same length as the number of columns in the alignment, with TRUE for insert columns and FALSE for match states.
threshold: the maximum proportion of gaps for an alignment column to be considered for a match state in the PHMM (defaults to 0.5). Only applicable when inserts = "threshold". Note that the maximum a posteriori method works poorly for alignments with few sequences, so the "threshold" method is automatically used when the number of sequences is less than 5.
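For example, a sketch of tightening the gap threshold so that only columns with at most 30 percent gaps become match states ('x', 'y' and 'trained' are hypothetical names for an initial PHMM, a training list and the result):

trained <- train(x, y, method = "Viterbi",
                 inserts = "threshold", threshold = 0.3)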
lambda: penalty parameter used to favour models with fewer match states. Equivalent to the log of the prior probability of marking each column (Durbin et al 1998, chapter 5.7). Only applicable when inserts = "map".
alignment: logical indicating whether the alignment used to derive the final model (if applicable) should be included as an element of the returned PHMM object. Defaults to FALSE.
cores: integer giving the number of CPUs to parallelize the operation over. Defaults to 1, and reverts to 1 if x is not a list. This argument may alternatively be a 'cluster' object, in which case it is the user's responsibility to close the socket connection at the conclusion of the operation, for example by running parallel::stopCluster(cores). The string "autodetect" is also accepted, in which case the maximum number of cores to use is one less than the total number of cores available. Note that in this case there may be a tradeoff in terms of speed, depending on the number and size of the sequences to be aligned, due to the extra time required to initialize the cluster. Only applicable if x is an object of class "PHMM".
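A sketch of the 'cluster' option, where the caller creates and closes the socket connection ('x', 'y' and 'trained' are hypothetical names):

cl <- parallel::makeCluster(2)          # open a 2-node socket cluster
trained <- train(x, y, method = "Viterbi", cores = cl)
parallel::stopCluster(cl)               # the caller must close the cluster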
quiet: logical indicating whether feedback should be printed to the console.
modelend: logical indicating whether transition probabilities to the end state of the standard hidden Markov model should be modeled (if applicable). Defaults to FALSE.
Author

Shaun Wilkinson
Details

This function optimizes the parameters of a hidden Markov model (object class: "HMM") or profile hidden Markov model (object class: "PHMM") using the methods described in Durbin et al (1998) chapters 3.3 and 6.5, respectively. For standard HMMs, the function assumes the state sequence is unknown (as opposed to the deriveHMM function, which is used when the state sequence is known). For profile HMMs, the input object is generally a list of non-aligned sequences rather than an alignment (for which the derivePHMM function may be more suitable).
This function offers a choice of two model training methods: Viterbi training (also known as the segmental K-means algorithm; Juang & Rabiner 1990) and the Baum-Welch algorithm, a special case of the expectation-maximization (EM) algorithm that iteratively finds the locally (but not necessarily globally) optimal parameters of an HMM or PHMM. The Viterbi training method is generally much faster, particularly for profile HMMs and when the multi-threading option is used (see the "cores" argument). The relative accuracy of the two methods will depend on the nature of the problem, but personal experience suggests they are comparable for training profile HMMs for DNA and amino acid sequences.
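In practice the choice reduces to the method argument; a minimal sketch with a hypothetical model 'x' and training list 'y':

x_vit <- train(x, y, method = "Viterbi")                    # segmental K-means
x_bw  <- train(x, y, method = "BaumWelch", deltaLL = 1e-07) # EM algorithm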
References

Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.
Juang B-H, Rabiner LR (1990) The segmental K-means algorithm for estimating parameters of hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38, 1639-1641.
See Also

deriveHMM and derivePHMM for maximum-likelihood parameter estimation when the training sequence states are known.
Examples

## Baum Welch training for standard HMMs:
## The dishonest casino example from Durbin et al (1998) chapter 3.2
states <- c("Begin", "Fair", "Loaded")
residues <- paste(1:6)
### Define the transition probability matrix
A <- matrix(c(0, 0, 0, 0.99, 0.95, 0.1, 0.01, 0.05, 0.9), nrow = 3)
dimnames(A) <- list(from = states, to = states)
### Define the emission probability matrix
E <- matrix(c(rep(1/6, 6), rep(1/10, 5), 1/2), nrow = 2, byrow = TRUE)
dimnames(E) <- list(states = states[-1], residues = residues)
### Build and plot the HMM object
x <- structure(list(A = A, E = E), class = "HMM")
op <- par(no.readonly = TRUE)
par(mfrow = c(2, 1))
plot(x, main = "Dishonest casino HMM before training")
### Train the model on the casino data using the Baum-Welch algorithm
data(casino)
x <- train(x, list(casino), method = "BaumWelch", deltaLL = 0.001)
plot(x, main = "Dishonest casino HMM after training")
par(op)
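## Viterbi training for profile HMMs (a sketch, not from the original
## documentation; assumes the 'ape' package is installed for the
## woodmouse dataset):
library(ape)
data(woodmouse)             # 15 aligned wood mouse mtDNA sequences (DNAbin)
x <- derivePHMM(woodmouse)  # seed model derived from the alignment
y <- unalign(woodmouse)     # remove gaps to form the training list
x <- train(x, y, method = "Viterbi", maxiter = 20, quiet = TRUE)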