Update model parameters using a list of training sequences, with either the Viterbi training or Baum-Welch algorithm.
train(x, y, ...)

# S3 method for PHMM
train(
  x,
  y,
  method = "Viterbi",
  seqweights = "Henikoff",
  wfactor = 1,
  k = 5,
  logspace = "autodetect",
  maxiter = 100,
  limit = 1,
  deltaLL = 1e-07,
  pseudocounts = "background",
  gap = "-",
  fixqa = FALSE,
  fixqe = FALSE,
  maxsize = NULL,
  inserts = "map",
  threshold = 0.5,
  lambda = 0,
  alignment = FALSE,
  cores = 1,
  quiet = FALSE,
  ...
)

# S3 method for HMM
train(
  x,
  y,
  method = "Viterbi",
  seqweights = NULL,
  wfactor = 1,
  maxiter = 100,
  deltaLL = 1e-07,
  logspace = "autodetect",
  quiet = FALSE,
  modelend = FALSE,
  pseudocounts = "Laplace",
  ...
)
Value

an object of class "HMM" or "PHMM", depending on the input model x.
Arguments

x: an object of class "HMM" or "PHMM" specifying the initial parameter values.
y: a list of training sequences whose hidden states are unknown. Accepted modes are "character" and "raw" (for "DNAbin" and "AAbin" objects).
...: additional arguments to be passed to "Viterbi" (if method = "Viterbi") or "forward" (if method = "BaumWelch").
method: a character string specifying the iterative model training method to use. Accepted methods are "Viterbi" (the default) and "BaumWelch".
seqweights: either NULL (all sequences are given weights of 1), a numeric vector the same length as y representing the sequence weights used to derive the model, or a character string giving the method to derive the weights from the sequences (see weight).
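For instance, a minimal sketch of passing a custom weights vector (the object names 'model', 'y' and 'trained' are hypothetical; an equal-weight vector behaves like seqweights = NULL):

wts <- rep(1, 10)          # one weight per sequence in a 10-sequence list 'y'
trained <- train(model, y, method = "Viterbi", seqweights = wts)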
wfactor: numeric. The factor to multiply the sequence weights by. Defaults to 1.
k: integer representing the k-mer size to be used in tree-based sequence weighting (if applicable). Defaults to 5. Note that higher values of k may be slow to compute and use excessive memory due to the large number of calculations required.
logspace: logical indicating whether the emission and transition probabilities of x are logged. If logspace = "autodetect" (the default), the function will automatically detect whether the probabilities are logged, returning an error if inconsistencies are found. Note that automatic detection increases the computational overhead; specifying TRUE or FALSE can therefore reduce the running time.
maxiter: the maximum number of EM iterations or Viterbi training iterations to carry out before the cycling process is terminated and the partially trained model is returned. Defaults to 100.
limit: the proportion of alignment rows that must be identical between subsequent iterations for the Viterbi training algorithm to terminate. Defaults to 1.
deltaLL: numeric, the maximum change in log likelihood between EM iterations before the cycling procedure is terminated (signifying model convergence). Defaults to 1E-07. Only applicable if method = "BaumWelch".
pseudocounts: character string, either "background", "Laplace" or "none". Used to account for the possible absence of certain transition and/or emission types in the input sequences. If pseudocounts = "background" (the default for PHMMs), pseudocounts are calculated from the background transition and emission frequencies in the training dataset. If pseudocounts = "Laplace" (the default for standard HMMs), one of each possible transition and emission type is added to the training dataset. If pseudocounts = "none", no pseudocounts are added (not usually recommended, since low-frequency transition/emission types may be excluded from the model). Alternatively, this argument can be a two-element list containing a matrix of transition pseudocounts as its first element and a matrix of emission pseudocounts as its second. If this option is selected, both matrices must have row and column names corresponding with the residues (column names of the emission matrix) and states (row and column names of the transition matrix and row names of the emission matrix). For standard HMMs the first row and column of the transition matrix should be named "Begin".
gap: the character used to represent gaps in the alignment matrix (if applicable). Ignored for "DNAbin" or "AAbin" objects. Defaults to "-" otherwise.
fixqa: logical. Should the background transition probabilities be fixed (TRUE), or allowed to vary between iterations (FALSE)? Defaults to FALSE. Only applicable if method = "Viterbi".
fixqe: logical. Should the background emission probabilities be fixed (TRUE), or allowed to vary between iterations (FALSE)? Defaults to FALSE. Only applicable if method = "Viterbi".
maxsize: integer giving the upper bound on the number of modules in the PHMM. If NULL (the default), no maximum size is enforced.
inserts: character string giving the model construction method by which alignment columns are marked as either match or insert states. Accepted methods include "threshold" (only columns with fewer than a specified proportion of gaps form match states in the model), "map" (the default; match and insert columns are found using the maximum a posteriori method outlined in Durbin et al (1998) chapter 5.7), "inherited" (match and insert columns are inherited from the input alignment), and "none" (all columns are assigned match states in the model). Alternatively, insert columns can be specified manually by providing a logical vector the same length as the number of columns in the alignment, with TRUE for insert columns and FALSE for match states.
threshold: the maximum proportion of gaps for an alignment column to be considered for a match state in the PHMM (defaults to 0.5). Only applicable when inserts = "threshold". Note that the maximum a posteriori method works poorly for alignments with few sequences, so the "threshold" method is automatically used when the number of sequences is less than 5.
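For example, a sketch of tightening the gap threshold so that only columns with at most 30 percent gaps become match states ('x', 'y' and 'trained' are hypothetical names for an initial PHMM, a training list and the result):

trained <- train(x, y, method = "Viterbi",
                 inserts = "threshold", threshold = 0.3)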
lambda: penalty parameter used to favour models with fewer match states. Equivalent to the log of the prior probability of marking each column (Durbin et al 1998, chapter 5.7). Only applicable when inserts = "map".
alignment: logical indicating whether the alignment used to derive the final model (if applicable) should be included as an element of the returned PHMM object. Defaults to FALSE.
cores: integer giving the number of CPUs to parallelize the operation over. Defaults to 1, and reverts to 1 if x is not a list. This argument may alternatively be a 'cluster' object, in which case it is the user's responsibility to close the socket connection at the conclusion of the operation, for example by running parallel::stopCluster(cores). The string "autodetect" is also accepted, in which case the maximum number of cores to use is one less than the total number of cores available. Note that in this case there may be a tradeoff in terms of speed, depending on the number and size of the sequences to be aligned, due to the extra time required to initialize the cluster. Only applicable if x is an object of class "PHMM".
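A sketch of the 'cluster' option, where the caller creates and closes the socket connection ('x', 'y' and 'trained' are hypothetical names):

cl <- parallel::makeCluster(2)          # open a 2-node socket cluster
trained <- train(x, y, method = "Viterbi", cores = cl)
parallel::stopCluster(cl)               # the caller must close the cluster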
quiet: logical indicating whether feedback should be printed to the console.
modelend: logical indicating whether transition probabilities to the end state of the standard hidden Markov model should be modeled (if applicable). Defaults to FALSE.
Author

Shaun Wilkinson
Details

This function optimizes the parameters of a hidden Markov model (object class: "HMM") or profile hidden Markov model (object class: "PHMM") using the methods described in Durbin et al (1998) chapters 3.3 and 6.5, respectively. For standard HMMs, the function assumes the state sequence is unknown (as opposed to the deriveHMM function, which is used when the state sequence is known). For profile HMMs, the input object is generally a list of non-aligned sequences rather than an alignment (for which the derivePHMM function may be more suitable).
This function offers a choice of two model training methods: Viterbi training (also known as the segmental K-means algorithm; Juang & Rabiner 1990) and the Baum-Welch algorithm, a special case of the expectation-maximization (EM) algorithm that iteratively finds the locally (but not necessarily globally) optimal parameters of an HMM or PHMM. The Viterbi training method is generally much faster, particularly for profile HMMs and when the multi-threading option is used (see the "cores" argument). The relative accuracy of the two methods will depend on the nature of the problem, but personal experience suggests they are comparable for training profile HMMs for DNA and amino acid sequences.
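In practice the choice reduces to the method argument; a minimal sketch with a hypothetical model 'x' and training list 'y':

x_vit <- train(x, y, method = "Viterbi")                    # segmental K-means
x_bw  <- train(x, y, method = "BaumWelch", deltaLL = 1e-07) # EM algorithm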
References

Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.
Juang B-H, Rabiner LR (1990) The segmental K-means algorithm for estimating parameters of hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38, 1639-1641.
See Also

deriveHMM and derivePHMM for maximum-likelihood parameter estimation when the training sequence states are known.
Examples

## Baum Welch training for standard HMMs:
## The dishonest casino example from Durbin et al (1998) chapter 3.2
states <- c("Begin", "Fair", "Loaded")
residues <- paste(1:6)
### Define the transition probability matrix
A <- matrix(c(0, 0, 0, 0.99, 0.95, 0.1, 0.01, 0.05, 0.9), nrow = 3)
dimnames(A) <- list(from = states, to = states)
### Define the emission probability matrix
E <- matrix(c(rep(1/6, 6), rep(1/10, 5), 1/2), nrow = 2, byrow = TRUE)
dimnames(E) <- list(states = states[-1], residues = residues)
### Build and plot the HMM object
x <- structure(list(A = A, E = E), class = "HMM")
op <- par(no.readonly = TRUE)
par(mfrow = c(2, 1))
plot(x, main = "Dishonest casino HMM before training")
### Train the model on the casino data using the Baum-Welch algorithm
data(casino)
x <- train(x, list(casino), method = "BaumWelch", deltaLL = 0.001)
plot(x, main = "Dishonest casino HMM after training")
par(op)
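## Viterbi training for profile HMMs (a sketch, not from the original
## documentation; assumes the 'ape' package is installed for the
## woodmouse dataset):
library(ape)
data(woodmouse)             # 15 aligned wood mouse mtDNA sequences (DNAbin)
x <- derivePHMM(woodmouse)  # seed model derived from the alignment
y <- unalign(woodmouse)     # remove gaps to form the training list
x <- train(x, y, method = "Viterbi", maxiter = 20, quiet = TRUE)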