derivePHMM: Derive a profile hidden Markov model from sequences.

Description

derivePHMM generates a profile HMM from a given multiple sequence alignment or a list of unaligned sequences.

Usage

derivePHMM(x, ...)
# S3 method for DNAbin
derivePHMM(
  x,
  seqweights = "Henikoff",
  wfactor = 1,
  k = 5,
  residues = NULL,
  gap = "-",
  endchar = "?",
  pseudocounts = "background",
  logspace = TRUE,
  qa = NULL,
  qe = NULL,
  maxsize = NULL,
  inserts = "map",
  threshold = 0.5,
  lambda = 0,
  DI = FALSE,
  ID = FALSE,
  omit.endgaps = FALSE,
  name = NULL,
  description = NULL,
  compo = FALSE,
  consensus = FALSE,
  alignment = FALSE,
  progressive = FALSE,
  seeds = NULL,
  refine = "Viterbi",
  maxiter = 100,
  deltaLL = 1e-07,
  cpp = TRUE,
  cores = 1,
  quiet = FALSE,
  ...
)
# S3 method for AAbin
derivePHMM(
  x,
  seqweights = "Henikoff",
  wfactor = 1,
  k = 5,
  residues = NULL,
  gap = "-",
  endchar = "?",
  pseudocounts = "background",
  logspace = TRUE,
  qa = NULL,
  qe = NULL,
  maxsize = NULL,
  inserts = "map",
  threshold = 0.5,
  lambda = 0,
  DI = FALSE,
  ID = FALSE,
  omit.endgaps = FALSE,
  name = NULL,
  description = NULL,
  compo = FALSE,
  consensus = FALSE,
  alignment = FALSE,
  progressive = FALSE,
  seeds = NULL,
  refine = "Viterbi",
  maxiter = 100,
  deltaLL = 1e-07,
  cpp = TRUE,
  cores = 1,
  quiet = FALSE,
  ...
)
# S3 method for list
derivePHMM(
  x,
  progressive = FALSE,
  seeds = NULL,
  refine = "Viterbi",
  maxiter = 100,
  deltaLL = 1e-07,
  seqweights = "Henikoff",
  wfactor = 1,
  k = 5,
  residues = NULL,
  gap = "-",
  pseudocounts = "background",
  logspace = TRUE,
  qa = NULL,
  qe = NULL,
  maxsize = NULL,
  inserts = "map",
  lambda = 0,
  DI = FALSE,
  ID = FALSE,
  threshold = 0.5,
  omit.endgaps = FALSE,
  name = NULL,
  description = NULL,
  compo = FALSE,
  consensus = FALSE,
  alignment = FALSE,
  cpp = TRUE,
  cores = 1,
  quiet = FALSE,
  ...
)
# S3 method for default
derivePHMM(
  x,
  seqweights = "Henikoff",
  wfactor = 1,
  k = 5,
  residues = NULL,
  gap = "-",
  endchar = "?",
  pseudocounts = "background",
  logspace = TRUE,
  qa = NULL,
  qe = NULL,
  maxsize = NULL,
  inserts = "map",
  lambda = 0,
  threshold = 0.5,
  DI = FALSE,
  ID = FALSE,
  omit.endgaps = FALSE,
  name = NULL,
  description = NULL,
  compo = FALSE,
  consensus = FALSE,
  alignment = FALSE,
  cpp = TRUE,
  quiet = FALSE,
  ...
)

Value

an object of class "PHMM"

Arguments

x: a matrix of aligned sequences or a list of unaligned sequences. Accepted modes are "character" and "raw" (for "DNAbin" and "AAbin" objects).
...: aditional arguments to be passed to "Viterbi" (if refine = "Viterbi") or "forward" (if refine = "BaumWelch").
seqweights: either NULL (all sequences are given weights of 1), a numeric vector the same length as x representing the sequence weights used to derive the model, or a character string giving the method to derive the weights from the sequences (see weight).
wfactor: numeric. The factor to multiply the sequence weights by. Defaults to 1.
k: integer representing the k-mer size to be used in tree-based sequence weighting (if applicable). Defaults to 5. Note that higher values of k may be slow to compute and use excessive memory due to the large numbers of calculations required.
residues: either NULL (default; emitted residues are automatically detected from the sequences), a case sensitive character vector specifying the residue alphabet, or one of the character strings "RNA", "DNA", "AA", "AMINO". Note that the default option can be slow for large lists of character vectors. Furthermore, the default setting residues = NULL will not detect rare residues that are not present in the sequences, and thus will not assign them emission probabilities in the model. Specifying the residue alphabet is therefore recommended unless x is a "DNAbin" or "AAbin" object.
gap: the character used to represent gaps in the alignment matrix. Ignored for "DNAbin" or "AAbin" objects. Defaults to "-" otherwise.
endchar: the character used to represent unknown residues in the alignment matrix (if applicable). Ignored for "DNAbin" or "AAbin" objects. Defaults to "?" otherwise.
pseudocounts: character string, either "background", Laplace" or "none". Used to account for the possible absence of certain transition and/or emission types in the input sequences. If pseudocounts = "background" (default), pseudocounts are calculated from the background transition and emission frequencies in the sequences. If pseudocounts = "Laplace" one of each possible transition and emission type is added to the transition and emission counts. If pseudocounts = "none" no pseudocounts are added (not generally recommended, since low frequency transition/emission types may be excluded from the model). Alternatively this argument can be a two-element list containing a matrix of transition pseudocounts as its first element and a matrix of emission pseudocounts as its second.
logspace: logical indicating whether the emission and transition probabilities in the returned model should be logged. Defaults to TRUE.
qa: an optional named 9-element vector of background transition probabilities with dimnames(qa) = c("DD", "DM", "DI", "MD", "MM", "MI", "ID", "IM", "II"), where M, I and D represent match, insert and delete states, respectively. If NULL, background transition probabilities are estimated from the sequences.
qe: an optional named vector of background emission probabilities the same length as the residue alphabet (i.e. 4 for nucleotides and 20 for amino acids) and with corresponding names (i.e. c("A", "T", "G", "C") for DNA). If qe = NULL, background emission probabilities are automatically derived from the sequences.
maxsize: integer giving the upper bound on the number of modules in the PHMM. If NULL (default) no maximum size is enforced.
inserts: character string giving the model construction method by which alignment columns are marked as either match or insert states. Accepted methods include "threshold" (only columns with fewer than a specified proportion of gaps form match states in the model), "map" (default; match and insert columns are found using the maximum a posteriori method outlined in Durbin et al (1998) chapter 5.7), "inherited" (match and insert columns are inherited from the input alignment), and "none" (all columns are assigned match states in the model). Alternatively, insert columns can be specified manually by providing a logical vector the same length as the number of columns in the alignment, with TRUE for insert columns and FALSE for match states.
threshold: the maximum proportion of gaps for an alignment column to be considered for a match state in the PHMM (defaults to 0.5). Only applicable when inserts = "threshold". Note that the maximum a posteriori method works poorly for alignments with few sequences, so the 'threshold' method is automatically used when the number of sequences is less than 5.
lambda: penalty parameter used to favour models with fewer match states. Equivalent to the log of the prior probability of marking each column (Durbin et al 1998, chapter 5.7). Only applicable when inserts = "map".
DI: logical indicating whether delete-insert transitions should be allowed in the profile hidden Markov model (if applicable). Defaults to FALSE.
ID: logical indicating whether insert-delete transitions should be allowed in the profile hidden Markov model (if applicable). Defaults to FALSE.
omit.endgaps: logical. Should gap characters at each end of the sequences be ignored when deriving the transition probabilities of the model? Defaults to FALSE. Set to TRUE if x is not a strict global alignment (i.e. if the alignment contains partial sequences with missing sections represented with gap characters).
name: an optional character string. The name of the new profile hidden Markov model.
description: an optional character string. The description of the new profile hidden Markov model.
compo: logical indicating whether the average emission probabilities of the model modules should be returned with the PHMM object.
consensus: placeholder. Consensus sequences will be available in a future version.
alignment: logical indicating whether the alignment used to derive the final model (if applicable) should be included as an element of the returned PHMM object. Defaults to FALSE.
progressive: logical indicating whether the alignment used to derive the initial model parameters should be built progressively (assuming input is a list of unaligned sequences, ignored otherwise). Defaults to FALSE, in which case the longest sequence or sequences are used (faster, but possibly less accurate).
seeds: optional integer vector indicating which sequences should be used as seeds for building the guide tree for the progressive alignment (assuming input is a list of unaligned sequences, and progressive = TRUE, ignored otherwise). Defaults to NULL, in which a set of log(n, 2)^2 non-identical sequences are chosen from the list of sequences by k-means clustering.
refine: the method used to iteratively refine the model parameters following the initial progressive alignment and model derivation step. Current supported options are "Viterbi" (Viterbi training; the default option), "BaumWelch" (a modified version of the Expectation-Maximization algorithm), and "none" (skips the model refinement step).
maxiter: the maximum number of EM iterations or Viterbi training iterations to carry out before the cycling process is terminated and the partially trained model is returned. Defaults to 100.
deltaLL: numeric, the maximum change in log likelihood between EM iterations before the cycling procedure is terminated (signifying model convergence). Defaults to 1E-07. Only applicable if method = "BaumWelch".
cpp: logical, indicates whether the dynamic programming matrix should be filled using compiled C++ functions (default; many times faster). The FALSE option is primarily retained for bug fixing and experimentation.
cores: integer giving the number of CPUs to parallelize the operation over. Defaults to 1, and reverts to 1 if x is not a list. This argument may alternatively be a 'cluster' object, in which case it is the user's responsibility to close the socket connection at the conclusion of the operation, for example by running parallel::stopCluster(cores). The string 'autodetect' is also accepted, in which case the maximum number of cores to use is one less than the total number of cores available. Note that in this case there may be a tradeoff in terms of speed depending on the number and size of sequences to be aligned, due to the extra time required to initialize the cluster.
quiet: logical indicating whether feedback should be printed to the console.

Author

Shaun Wilkinson

Details

This function performs a similar operation to the hmmbuild function in the HMMER package. If the primary input argument is an alignment, the function creates a profile hidden Markov model (object class:"PHMM") using the method described in Durbin et al (1998) chapter 5.3. Alternatively, if a list of non-aligned sequences is passed, the sequences are first aligned using the align function before being used to derive the model.

The function outputs an object of class "PHMM", which is a list consisting of emission and transition probability matrices (elements named "E" and "A"), vectors of non-position-specific background emission and transition probabilities ("qe" and "qa", respectively) and other model metadata including "name", "description", "size" (the number of modules in the model), and "alphabet" (the set of symbols/residues emitted by the model).

References

Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.

Examples

Run this code

## Small globin alignment data from Durbin et al (1998) Figure 5.3
data(globins)
## derive a profile hidden Markov model from the alignment
globins.PHMM <- derivePHMM(globins, residues = "AMINO", seqweights = NULL)
plot(globins.PHMM, main = "Profile HMM for small globin alignment")
##
## derive a profle HMM from the woodmouse dataset in the
## ape package and plot the first 5 modules
library(ape)
data(woodmouse)
woodmouse.PHMM <- derivePHMM(woodmouse)
plot(woodmouse.PHMM, from = 0, to = 5, main = "Partial woodmouse profile HMM")

Run the code above in your browser using DataLab