Effects a maximum likelihood fit of a hidden Markov model to discrete data where the observations come from one of a number of finite discrete distributions, depending on the (hidden) state of the Markov chain. These distributions (the “emission probabilities”) are specified non-parametrically. The observations may be univariate, independent bivariate, or dependent bivariate. By default this function uses the EM algorithm. In the univariate setting it may alternatively use a “brute force” method.
hmm(y, yval=NULL, par0=NULL, K=NULL, rand.start=NULL,
method=c("EM","bf","LM","SD"), hglmethod=c("fortran","oraw","raw"),
optimiser=c("nlm","optim"), optimMethod=NULL, stationary=cis,
mixture=FALSE, cis=TRUE, indep=NULL, tolerance=1e-4, digits=NULL,
verbose=FALSE, itmax=200, crit=c("PCLL","L2","Linf","ABSGRD"),
keep.y=FALSE, keep.X=keep.y, newstyle=TRUE, X=NULL,
addIntercept=TRUE, lmc=10, hessian=FALSE, ...)
y: A one or two column matrix of discrete data, or a list of such
matrices; missing values are allowed. If y is a vector,
or list of vectors (of discrete data) these vectors are coerced
to one column matrices.
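For example (a sketch only; the data values below are arbitrary) each of the following forms of y would be acceptable in the univariate setting:
# A vector, a one-column matrix, or a list of one-column matrices
# (sketch only; the values are arbitrary).
y.vec  <- sample(LETTERS[1:4], 100, replace=TRUE)
y.mat  <- matrix(y.vec, ncol=1)
y.list <- list(matrix(y.vec[1:60], ncol=1), matrix(y.vec[61:100], ncol=1))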
yval: A vector (of length m, say) of possible values for the
data, or a list of two such vectors (of lengths m1 and
m2, say), one for each of the two variates in the bivariate
setting. These vectors default to the sorted unique values of
the respective variates as provided in y. If yval
is supplied and any value of y does not match some value
of yval, then an error is thrown.
par0: An optional (named) list of starting values for the
parameters of the model, with components tpm (transition
probability matrix), optionally ispd (initial state
probability distribution) and Rho. The object Rho
specifies the probability that the observations take on each of
the possible values of the variate or variates, given the state
of the hidden Markov chain. See Details. Note that in
the case of independent bivariate data Rho is a list
of two matrices. These matrices may (and in general will)
have different row dimensions, but must have identical column
dimensions (equal to K, the number of states; see below).
If the model is stationary (i.e. if stationary is
TRUE) then you should almost surely not specify the
ispd component of par0. If you do specify it,
it really only makes sense to specify it to be the stationary
distribution determined by tpm, and doing so is a waste of
time, since that is what the code will take ispd to be if
you leave it unspecified.
If par0 is not specified, starting values are created by
the function init.all().
K: The number of states in the hidden Markov chain; if par0
is not specified K MUST be; if par0 is
specified, K is ignored.
Note that K=1 is acceptable; if K is 1 then
all observations are treated as being independent and the
non-parametric estimate of the distribution of the observations
is calculated in the “obvious” way.
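Presumably the “obvious” estimate is simply the vector of relative frequencies of the observed values; a sketch (not code from the package):
# Non-parametric estimate of the emission distribution when K=1:
# the relative frequencies of the non-missing observed values.
y <- sample(LETTERS[1:4], 200, replace=TRUE)
Rho.hat <- table(y)/sum(!is.na(y))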
rand.start: A list consisting of two logical scalars which must be named
tpm and Rho. If tpm is TRUE then the function
init.all() chooses entries for the starting value of tpm
at random; likewise for Rho. This argument defaults to
list(tpm=FALSE,Rho=FALSE).
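For example (a sketch only; y is assumed to be data in one of the forms described above):
# Request randomised starting values for both tpm and Rho.
fit <- hmm(y, K=2, rand.start=list(tpm=TRUE, Rho=TRUE), itmax=10)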
method: Character string, either "bf", "EM",
"LM" or "SD" (i.e. use numerical maximisation
via either nlm() or optim(), the EM algorithm, the
Levenberg-Marquardt algorithm, or the method of steepest descent).
May be abbreviated. Currently the "bf", "LM" and
"SD" methods can be used only in the univariate setting,
handle only stationary models (see below) and do not do mixtures.
hglmethod: Character string; one of "fortran", "oraw" or
"raw". May be abbreviated. This argument determines the
procedure by which the hessian, gradient and log likelihood of
the model and data are calculated. If this argument is equal
to "fortran" (the default) then (obviously!) dynamically
loaded fortran subroutines are used. The other two possibilities
effect the calculations in raw R; "oraw" (“o”
for “original”) uses code that is essentially a direct
transcription of the fortran code, do-loops being replaced by
for-loops. With method "raw" the for-loops are eliminated
and matrix-vector calculations are applied. The "oraw"
method is about 25 times slower than the "fortran"
method and the "raw" method is more than 30 times slower.
The “raw” methods are present mainly for debugging purposes
and would not usually be used in practice. This argument is
used only if the method is "LM" or "SD"
(and is involved only peripherally in the latter instance).
It is ignored otherwise.
optimiser: Character string specifying the optimiser to use when the
"bf" method of optimisation is chosen. It should be
one of "nlm" or "optim", and may be abbreviated.
Ignored unless method="bf".
optimMethod: Character string specifying the optimisation method to be used by
optim(). Should be one of "Nelder-Mead",
"BFGS", "CG", "L-BFGS-B", "SANN", or
"Brent". Ignored if the method is not "bf"
or if the optimiser is not "optim".
stationary: Logical scalar. If TRUE then the model is fitted under
the stationarity assumption, i.e. that the Markov chain was in
steady state at the time that observations commenced. In this
case the initial state probability distribution is estimated
as the stationary distribution determined by the (estimated)
transition probability matrix. Otherwise if cis (see
below) is TRUE the initial state probability distribution
is estimated as the mean of the vectors of conditional
probabilities of the states, given the observation sequences,
at time t=1. If stationary is TRUE and
cis is FALSE an error is thrown. Currently if
the method is "bf", "LM" or "SD", and
stationary is FALSE, then an error is thrown.
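For reference, the stationary distribution determined by a transition probability matrix can be computed along the following lines (a sketch only; the package performs this calculation internally):
# Stationary distribution of a transition probability matrix P,
# obtained by solving pi %*% P = pi subject to sum(pi) = 1.
statdist <- function(P) {
    K <- nrow(P)
    A <- rbind(t(diag(K) - P), rep(1, K))
    b <- c(rep(0, K), 1)
    qr.solve(A, b)
}
statdist(matrix(c(0.75, 0.25, 0.25, 0.75), 2, 2, byrow=TRUE))  # c(0.5, 0.5)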
mixture: A logical scalar; if TRUE then a mixture model (all rows of the
transition probability matrix are identical) is fitted rather
than a general hidden Markov model. Currently an error is
thrown if mixture=TRUE and the method is
"bf", "LM" or "SD".
cis: A logical scalar specifying whether there should be a constant
initial state probability distribution, i.e. one that is common to
all observation sequences. If stationary is FALSE and cis
is FALSE then the initial state probability distribution
for a given observation sequence is equal to 1 where the (first)
maximum of the vector of conditional probabilities of the states,
given the observation sequences, at time t=1, occurs,
and is 0 elsewhere. If stationary is TRUE and
cis is FALSE an error is given.
indep: Logical scalar. Should the bivariate model be fitted under the
assumption that the two variables are (conditionally) independent,
given the state? If this argument is left as NULL its
value is inferred from the structure of Rho in par0
if the latter is supplied. If the data are bivariate and neither
indep nor par0 is supplied, then an error is given.
If the data are bivariate and if the value of indep
is inconsistent with the structure of par0$Rho then an
error is given. If the data are univariate then indep
is ignored.
tolerance: If the value of the quantity used for the stopping criterion
is less than tolerance then the algorithm is considered to
have converged. Ignored if method="bf". Defaults to
1e-4.
digits: Integer scalar. The number of digits to which to print out
“progress reports” (when verbose is TRUE).
There is a “sensible” default (calculated from
tolerance). Not used if the method is "bf".
verbose: A logical scalar determining whether to print out details of
the progress of the algorithm. If the method is "EM",
"LM" or "SD" then when verbose is TRUE
information about the convergence criteria is printed out at
every step that the algorithm takes. If method="bf" then
the value of verbose determines the value of the argument
print.level of nlm() or the value of the
argument trace of optim(). In the first
case, if verbose is TRUE then print.level
is set to 2, otherwise it is set to 0. In the second case,
if verbose is TRUE then trace is set to 6,
otherwise it is set to 0.
itmax: When the method is "EM", "LM" or "SD"
this is the maximum number of steps that the algorithm takes.
If the convergence criterion has not been met by the time
itmax steps have been performed, a warning message
is printed out, and the function stops. A value is returned by
the function anyway, with the logical component "converged" set
to FALSE. When method="bf" the itmax argument
is passed to nlm() as the value of iterlim
or to optim() as the value of maxit. If the
(somewhat obscure) convergence criteria of nlm() or
optim() have not been met by the time itmax
“iterations” have been performed, the algorithm ceases.
In this case, if nlm() is used, the value of code
in the object returned is set equal to 4, and if optim()
is used then the value of convergence returned is set
equal to 1. Note that the value of code, respectively
convergence is returned as the converged component
of the object returned by hmm(). A value of 1 indicates
successful completion of the nlm() procedure. A value of
0 indicates successful completion of the optim() procedure.
crit: The name of the stopping criterion used. When method="EM"
it must be one of "PCLL" (percent change in log-likelihood;
the default), "L2" (L-2 norm, i.e. square root of sum of
squares of change in coefficients), or "Linf" (L-infinity
norm, i.e. maximum absolute value of change in coefficients).
When method="LM" or method="SD" there is a fourth
possibility, namely "ABSGRD" the (maximum) absolute value
of the gradient. It may not be advisable to use this criterion
in the current context (i.e. that of discrete non-parametric
distributions). See Warnings. This argument defaults
to "PCLL". It is ignored if method="bf".
(The nlm() and optim() functions have their own
obscure stopping criteria.)
keep.y: Logical scalar; should the observations y be returned as
a component of the value of this function?
keep.X: Logical scalar; should the predictors X be returned as
a component of the value of this function? Note that the
value of keep.X will be silently set equal to FALSE
unless it actually “makes sense” to keep X. I.e.
unless the observations are univariate, newstyle
is TRUE, and X is actually supplied, i.e. is
not NULL.
newstyle: Logical scalar; if TRUE then the “new style” of
parameterising Rho (see Details) is used, in the
univariate setting. Note that if X is not NULL
or if the method is not "EM" then newstyle
cannot be set to FALSE; if it is, then an error
is thrown.
X: An optional numeric matrix, or a list of such
matrices, of predictors. The use of such predictors
is (currently, at least) applicable only in the univariate
emissions setting, and then only if Rho is supplied in
the “newstyle” format. If X is a list it must be
of the same length as y and all entries of this list must
have the same number of columns. If the columns of any entry
of the list are named, then they must be named for all
entries, and the column names must be the same for all
entries. The number of rows of each entry must be equal to the
length of the corresponding entry of y. If X is
a matrix then y should be a vector or one-column matrix
(or a list with a single entry equal to such).
There may be at most one constant column in X or the
components thereof. If there are any constant columns
there must be precisely one (in all components of X),
it must be the first column and all of its entries must be equal
to 1. If the columns have names, the names of this first
column must be "Intercept".
Note that X (or its entries) must be a numeric
matrix (or must be matrices) --- no data frames! Factor
predictors are not permitted. It may be possible to use factor
predictors by supplying X or its entries as the output of
model.matrix(); this will depend on circumstances.
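For example, a factor predictor might be converted into an acceptable numeric matrix along the following lines (a sketch only; season is a hypothetical factor of the same length as the observation sequence, and whether the resulting fit is sensible will, as noted above, depend on circumstances):
# Build X from a hypothetical factor predictor via model.matrix().
season <- factor(rep(c("summer", "autumn", "winter", "spring"), length.out=240))
X <- model.matrix(~ season)    # first column is the intercept
colnames(X)[1] <- "Intercept"  # the column name required by hmm()
# fit <- hmm(y, K=2, X=X, addIntercept=FALSE, itmax=10)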
addIntercept: Logical scalar. Should a column of ones, corresponding to an
intercept term, be prepended to each of the matrices in the
list X? (The user should remember that this argument
defaults to TRUE.)
lmc: Numeric scalar. The (initial) “Levenberg-Marquardt
correction” parameter. Used only if method="LM",
otherwise ignored.
hessian: Logical scalar. Should the hessian matrix be
returned? This argument is relevant only if method="bf"
(in which case it is passed along to hmmNumOpt()) and is
ignored otherwise. This argument should be set to TRUE
only if you really want the hessian matrix. Setting it
to TRUE causes a substantial delay between the time when
hmm() finishes its iterations and when it actually returns
a value.
...: Additional arguments passed to hmmNumOpt().
There is one noteworthy argument useAnalGrad which is used
“directly” by hmmNumOpt(). This argument is a
logical scalar and if it is TRUE then calls to nlm()
or optim() are structured so that an analytic calculation
of the gradient vector (implemented by the internal function
get.gl()) is applied. If it is FALSE then finite
difference methods are used to calculate the gradient vector.
If this argument is not specified it defaults to FALSE.
Note that the name of this argument cannot be abbreviated.
Other “additional arguments” may be supplied for the
control of nlm() and are passed on appropriately
to nlm(). These are used only if method="bf"
and if optimiser="nlm". These “…” arguments
might typically include gradtol, stepmax and
steptol. They should NOT include print.level
or iterlim. The former argument is automatically passed
to nlm() as 0 if verbose is FALSE
and as 2 if verbose is TRUE. The latter
argument is automatically passed to nlm() with the value
of itmax.
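For example, convergence controls might be passed to nlm() in this way (a sketch only; y is assumed to be data of an appropriate form):
# Supply gradtol and stepmax to nlm() via the "..." argument.
fit <- hmm(y, K=2, method="bf", optimiser="nlm",
           gradtol=1e-6, stepmax=10, itmax=50)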
A list with components:
The fitted value of the matrix, data frame, list of two
matrices, or array Rho specifying the distributions of
the observations (the “emission” probabilities).
The fitted value of the transition probability matrix tpm.
The fitted initial state probability distribution, or a matrix of
(trivial or “deterministic”) initial state probability
distributions, one (column) of ispd for each observation
sequence.
If stationary is TRUE then ispd is assumed
to be the (unique) stationary distribution for the chain,
and thereby determined by the transition probability matrix
tpm. If stationary is FALSE and cis
is TRUE then ispd is estimated as the mean of the
vectors of conditional probabilities of the states, given the
observation sequences, at time t=1.
If cis is FALSE then ispd is a matrix
whose columns are trivial probability vectors, as described
above.
The final (maximal, we hope!) value of the log likelihood, as determined by the maximisation procedure.
The gradient of the log likelihood. Present only if the
method is "LM" or "bf" and in the latter
case then only if the optimiser is nlm().
The hessian of the log likelihood. Present only if the
method is "LM" or "bf".
A vector of the (final) values of the stopping criteria, with
names "PCLL", "L2", "Linf" unless the method
is "LM" or "SD" in which case this vector has a
fourth entry named "ABSGRD".
The starting values used by the algorithms. Either the argument
par0, or a similar object with either or both components
(tpm and Rho) being created by the function init.all().
The number of parameters in the fitted model. Equal to
nispar + ntpmpar + nrhopar where (1) nispar is
0 if stationary is TRUE and is K-1
otherwise; (2) ntpmpar is K*(K-1); and (3) nrhopar
is
K*(nrow(Rho)-1) for univariate models with
newstyle equal to FALSE
(nrow(Rho) - K)*(ncol(Rho)-2) for univariate models
with newstyle equal to TRUE
K*(sum(sapply(Rho,nrow))-K) for bivariate independent models
prod(dim(Rho))-K for bivariate dependent models.
Numeric scalar. The number by which npar is multiplied
to form the BIC criterion. It is essentially the log
of the number of observations. See the code of hmm()
for details.
A logical scalar indicating whether the algorithm converged.
If the EM, LM or steepest descent algorithm was used it simply
indicates whether the stopping criterion was met before
the maximum number (itmax) of steps was exceeded.
If method="bf" then converged is based on the
code component of the object returned by the optimiser
when nlm() was used, or on the convergence
component when optim() was used. In these
cases converged has an attribute (code
or convergence respectively) giving the (integer) value
of the relevant component.
Note that in the nlm() case a value of code
equal to 2 indicates “probable” convergence, and a value
of 3 indicates “possible” convergence. However in this
context converged is set equal to TRUE only
if code is 1.
The number of steps performed by the algorithm if the method
was "EM", "LM" or "SD". The value of
nstep is set equal to the iterations component of
the value returned by nlm() if method="bf".
The number of EM steps that were taken before the method was
switched from "EM" to "bf" or to "LM".
Present only in values returned under the "bf" or
"LM" methods after a switch from "EM" and is
equal to 0 if either of these methods was specified in
the initial call (rather than arising as the result of a switch).
Integer vector of the lengths of the observation sequences (number of rows if the observations are in the form of one or two column matrices).
A real number between 0 and 1, or a pair (vector of length 2) of such numbers. Each number is the fraction of missing values in the corresponding component of the observations.
A (slightly tidied up) version of the observations (argument
y). Present only if keep.y is TRUE.
A (slightly tidied up) version of the predictor matrix or
list of predictor matrices (argument
X). Present only if X is supplied, is an
appropriate argument, and if keep.X is TRUE.
Character string; "univar" if the data were univariate,
"bivar" if they were bivariate.
Logical scalar; TRUE if the (original) data were numeric,
FALSE otherwise.
The value of AIC = -2*log.like + 2*npar for the fitted
model.
The value of BIC = -2*log.like + log(nobs)*npar for the fitted
model. In the foregoing nobs is the number of observations;
this is the number of non-missing values in unlist(y)
in the univariate setting and one half of this number in the
bivariate setting. (See the sketch following the list of components.)
A list of argument values supplied. This component is
returned in the interest of making results reproducible.
It is also needed to facilitate the updating of a model
via the update method for the class hmm.discnp,
update.hmm.discnp().
It has components:
newstyle
method
optimiser
optimMethod
stationary
mixture
cis
tolerance
itmax
crit
addIntercept
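The foregoing information criteria can be checked directly from a fitted object; the following sketch assumes that fit was produced by hmm() with keep.y=TRUE and that the relevant components are named as the formulas above suggest (log.like, npar and y):
# Recompute AIC and BIC from the returned components (univariate setting).
nobs <- sum(!is.na(unlist(fit$y)))
c(AIC = -2*fit$log.like + 2*fit$npar,
  BIC = -2*fit$log.like + log(nobs)*fit$npar)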
The ordering of the (hidden) states can be arbitrary. What the estimation procedure decides to call “state 1” may not be what you think of as being state number 1. The ordering of the states will be affected by the starting values used.
Some experiences with using the "ABSGRD" stopping
criterion indicate that it may be problematic in the context of
discrete non-parametric distributions. For example a value of
1854.955 was returned after 200 LM steps in one (non-convergent,
of course!) attempt at fitting a model. The stopping criterion
"PCLL" in this example was took the “reasonable”
value of 0.03193748 when iterations ceased.
The package used to require the argument y to
be a matrix in the case of multiple observed sequences.
If the series were of unequal length the user was expected to
pad them out with NAs to equalize the lengths.
The old matrix format for multiple observation sequences was
permitted for a while (and the matrix was internally changed into
a list) but this is no longer allowed. If y is indeed
given as a matrix then this corresponds to a single observation
sequence and it must have one (univariate setting) or two
(bivariate setting) columns which constitute the observations
of the respective variates.
If K=1 then tpm, ispd, converged,
and nstep are all set equal to NA in the list
returned by this function.
The estimate of ispd in the non-stationary setting
is inevitably very poor, unless the number of sequences of
observations (the length of the list y) is very large.
We have in effect “less than one” relevant observation for
each such sequence.
The returned values of tpm and Rho (or the entries
of Rho when Rho is a list) have dimension names.
These are the same as the dimension names of the starting values
of these objects (as provided in par0) if these exist.
Otherwise they are formed from the sorted unique values of
the observations in y or are taken to be the appropriate
sequences of integers. Likewise the returned value of ispd
is a named vector, the names being the same as the row (and
column) names of tpm and the corresponding dimension
names of Rho or of its entries.
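For example, dimension names supplied via par0 are carried through to the fitted values (a sketch only; the objects mirror those used in the first example below, with arbitrary state names):
# Name the states and emission values in the starting values.
Tpm <- matrix(c(0.75, 0.25, 0.25, 0.75), 2, 2, byrow=TRUE,
              dimnames=list(c("wet", "dry"), c("wet", "dry")))
Rho <- cbind(c(rep(1, 5), rep(0, 5)), c(rep(0, 5), rep(1, 5)))/5
dimnames(Rho) <- list(LETTERS[1:10], c("wet", "dry"))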
If method is equal to "EM" it may get
switched to "bf" at some EM step if there is a decrease
in the log likelihood. This is “theoretically impossible”
but can occur in practice due to an intricacy in the way that
the EM algorithm treats ispd when stationary
is TRUE. It turns out to be effectively impossible to
maximise the expected log likelihood unless the term in that
quantity corresponding to ispd is ignored (whence it
is ignored). Ignoring this term is “asymptotically
negligible” but can have the unfortunate effect of occasionally
leading to a decrease in the log likelihood.
If such a decrease is detected, then the algorithm switches over
to using the "bf" method of maximisation (which is not
beset by this difficulty). The current estimates of ispd,
tpm and Rho are used as starting values
for the "bf" method.
It seems to me that it should be the case that such
switching can occur only if stationary is TRUE.
However I have encountered instances in which the switch occurs
when stationary is FALSE. I have yet to figure
out/track down what is going on here.
In the univariate case the emission probabilities are, in the case
in which newstyle is TRUE (the default), specified
by means of a data frame Rho. The first column of Rho,
named y, is a factor consisting of the possible values
of the emissions, repeated K times (where K is
the number of states). The second column, named states,
is a factor consisting of integer values 1, 2, …, K.
Each of these values is repeated m times where m
is the length of yval. Further columns of Rho
are numeric and consist of coefficients of the linear predictor of
the probabilities of the various values of y. If X
is NULL then Rho has only one further column named
Intercept.
If X is not NULL then the Intercept
column is present only if addIntercept is TRUE.
There are as many (other, in addition to the possible Intercept
column) numeric columns as there are columns in X or in
the matrices in the list X. The names of these columns
are taken to be the column names of X or the first
entry of X if such column names are present. Otherwise the
names default to V1, V2 ….
The probabilities of the emissions taking on their
various possible values are given by $$\Pr(Y = y_i |
\boldsymbol{x}, \textrm{state}=S) = \ell_i/\sum_{j=1}^m
\ell_j$$
where \(\ell_j = \exp(\boldsymbol{\beta}_j^{\top}\boldsymbol{x})\),
where in turn \(\boldsymbol{x}\) is the vector of
predictors and \(\boldsymbol{\beta}_j\) is the coefficient
vector in the linear predictor that corresponds to \(y_j\)
and the hidden state \(S\). For identifiability the vectors
\(\boldsymbol{\beta}\) corresponding to the first value
of \(Y\) (the first level of Rho$y) are set equal to
the zero vector for all values of the state \(S\).
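In code, the mapping from coefficients to emission probabilities for a given state and predictor vector is as follows (a sketch only, following the formula above; the coefficient values shown are arbitrary):
# Emission probabilities for one state from linear-predictor coefficients.
m    <- 4                                         # number of possible emitted values
beta <- rbind(0, matrix(rnorm((m-1)*2), m-1, 2))  # one row per value; first row zero
x    <- c(1, 0.7)                                 # predictors: Intercept plus one covariate
ell  <- exp(as.vector(beta %*% x))                # ell_j = exp(beta_j' x)
prob <- ell/sum(ell)                              # Pr(Y = y_j | x, state = S)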
If newstyle is FALSE Rho is specified by
a matrix \(\textrm{Rho} = [\rho_{ij}]\) where
\(\rho_{ij} = \Pr(Y = y_i | S = j)\). If newstyle is FALSE and if X
is not NULL then an error is thrown.
In the independent bivariate case the emission probabilities are specified by a list of two such matrices. In this setting \(\Pr((Y_1,Y_2) = (y_{i_1},y_{i_2}) | S = j) = \rho^{(1)}_{i_1,j} \rho^{(2)}_{i_2,j}\) where \(R^{(k)} = [\rho^{(k)}_{ij}]\) (\(k = 1,2\)) are the two emission probability matrices. In the dependent bivariate case the emission probabilities are specified by a three dimensional array. In this setting \(\Pr((Y_1,Y_2) = (y_{i_1},y_{i_2}) | S = j) = \rho_{i_1,i_2,j}\) where \(R = [\rho_{i_1,i_2,j}]\) is the emission probability array.
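The corresponding R structures might be set up as follows (a sketch only of the shapes involved; the probabilities are arbitrary, normalised as the descriptions above require):
# Independent bivariate case: a list of two matrices, each column of each
# matrix being a probability distribution over that variate's values.
K    <- 2
Rho1 <- matrix(runif(3*K), 3, K); Rho1 <- sweep(Rho1, 2, colSums(Rho1), "/")
Rho2 <- matrix(runif(5*K), 5, K); Rho2 <- sweep(Rho2, 2, colSums(Rho2), "/")
Rho.indep <- list(Rho1, Rho2)
# Dependent bivariate case: an m1 x m2 x K array, each slice of which is a
# joint distribution over pairs of values (and hence sums to 1).
Rho.dep <- array(runif(3*5*K), dim=c(3, 5, K))
for(k in 1:K) Rho.dep[,,k] <- Rho.dep[,,k]/sum(Rho.dep[,,k])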
The hard work of calculating the recursive probabilities used
to fit the model is done by a Fortran subroutine "recurse"
(actually coded in Ratfor) which is dynamically loaded. In the
univariate case, when X is provided, the estimation of the
“linear predictor” vectors \(\boldsymbol{\beta}\)
is handled by the function multinom() from the nnet
package. Note that this is a “Recommended” package
and is thereby automatically available (i.e. does not have to
be installed).
Rabiner, L. R., "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE vol. 77, pp. 257-286, 1989.
Zucchini, W. and Guttorp, P., "A hidden Markov model for space-time precipitation," Water Resources Research vol. 27, pp. 1917-1923, 1991.
MacDonald, I. L., and Zucchini, W., "Hidden Markov and Other Models for Discrete-valued Time Series", Chapman & Hall, London, 1997.
Liu, Limin, "Hidden Markov Models for Precipitation in a Region of Atlantic Canada", Master's Report, University of New Brunswick, 1997.
# TO DO: Create one or more bivariate examples.
#
# The value of itmax in the following examples is so much
# too small as to be risible. This is just to speed up the
# R CMD check process.
# 1.
Yval <- LETTERS[1:10]
Tpm <- matrix(c(0.75,0.25,0.25,0.75),ncol=2,byrow=TRUE)
Rho <- cbind(c(rep(1,5),rep(0,5)),c(rep(0,5),rep(1,5)))/5
rownames(Rho) <- Yval
set.seed(42)
xxx <- rhmm(ylengths=rep(1000,5),nsim=1,tpm=Tpm,Rho=Rho,yval=Yval,drop=TRUE)
fit1 <- hmm(xxx,par0=list(tpm=Tpm,Rho=Rho),newstyle=FALSE,itmax=10)
print(fit1$Rho)
newRho <- cnvrtRho(Rho)
fit2 <- hmm(xxx,par0=list(tpm=Tpm,Rho=newRho),itmax=10)
print(round(cnvrtRho(fit2$Rho),6))
# Agrees with fit1$Rho to 5 decimal places; the difference
# is due to the fitting procedure used with the "new style"
# Rho, which calls upon multinom() from the nnet package.
# 2.
# See the help for rhmm() for how to generate y.num.
fit.num <- hmm(y.num,K=2,verb=TRUE,itmax=10)
fit.num.mix <- hmm(y.num,K=2,verb=TRUE,mixture=TRUE,itmax=10)
print(fit.num[c("tpm","Rho")])
# Note that states 1 and 2 get swapped.
# 3.
xxx <- with(SydColDisc,split(y,f=list(locn,depth)))
fitSydCol <- hmm(xxx,K=2,itmax=10) # Two states: above and below the thermocline.
# 4.
fitLesCount <- hmm(lesionCount,K=2,itmax=10) # Two states: relapse and remission.