Uses the EM algorithm to perform a maximum likelihood fit of a hidden Markov model to discrete data where the observations come from one of a number of finite discrete distributions, depending on the (hidden) state of the Markov chain. These distributions are specified (non-parametrically) by a matrix \(R = [\rho_{ij}]\) where \(\rho_{ij} = P(Y = y_i | S = j)\), \(Y\) being the observable random variable and \(S\) being the hidden state.
hmm(y, yval=NULL, par0=NULL, K=NULL, rand.start=NULL, stationary=cis,
mixture=FALSE, cis=TRUE, tolerance=1e-4, verbose=FALSE,
itmax=200, crit='PCLL',keep.y=TRUE, data.name=NULL)
A vector of discrete data or a list of such vectors; missing values are allowed.
A vector of possible values for the data; it defaults to the
sorted unique values of y
. If any value of y
does not match some value of yval
, it will be treated as
a MISSING VALUE.
An optional (named) list of starting values for the
parameters of the model, with components tpm
(transition
probability matrix) and Rho
. The matrix Rho
specifies the probability that the observations take on each
value in yval, given the state of the hidden Markov chain.
The columns of Rho
correspond to states, the rows to the
values of yval
.
If par0$tpm
has row and column names these must be the
same (otherwise an error is thrown). If par0$Rho
has
column names these must be the same as the row and column names
of par0$tpm
.
If par0
is not specified, starting values are created by
the function init.all()
.
The number of states in the hidden Markov chain; if par0
is
not specified K
MUST be; if par0
is specified, K
is ignored.
Note that K=1
is acceptable; if K
is 1 then
all observations are treated as being independent and the
non-parametric estimate of the distribution of the observations
is calculated in the obvious way.
A list consisting of two logical scalars which must be named
tpm
and Rho
, if tpm
is TRUE then the function
init.all() chooses entries for then starting value of tpm
at random; likewise for Rho
. This argument defaults to
list(tpm=FALSE,Rho=FALSE)
.
Logical scalar. If TRUE
then the model is fitted under
the stationarity assumption, i.e. that the Markov chain was in
steady state at the time that observations commenced. In this
case the initial state probability distribution is estimated
as the stationary distribution determined by the (estimated)
transition probability matrix. Otherwise if cis
(see
below) is TRUE
the initial state probability distribution
is estimated as the mean of the vectors of conditional
probabilities of the states, given the observation sequences,
at time t=1
. If stationary
is TRUE
and
cis
is FALSE
an error is given.
A logical scalar; if TRUE then a mixture model (all rows of the transition probability matrix are identical) is fitted rather than a general hidden Markov model.
A logical scalar specifying whether there should be a
constant initial state probability
distribution. If stationary
is FALSE
and cis
is FALSE
then the initial state probability distribution
for a given observation sequence is equal to 1 where the (first)
maximum of the vector of conditional probabilities of the states,
given the observation sequences, at time t=1
, occurs,
and is 0 elsewhere. If stationary
is TRUE
and
cis
is FALSE
an error is given.
If the value of the quantity used for the stopping criterion is less than tolerance then the EM algorithm is considered to have converged.
A logical scalar determining whether to print out details of the progress of the EM algorithm.
If the convergence criterion has not been met by the time
itmax
EM steps have been performed, a warning message
is printed out, and the function stops. A value is returned
by the function anyway, with the logical component "converged"
set to FALSE.
The name of the stopping criterion, which must be one of "PCLL" (percent change in log-likelihood; the default), "L2" (L-2 norm, i.e. square root of sum of squares of change in coefficients), or "Linf" (L-infinity norm, i.e. maximum absolute value of change in coefficients).
Logical scalar; should the observations y
be returned as
a component of the value of this function?
An identifying tag for the fit; if omitted, it defaults
to the name of data set y
as determined by
deparse(substitute(y))
.
A list with components:
The fitted value of the probability matrix Rho
specifying
the distributions of the observations (the “emission”
probabilities).
The fitted value of the transition probabilty matrix tpm
.
The fitted initial state probability distribution, or a matrix of
(trivial or “deterministic”) initial state probability
distributions, one (column) of ispd
for each observation
sequence.
If stationary
is TRUE
then ispd
is assumed
to be the (unique) stationary distribution for the chain,
and thereby determined by the transition probability matrix
tpm
. If stationary
is FALSE
and cis
is TRUE
then ispd
is estimated as the mean of the
vectors of conditional probabilities of the states, given the
observation sequences, at time t=1
.
If cis
is FALSE
then ispd
is a matrix
whose columns are trivial probability vectors, as described
above.
The final value of the log likelihood, as calculated through recursion.
A logical scalar saying whether the algorithm satisfied the convergence criterion before the maximum of itmax EM steps was exceeded.
The number of EM steps performed by the algorithm.
The observations (argument y
). Present only if
keep.y
is TRUE
.
An identifying tag, specified as an argument, or determined from the name of the argument y by deparse(substitute(y)).
The argument stationary
.
The argument cis
.
The package used to require the argument y
to
be a matrix in the case of multiple observed series.
If the series were of unequal length the user was expected to
pad them out with NAs to equalize the lengths. In the revision
from version 0.0-9 to 0.1-0 this was changed, requiring the
argument y
to be (more sensibly) a list when there are
multiple series. (Unfortunately this help page was not
correspondingly revised at the time of that transition nor of
the next one, and the old ``matrix'' format was left as being
the specified input format until the package was updated to
version 0.1-2.)
The old matrix format is still permitted (and the matrix
is internally changed into a list) but this is deprecated.
In some future version of hmm.discnp
this possibility
will be removed.
If K=1
then tpm
, ispd
, converged
,
and nstep
are all set equal to NA
in the list
returned by this function.
The estimate of ispd
in the non-stationary setting
is inevitably very poor, unless the number of sequences of
observations (the length of the list y
) is very large.
We have in effect ``less than one'' relevant observation for
each such sequence.
The returned values of tpm
and Rho
have row
and column names. These are the same as the row and column
names of the starting values of these matrices (as provided
in par0
is these exist. Otherwise they are taken to
be the appropriate sequences of integers. Likewise the
returned value of ispd
is a named vector, the names
being the same as the row (and column) names of tpm
and the column names of Rho
.
The ordering of the (hidden) states can be arbitrary. What the estimation procedure decides to call ``state 1'' may not be what you think of as being state number 1. The ordering of the states will be affected by the starting values used.
At some time in the future the (deprecated) option of being
able to specify argument y
as a matrix (in the setting
in which there are multiple data series to which a single model
is being fitted) will no longer be permitted.
The hard work is done by a Fortran subroutine "recurse" (actually coded in Ratfor) which is dynamically loaded.
Rabiner, L. R., "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE vol. 77, pp. 257 -- 286, 1989.
Zucchini, W. and Guttorp, P., "A hidden Markov model for space-time precipitation," Water Resources Research vol. 27, pp. 1917-1923, 1991.
MacDonald, I. L., and Zucchini, W., "Hidden Markov and Other Models for Discrete-valued Time Series", Chapman & Hall, London, 1997.
Liu, Limin, "Hidden Markov Models for Precipitation in a Region of Atlantic Canada", Master's Report, University of New Brunswick, 1997.
# NOT RUN {
# 1.
Yval <- LETTERS[1:10]
Tpm <- matrix(c(0.75,0.25,0.25,0.75),ncol=2,byrow=TRUE)
Rho <- cbind(c(rep(1,5),rep(0,5)),c(rep(0,5),rep(1,5)))/5
rownames(Rho) <- Yval
set.seed(42)
xxx <- sim.hmm(rep(1000,5),tpm=Tpm,Rho=Rho,yval=Yval)
fit <- hmm(xxx,par0=list(tpm=Tpm,Rho=Rho))
print(fit$Rho)
# 2.
# See the help for sim.hmm() for how to generate y.num.
# }
# NOT RUN {
fit.num <- hmm(y.num,K=2,verb=TRUE)
fit.num.mix <- hmm(y.num,K=2,verb=TRUE,mixture=TRUE)
print(fit.num[c("tpm","Rho")])
# }
# NOT RUN {
# Note that states 1 and 2 get swapped.
# 3.
xxx <- with(colifCount,split(y,f=locn))
fitColCount <- hmm(xxx,K=2) # Two states: above and below the thermocline.
# 4.
fitLesCount <- hmm(lesionCount,K=2) # Two states: relapse and remission.
# }
Run the code above in your browser using DataLab