Effects a maximum likelihood fit of a hidden Markov model to discrete data where the observations come from one of a number of finite discrete distributions, depending on the (hidden) state of the Markov chain. These distributions (the “emission probabilities”) are specified non-parametrically. The observations may be univariate, independent bivariate, or dependent bivariate. By default this function uses the EM algorithm. In the univariate setting it may alternatively use a “brute force” method.
hmm(y, yval=NULL, par0=NULL, K=NULL, rand.start=NULL,
method=c("EM","bf","LM","SD"), hglmethod=c("fortran","oraw","raw"),
optimiser=c("nlm","optim"), optimMethod=NULL, stationary=cis,
mixture=FALSE, cis=TRUE, indep=NULL, tolerance=1e-4, digits=NULL,
verbose=FALSE, itmax=200, crit=c("PCLL","L2","Linf","ABSGRD"),
X=NULL,keep.y=FALSE, keep.X=keep.y,
addIntercept=TRUE, lmc=10, hessian=FALSE,...)
A list with components:
The fitted value of the data frame, list of two matrices,
or array Rho
(in the case of a univariate model, a
bivariate independent model or a bivariate dependent model
respectively) specifying the distributions of the observations
(the “emission” probabilities).
Present only in the univariate setting. A matrix
whose entries are the (fitted) emission probabilities,
rows corresponding to values of the emissions and columns
to states. The columns sum to 1. This component provides
the same information as Rho
, but in a more readily
interpretable form.
The fitted value of the transition probability matrix tpm
.
Logical scalar; the value of the stationary
argument.
The fitted initial state probability distribution, or a matrix
of initial state probability distributions, one (column) of
ispd
for each observation sequence.
If stationary
is TRUE
then ispd
is assumed
to be the (unique) stationary distribution for the chain,
and thereby determined by the transition probability matrix
tpm
. If stationary
is FALSE
and cis
is TRUE
then ispd
is estimated as the mean of the
vectors of conditional probabilities of the states, given the
observation sequences, at time t=1
.
If cis
is FALSE
then ispd
is a matrix
whose columns are the vectors of conditional probabilities of
the states, given the observation sequences, at time t=1
,
as described above. (If there is only one observation sequence,
then this --- one-column --- matrix is converted into a vector.)
The final (maximal, we hope!) value of the log likelihood, as determined by the maximisation procedure.
The gradient of the log likelihood. Present only if the
method is "LM"
or "bf"
and in the latter
case then only if the optimiser is nlm()
.
The hessian of the log likelihood. Present only if the
method is "LM"
or "bf"
.
A vector of the (final) values of the stopping criteria, with
names "PCLL"
, "L2"
, "Linf"
unless the method
is "LM"
or "SD"
in which case this vector has a
fourth entry named "ABSGRD"
.
The starting values used by the algorithms. Either the argument
par0
, or a similar object with either or both components
(tpm
and Rho
) being created by rand.start()
.
The number of parameters in the fitted model. Equal to
nispar + ntpmpar + nrhopar
where (1) nispar
is
0
if stationary
is TRUE
and is K-1
otherwise; (2) ntpmpar
is K*(K-1)
; and (3) nrhopar
is
(nrow(Rho) - K)*(ncol(Rho)-2)
for univariate models,
K*(sum(sapply(Rho,nrow))-K)
for bivariate independent models, and
prod(dim(Rho))-K
for bivariate dependent models.
Numeric scalar. The number by which npar
is multiplied
to form the BIC
criterion. It is essentially the log
of the number of observations. See the code of hmm()
for details.
A logical scalar indicating whether the algorithm converged.
If the EM, LM or steepest descent algorithm was used it simply
indicates whether the stopping criterion was met before
the maximum number (itmax
) of steps was exceeded.
If method="bf"
then converged
is based on the
code
component of the object returned by the optimiser
when nlm()
was used, or on the convergence
component when optim()
was used. In these
cases converged
has an attribute (code
or convergence
respectively) giving the (integer) value
of the relevant component.
Note that in the nlm()
case a value of code
equal to 2 indicates “probable” convergence, and a value
of 3 indicates “possible” convergence. However in this
context converged
is set equal to TRUE
only
if code
is 1.
The number of steps performed by the algorithm if the method
was "EM"
, "LM"
or "SD"
. The value of
nstep
is set equal to the iterations
component of
the value returned by nlm()
if method="bf"
.
The number of EM steps that were taken before the method was
switched from "EM"
to "bf"
or to "LM"
.
Present only in values returned under the "bf"
or
"LM"
methods after a switch from "EM"
and is
equal to 0
if either of these methods was specified in
the initial call (rather than arising as the result of a switch).
Integer vector of the lengths of the observation sequences (number of rows if the observations are in the form of one or two column matrices).
A real number between 0 and 1 or a pair (two dimensional vector) of such numbers. Each number is the fraction of missing values in the corresponding component of the observations.
An object of class "tidyList"
. It is a tidied up version
of the observations; i.e. the observations y
after the
application of the undocumented function tidyList()
.
Present only if keep.y
is TRUE
.
An object of class "tidyList"
. It is a tidied up version
of the predictor matrix or list of predictor matrices; i.e. the
argument X
after the application of tidyList()
(with argument rp
set to "predictor"
). Present only
if X
is supplied, is an appropriate argument, and if
keep.X
is TRUE
.
Character string; "univar"
if the data were univariate,
"bivar"
if they were bivariate.
Logical scalar; TRUE
if the (original) data were numeric,
FALSE
otherwise.
The value of AIC = -2*log.like + 2*npar
for the fitted
model.
The value of BIC = -2*log.like + log(nobs)*npar
for the fitted
model. In the foregoing nobs
is the number of observations.
This is the number of non-missing values in unlist(y)
in the univariate setting and one half of this number in the
bivariate setting.
A list of argument values supplied. This component is
returned in the interest of making results reproducible.
It is also needed to facilitate the updating of a model
via the update method for the class hmm.discnp
,
update.hmm.discnp()
.
It has components:
method
optimiser
optimMethod
stationary
mixture
cis
tolerance
itmax
crit
addIntercept
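The parameter count and the information criteria described above can be checked by hand. The following is a minimal sketch in base R for a univariate, stationary model; the values of K, m, the number of columns of Rho, log.like and nobs are illustrative assumptions, not output from a real fit.

```r
# Sketch only: reproduce the npar, AIC and BIC bookkeeping for a
# univariate model, using the formulas stated in the Value section.
K       <- 2          # number of states (assumed)
m       <- 5          # number of possible y-values (assumed)
nrowRho <- K * m      # Rho has one row per (y-value, state) pair
ncolRho <- 6          # y, state, Intercept plus 3 predictor columns (assumed)

stationary <- TRUE
nispar  <- if (stationary) 0 else K - 1
ntpmpar <- K * (K - 1)
nrhopar <- (nrowRho - K) * (ncolRho - 2)
npar    <- nispar + ntpmpar + nrhopar

# Hypothetical log likelihood and number of observations.
log.like <- -1750
nobs     <- 1500
aic <- -2 * log.like + 2 * npar
bic <- -2 * log.like + log(nobs) * npar
print(c(npar = npar, AIC = aic, BIC = bic))
```

With these assumed values npar is 0 + 2 + 32 = 34.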
A vector or a list of vectors, or one or two column matrix
(bivariate setting) or a list of such matrices; missing values
are allowed. If y
is a vector, or list of vectors (of
discrete data) these vectors are coerced to one column matrices.
The entries of these vectors or matrices may be numeric or
character and are assumed to constitute discrete data.
A vector (of length m
, say) of possible values for the
data or a list of two such vectors (of lengths m1
and
m2
, say, one for each of the two variates in the bivariate
settings). These vectors default to the sorted unique values of
the respective variates as provided in y
. If yval
is supplied and any value of y
does not match some value
of yval
, then an error is thrown.
The argument yval
is provided so as to allow for fitting
of models to data in which some of the data values “of interest”
were never observed. The estimated emission probabilities of such
“never observed” values will of course be zero.
An optional (named) list of starting values for the
parameters of the model, with components tpm
(transition
probability matrix), optionally ispd
(initial state
probability distribution) and Rho
. The object Rho
specifies the probability that the observations take on each of
the possible values of the variate or variates, given the state
of the hidden Markov chain. See Details. Note that in
the case of independent bivariate data Rho
is a list
of two matrices. These matrices may (and in general will)
have different row dimensions, but must have identical column
dimensions (equal to K
, the number of states; see below).
If the model is stationary (i.e. if stationary
is
TRUE
) then you should almost surely not specify the
ispd
component of par0
. If you do specify it,
it really only makes sense to specify it to be the stationary
distribution determined by tpm
and this is a waste of
time since this is what the code will take ispd
to be if
you leave it unspecified.
If par0
is not specified, starting values are created by
the (undocumented) function init.all()
.
The number of states in the hidden Markov chain; if par0
is not specified K
MUST be; if par0
is
specified, K
is ignored.
Note that K=1
is acceptable; if K
is 1 then
all observations are treated as being independent and the
non-parametric estimate of the distribution of the observations
is calculated in the “obvious” way.
Either a logical scalar or a list consisting of two logical
scalars which must be named tpm
and Rho
. If the
former, it is converted internally into a list with entries
named tpm
and Rho
, both having the same value as
the original argument. If tpm
is TRUE then the function
init.all() chooses entries for the starting value of tpm
at random; likewise for Rho
. If left NULL
, this
argument defaults to list(tpm=FALSE,Rho=FALSE)
.
Character string, either "bf"
, "EM"
,
"LM"
or "SD"
(i.e. use numerical maximisation
via either nlm()
or optim()
, the EM algorithm, the
Levenberg-Marquardt algorithm, or the method of steepest descent).
May be abbreviated. Currently the "bf"
, "LM"
and
"SD"
methods can be used only in the univariate setting,
handle only stationary models (see below) and do not do mixtures.
Character string; one of "fortran"
, "oraw"
or
"raw"
. May be abbreviated. This argument determines the
procedure by which the hessian, gradient and log likelihood of
the model and data are calculated. If this argument is equal
to "fortran"
(the default) then (obviously!) dynamically
loaded fortran subroutines are used. The other two possibilities
effect the calculations in raw R; "oraw"
(“o”
for “original”) uses code that is essentially a direct
transcription of the fortran code, do-loops being replaced by
for-loops. With method "raw"
the for-loops are eliminated
and matrix-vector calculations are applied. The "oraw"
method is about 25 times slower than the "fortran"
method
and the "raw"
method is (surprisingly?) even worse;
it is more than 30 times slower. The “raw” methods are
present mainly for debugging purposes and would not usually be
used in practice. This argument is used only if the method
is "LM"
or "SD"
(and is involved only peripherally
in the latter instance). It is ignored otherwise.
Character string specifying the optimiser to use when the
“"bf"
” method of optimisation is chosen. It should be
one of "nlm"
or "optim"
, and may be abbreviated.
Ignored unless method="bf"
.
Character string specifying the optimisation method to be used by
optim()
. Should be one of "Nelder-Mead"
,
"BFGS"
, "CG"
, "L-BFGS-B"
, "SANN"
, or
"Brent"
. Ignored if the method
is not "bf"
or if the optimiser is not "optim"
.
Logical scalar. If TRUE
then the model is fitted under
the stationarity assumption, i.e. that the Markov chain was in
steady state at the time that observations commenced. In this
case the initial state probability distribution is estimated
as the stationary distribution determined by the (estimated)
transition probability matrix. Otherwise if cis
(see
below) is TRUE
the initial state probability distribution
is estimated as the mean of the vectors of conditional
probabilities of the states, given the observation sequences,
at time t=1
. If stationary
is TRUE
and
cis
is FALSE
an error is thrown. Currently if
the method is "bf"
, "LM"
or "SD"
, and
stationary
is FALSE
, then an error is thrown.
A logical scalar; if TRUE then a mixture model (all rows of the
transition probability matrix are identical) is fitted rather
than a general hidden Markov model. Currently an error is
thrown if mixture=TRUE
and the method is
"bf"
, "LM"
or "SD"
.
A logical scalar specifying whether there should be a
constant initial state probability
distribution. If stationary
is FALSE
and cis
is FALSE
then the initial state probability distribution
for a given observation sequence is equal to 1 where the (first)
maximum of the vector of conditional probabilities of the states,
given the observation sequences, at time t=1
, occurs,
and is 0 elsewhere. If stationary
is TRUE
and
cis
is FALSE
an error is given.
Logical scalar. Should the bivariate model be fitted under the
assumption that the two variables are (conditionally) independent
given the state? If this argument is left as NULL
its
value is inferred from the structure of Rho
in par0
if the latter is supplied. If the data are bivariate and neither
indep
nor par0
is supplied, then an error is given.
If the data are bivariate and if the value of indep
is inconsistent with the structure of par0$Rho
then an
error is given. If the data are univariate then indep
is ignored.
If the value of the quantity used for the stopping criterion
is less than tolerance then the algorithm is considered to
have converged. Ignored if method="bf"
. Defaults to
1e-4
.
Integer scalar. The number of digits to which to print out
“progress reports” (when verbose
is TRUE
).
There is a “sensible” default (calculated from
tolerance
). Not used if the method is "bf"
.
A logical scalar determining whether to print out details of
the progress of the algorithm. If the method is "EM"
,
"LM"
or "SD"
then when verbose
is TRUE
information about the convergence criteria is printed out at
every step that the algorithm takes. If method="bf"
then
the value of verbose
determines the value of the argument
print.level
of nlm()
or the value of the
argument trace
of optim()
. In the first
case, if verbose
is TRUE
then print.level
is set to 2, otherwise it is set to 0. In the second case,
if verbose
is TRUE
then trace
is set to 6,
otherwise it is set to 0.
When the method is "EM"
, "LM"
or "SD"
this is the maximum number of steps that the algorithm takes.
If the convergence criterion has not been met by the time
itmax
steps have been performed, a warning message
is printed out, and the function stops. A value is returned by
the function anyway, with the logical component converged
set
to FALSE
. When method="bf"
the itmax
argument
is passed to nlm()
as the value of iterlim
or to optim()
as the value of maxit
. If the
(somewhat obscure) convergence criteria of nlm()
or
optim()
have not been met by the time itmax
“iterations” have been performed, the algorithm ceases.
In this case, if nlm()
is used, the value of code
in the object returned is set equal to 4 and if optim()
is used then the value of convergence
returned is set
equal to 1. Note that the value of code
, respectively
convergence
is returned as an attribute of the converged
component
of the object returned by hmm()
. A value of 1 indicates
successful completion of the nlm()
procedure. A value of
0 indicates successful completion of the optim()
procedure.
The name of the stopping criterion used. When method="EM"
it must be one of "PCLL"
(percent change in log-likelihood;
the default), "L2"
(L-2 norm, i.e. square root of sum of
squares of change in coefficients), or "Linf"
(L-infinity
norm, i.e. maximum absolute value of change in coefficients).
When method="LM"
or method="SD"
there is a fourth
possibility, namely "ABSGRD"
the (maximum) absolute value
of the gradient. It may not be advisable to use this criterion
in the current context (i.e. that of discrete non-parametric
distributions). See Warnings. This argument defaults
to "PCLL"
. It is ignored if method="bf"
.
(The nlm()
and optim()
functions have their own
obscure stopping criteria.)
An optional numeric matrix, or a list of such
matrices, of “auxiliary” predictors. The use of
such predictors is (currently, at least) applicable only in the
univariate emissions setting. If X
is a list it must be
of the same length as y
and all entries of this list must
have the same number of columns. If the columns of any entry
of the list are named, then they must be named for all
entries, and the column names must be the same for all
entries. The number of rows of each entry must be equal to the
length of the corresponding entry of y
. If X
is
a matrix then y
should be a vector or one-column matrix
(or a list with a single entry equal to such).
There may be at most one constant column in X
or the
components thereof. If there are any constant columns
there must be precisely one (in all components of X
),
it must be the first column and all of its entries must be equal
to 1
. If the columns have names, the name of this first
column must be "Intercept"
.
Note that X
(or its entries) must be a numeric
matrix (or must be numeric matrices) --- not data frames! Factor
predictors are not permitted. It may be possible to use factor
predictors by supplying X
or its entries as the output of
model.matrix()
; this will depend on circumstances.
The fitted coefficients that are produced when X
is supplied,
are (to put it mildly) a bit difficult to interpret. See
Fitted Coefficients of Auxiliary Predictors for
some discussion.
Logical scalar; should the observations y
be returned as
a component of the value of this function?
Logical scalar; should the predictors X
be returned as
a component of the value of this function? Note that the
value of keep.X
will be silently set equal to FALSE
unless it actually “makes sense” to keep X
. I.e.
unless the observations are univariate
and X
is actually supplied, i.e. is
not NULL
.
Logical scalar. Should a column of ones, corresponding to
an intercept term, be prepended to each of the matrices in
the list X
? If each of these matrices already has an
initial column of ones, then setting addIntercept=TRUE
results in an error being thrown. If this is not the case,
then by default an initial column of ones is added.
Numeric scalar. The (initial) “Levenberg-Marquardt
correction” parameter. Used only if method="LM"
,
otherwise ignored.
Logical scalar. Should the hessian matrix be
returned? This argument is relevant only if method="bf"
(in which case it is passed along to hmmNumOpt()
) and is
ignored otherwise. This argument should be set to TRUE
only if you really want the hessian matrix. Setting it
to TRUE
causes a substantial delay between the time when
hmm()
finishes its iterations and when it actually returns
a value.
Additional arguments passed to hmmNumOpt()
.
There is one noteworthy argument useAnalGrad
which is used
“directly” by hmmNumOpt()
. This argument is a
logical scalar and if it is TRUE
then calls to nlm()
or optim()
are structured so that an analytic calculation
of the gradient vector (implemented by the internal function
get.gl()
) is applied. If it is FALSE
then finite
difference methods are used to calculate the gradient vector.
If this argument is not specified it defaults to FALSE
.
Note that the name of this argument cannot be abbreviated.
Other “additional arguments” may be supplied for the
control of nlm()
and are passed on appropriately
to nlm()
. These are used only if method="bf"
and if optimiser="nlm"
. These “...” arguments
might typically include gradtol
, stepmax
and
steptol
. They should NOT include print.level
or iterlim
. The former argument is automatically passed
to nlm()
as 0
if verbose
is FALSE
and as 2
if verbose
is TRUE
. The latter
argument is automatically passed to nlm()
with the value
of itmax
.
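As noted under the argument X, factor predictors must be turned into numeric columns before being supplied. The following is a minimal sketch using model.matrix(); the data frame dat and its columns are hypothetical. The "(Intercept)" column that model.matrix() creates is dropped here, on the assumption that addIntercept is left at its default of TRUE so that the column of ones is prepended by hmm().

```r
# Hypothetical predictor data: one factor and one numeric covariate.
dat <- data.frame(
  site  = factor(c("A", "B", "C", "A", "B", "C")),
  depth = c(1, 2, 3, 4, 5, 6)
)

# Expand the factor into 0/1 dummy columns.
X <- model.matrix(~ site + depth, data = dat)

# Drop the constant column; its name "(Intercept)" would not match the
# required name "Intercept" in any case.
X <- X[, colnames(X) != "(Intercept)", drop = FALSE]

print(X)
# X could then be passed as hmm(y, K = 2, X = X), with y a vector of
# length nrow(X).  (Call shown as a comment only; not run here.)
```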
A massive nest of bugs was eliminated in the transition from version
3.0-8 to version 3.0-9. These bugs arose in the context of using
auxiliary predictor variables (argument X
). The handling of
such auxiliary predictors was completely messed up. I am grateful
to Leah Walker for pointing out the problem to me.
The ordering of the (hidden) states can be arbitrary. What the estimation procedure decides to call “state 1” may not be what you think of as being state number 1. The ordering of the states will be affected by the starting values used.
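If the fitted states come out in an inconvenient order they can be relabelled after the fact. The following is a minimal sketch in base R, using a made-up transition matrix and a made-up emission matrix in the matrix-of-probabilities form; the object names are illustrative and nothing here comes from a real fit.

```r
# Made-up fitted values for a two state model.
tpm <- matrix(c(0.9, 0.1,
                0.2, 0.8), nrow = 2, byrow = TRUE)
Rho.matrix <- matrix(c(0.7, 0.2, 0.1,    # column 1: state 1
                       0.1, 0.3, 0.6),   # column 2: state 2
                     nrow = 3,
                     dimnames = list(c("a", "b", "c"), NULL))

# Swap the labels of states 1 and 2.
perm    <- c(2, 1)
tpm.new <- tpm[perm, perm]       # permute rows and columns together
Rho.new <- Rho.matrix[, perm]    # permute the state columns

print(tpm.new)
print(Rho.new)
```

The rows of the permuted transition matrix still sum to 1, and the columns of the permuted emission matrix still sum to 1; only the labels change.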
Some experiences with using the "ABSGRD"
stopping
criterion indicate that it may be problematic in the context of
discrete non-parametric distributions. For example a value of
1854.955 was returned after 200 LM steps in one (non-convergent,
of course!) attempt at fitting a model. The stopping criterion
"PCLL"
in this example took the “reasonable”
value of 0.03193748 when iterations ceased.
This function used to have an argument newstyle
,
a logical scalar (defaulting to TRUE
) indicating whether
(in the univariate setting) the emission probabilities
should be represented in “logistic” form. (See
Details, Univariate case:, above.) Now the
emission probabilities are always represented in the
“logistic” form. The component Rho
of the
starting parameter values par0
may still be supplied
as a matrix of probabilities (with columns summing to 1), but
this component is converted (internally, silently) to the
logistic form.
The object returned by this function also has (in the univariate
setting), in addition to the component Rho
, a component
Rho.matrix
giving the emission probabilities in the
more readily interpretable matrix-of-probabilities form. (See
Value above.)
The package used to require the argument y
to
be a matrix in the case of multiple observed sequences.
If the series were of unequal length the user was expected to
pad them out with NAs to equalize the lengths.
The old matrix format for multiple observation sequences was
permitted for a while (and the matrix was internally changed into
a list) but this is no longer allowed. If y
is indeed
given as a matrix then this corresponds to a single observation
sequence and it must have one (univariate setting) or two
(bivariate setting) columns which constitute the observations
of the respective variates.
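Data still held in the old NA-padded matrix layout can be converted to a list of sequences before calling hmm(). The following is a minimal sketch in base R; it assumes that each column of the matrix holds one sequence and that the padding consists of trailing NAs. The helper strip_trailing_na() is hypothetical and is not part of the package.

```r
# Remove trailing NAs only; NAs inside a sequence are genuine missing
# values and are kept.
strip_trailing_na <- function(v) {
  keep <- which(!is.na(v))
  if (length(keep) == 0) return(v[0])
  v[seq_len(max(keep))]
}

# Two made-up sequences of unequal length, padded with NA.
ymat <- cbind(s1 = c("a", "b", "a", "b"),
              s2 = c("b", "a", NA,  NA))

ylist <- lapply(seq_len(ncol(ymat)),
                function(j) strip_trailing_na(ymat[, j]))
print(lengths(ylist))
```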
If K=1
then tpm
, ispd
, converged
,
and nstep
are all set equal to NA
in the list
returned by this function.
The estimate of ispd
in the non-stationary setting
is inevitably very poor, unless the number of sequences of
observations (the length of the list y
) is very large.
We have in effect “less than one” relevant observation for
each such sequence.
The returned values of tpm
and Rho
(or the entries
of Rho
when Rho
is a list) have dimension names.
These are formed from the argument yval
if this is
supplied, otherwise from the sorted unique values of the
observations in y
. Likewise the returned value of
ispd
is a named vector, the names being the same as the
row (and column) names of tpm
.
If method
is equal to "EM"
there may be a
decrease (!!!) in the log likelihood at some EM step.
This is “theoretically impossible” but can occur in
practice due to an intricacy in the way that the EM algorithm
treats ispd
when stationary
is TRUE
.
It turns out to be effectively impossible to maximise the expected
log likelihood unless the term in that quantity corresponding
to ispd
is ignored (whence it is ignored).
Ignoring this term is “asymptotically negligible” but
can have the unfortunate effect of occasionally leading to a
decrease in the log likelihood.
If such a decrease is detected, then the algorithm terminates and issues a message to the effect that the decrease occurred. The message suggests that another method be used and that perhaps the results from the penultimate EM step (which are returned by this function) be used as starting values.
It seems to me that it should be the case that such a
decrease in the log likelihood can occur only if stationary
is TRUE
. However I have encountered instances in which
a decrease occurred when stationary
was FALSE
.
I have yet to figure out/track down what is going on here.
If the method
is "EM"
it is actually possible
for the log likelihood to decrease at some EM step.
This is “impossible in an ideal world” but can happen
owing to the fact that the EM algorithm, as implemented in this package
at least, cannot maximise the expected log likelihood if the
component corresponding to the initial state probability
distribution is taken into consideration. This component
should ideally be maximised subject to the constraint that
t(P)%*%ispd = ispd
, but this constraint seems to be
effectively impossible to impose. Lagrangian multipliers
don't cut it. Hence the summand in question is ignored at
the M-step. This usually works alright since the summand
is asymptotically negligible, but things can sometimes go
wrong. If such a decrease occurs, an error is thrown.
In previous versions of this package, instead of throwing
an error the hmm()
function would automatically switch
to either the "bf"
or the "LM"
method, depending on
whether a matrix X
of auxiliary predictors is supplied,
starting from the penultimate parameter estimates produced
by the EM algorithm. However this appears not to be a good
idea; those “penultimate estimates” appear not to be
good starting values for the other methods. Hence an error
is now thrown and the user is explicitly instructed to invoke
a different method, “starting from scratch”.
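The constraint t(P)%*%ispd = ispd mentioned above is what ties ispd to the transition probability matrix in the stationary case. The following is a minimal sketch in base R of how a stationary distribution can be computed from a given matrix; the function stat_dist() is a hypothetical helper and is not claimed to be the method used inside the package.

```r
# Solve t(P) %*% v = v together with sum(v) = 1, in the least squares
# sense, for a transition matrix P whose rows sum to 1.
stat_dist <- function(P) {
  K <- nrow(P)
  A <- rbind(t(P) - diag(K), rep(1, K))
  b <- c(rep(0, K), 1)
  qr.solve(A, b)
}

P    <- matrix(c(0.75, 0.25, 0.25, 0.75), ncol = 2, byrow = TRUE)
ispd <- stat_dist(P)
print(ispd)

# An asymmetric (made-up) example.
P2 <- matrix(c(0.9, 0.1, 0.2, 0.8), ncol = 2, byrow = TRUE)
print(stat_dist(P2))
```

The sketch assumes that the chain has a unique stationary distribution.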
It is of course of interest to understand the meaning of the
coefficients that are fitted to the predictors in the model.
If X
is supplied then the number of predictors is (as a rule)
one (for the intercept) plus the number of columns in each entry
of X
. We say “as a rule” because, e.g., the entries
of X
could each have an “intercept” column, or the
addIntercept
argument could be FALSE
. If X
is not supplied there is only one predictor, named Intercept
.
The interpretation of these predictor coefficients is a bit subtle.
To get an idea of what it's all about, consider the output from
example 4
. (See Examples). The fitted coefficients
in question are to be found in columns 3 and onward of the component
Rho
of the object returned by hmm()
. In the context
of example 4
, this object is fit.wap
. (The suffix
wap
stands for “with auxiliary predictors”.)
fit.wap$Rho
     y state Intercept     ma.com      nh.com     bo.com
1   lo     1 1.3810463  0.4527982 -3.27161353 -1.9563915
2  mlo     1 0.1255631 -1.1402546 -1.37713744  0.5946980
3    m     1 0.7356526  0.1523734 -2.70841817 -0.1794645
4  mhi     1 0.8479798 -0.2438988 -1.12544989 -0.9650320
5   hi     1 0.0000000  0.0000000  0.00000000  0.0000000
6   lo     2 3.9439410 -0.8355306 -0.77702276  1.4963631
7  mlo     2 2.6189880 -1.9373885 -0.09190623  0.8316870
8    m     2 2.1457317 -1.7276183  0.19524655 -0.3249485
9  mhi     2 1.8834139 -1.3760011 -0.59806309  1.2828365
10  hi     2 0.0000000  0.0000000  0.00000000  0.0000000
If you multiply the matrix consisting of the predictor coefficients
(columns 3 to 6 of Rho
in this instance) times a vector of
predictors you get, for each state, the “exponential form”
of the probabilities (“pre-probabilities”) for each of the
possible y
-values, given the vector of predictors.
E.g. set x <- c(1,1,0,0)
. This vector picks up the intercept
and indicates that the Malabar outfall has been commissioned,
the North Head outfall has not been commissioned, and the Bondi
Offshore outfall has not been commissioned.
Now set:
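# Select the numeric coefficient columns before converting to a matrix;
# as.matrix() on the whole data frame (which has factor columns) would
# give a character matrix.
pp1 <- (as.matrix(fit.wap$Rho[,3:6])%*%x)[1:5]
pp2 <- (as.matrix(fit.wap$Rho[,3:6])%*%x)[6:10]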
Note that pp1
consists of “exponential
probabilities” corresponding to state 1, and pp2
consists of “exponential probabilities” corresponding
to state 2. To convert the foregoing pre-probabilities to the
actual probabilities of the y
-values, we apply the ---
undocumented --- function expForm2p()
:
p1 <- expForm2p(pp1)
p2 <- expForm2p(pp2)
The value of p1
is
[1] 0.52674539 0.03051387 0.20456767 0.15400019 0.08417288
and that of p2
is
[1] 0.78428283 0.06926632 0.05322204 0.05819340 0.03503541
Note that p1
and p2
each sum to 1, as they should/must
do. This says, e.g., that when the system is in state 2, and
Malabar has been commissioned but North Head and Bondi Offshore
have not, the (estimated) probability that y
is "mhi"
(medium-high) is 0.05819340.
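The calculation above can be reproduced without the fitted object. The following is a minimal sketch in base R which types in the coefficients shown for fit.wap$Rho and applies an ordinary exponentiate-and-normalise step. The function softmax() is a hypothetical stand-in; the assumption that it does the same job as expForm2p() rests only on the fact that it reproduces the values of p1 and p2 quoted above.

```r
# Coefficients copied from the printed value of fit.wap$Rho.
# Columns: Intercept, ma.com, nh.com, bo.com.
B1 <- rbind(lo  = c(1.3810463,  0.4527982, -3.27161353, -1.9563915),
            mlo = c(0.1255631, -1.1402546, -1.37713744,  0.5946980),
            m   = c(0.7356526,  0.1523734, -2.70841817, -0.1794645),
            mhi = c(0.8479798, -0.2438988, -1.12544989, -0.9650320),
            hi  = c(0, 0, 0, 0))
B2 <- rbind(lo  = c(3.9439410, -0.8355306, -0.77702276,  1.4963631),
            mlo = c(2.6189880, -1.9373885, -0.09190623,  0.8316870),
            m   = c(2.1457317, -1.7276183,  0.19524655, -0.3249485),
            mhi = c(1.8834139, -1.3760011, -0.59806309,  1.2828365),
            hi  = c(0, 0, 0, 0))

# Exponentiate and normalise (shifted by the maximum for stability).
softmax <- function(v) {
  e <- exp(v - max(v))
  e / sum(e)
}

x  <- c(1, 1, 0, 0)            # intercept; Malabar commissioned only
p1 <- softmax(drop(B1 %*% x))  # state 1
p2 <- softmax(drop(B2 %*% x))  # state 2
print(round(p1, 8))
print(round(p2, 8))
```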
It may be of some interest to test the hypothesis that the predictors have any actual predictive power at all:
fit.nap <- hmm(xxx,yval=Yval,K=2,verb=TRUE)
# "nap" <--> no aux. preds
There is a bit of a problem here, in that the likelihood decreases at EM step 65. (See the warning message.)
We can check on this problem by refitting using method="LM".
fit.nap.lm <- hmm(xxx,yval=Yval,par0=fit.nap,method="LM",verb=TRUE)
Doing so produces only a small improvement in the log likelihood
(from -1821.425 to -1820.314), so we really could have ignored the
problem. We can now do anova(fit.wap,fit.nap)
which gives
$stat
[1] 153.5491

$df
[1] 24

$pvalue
[1] 7.237102e-21
Thus the p-value is effectively zero, saying that in this instance the auxiliary predictors appear to have a “significant” impact on the fit.
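The figures in this output can be cross-checked by hand. The following is a minimal sketch in base R; it assumes that the p-value is the upper tail probability of a chi-squared distribution and that the degrees of freedom are the difference in npar between the two models, computed from the formula for univariate models given under Value. Both assumptions are consistent with the numbers shown, but neither is taken from the package code.

```r
# Values copied from the anova output above.
stat <- 153.5491
df   <- 24
pvalue <- pchisq(stat, df = df, lower.tail = FALSE)
print(pvalue)

# Degrees of freedom as a difference in parameter counts:
# K = 2 states and 5 possible y-values, so Rho has 10 rows.
K <- 2
nrowRho  <- 10
npar.wap <- K * (K - 1) + (nrowRho - K) * (6 - 2)  # Rho has 6 columns
npar.nap <- K * (K - 1) + (nrowRho - K) * (3 - 2)  # Rho has 3 columns
print(npar.wap - npar.nap)
```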
Rolf Turner
r.turner@auckland.ac.nz
Univariate case:
In the univariate case the emission probabilities are specified by
means of a data frame Rho
. The first column of Rho
,
named "y"
, is a factor consisting of the possible values
of the emissions, repeated K
times (where K
is
the number of states). The second column, named state
,
is a factor consisting of integer values 1, 2, ..., K
.
Each of these values is repeated m
times where m
is the length of yval
. Further columns of Rho
are numeric and consist of coefficients of the linear predictor of
the probabilities of the various values of y
. If X
is NULL
then Rho
has only one further column named
Intercept
.
If X
is not NULL
then the Intercept
column is present only if addIntercept
is TRUE
.
There are as many (other, in addition to the possible Intercept
column) numeric columns as there are columns in X
or in
the matrices in the list X
. The names of these columns
are taken to be the column names of X
or the first
entry of X
if such column names are present. Otherwise the
names default to V1
, V2
....
The probabilities of the emissions taking on their
various possible values are given by
$$\Pr(Y = y_i | \boldsymbol{x}, \textrm{state}=S) =
\ell_i/\sum_{j=1}^m \ell_j$$
where \(\ell_j = \exp(\boldsymbol{\beta}_j^{\top}\boldsymbol{x})\),
\(\boldsymbol{x}\) is the vector
of predictors and \(\boldsymbol{\beta}_j\) is the
coefficient vector in the linear predictor that corresponds to
\(y_j\) and the hidden state \(S\). For identifiability the
vectors \(\boldsymbol{\beta}\) corresponding to
the first value of \(Y\) (the first level of Rho$y
) are
set equal to the zero vector for all values of the state \(S\).
Note that the Rho
component of the starting values
par0
may be specified as a matrix of probabilities,
with rows corresponding to possible values of the observations and
columns corresponding to states. That is the Rho
component
of par0
may be provided in the form \(\textrm{Rho} =
[\rho_{ij}]\) where \(\rho_{ij} = \Pr(Y = y_i
| S = j)\). This is permissible
as long as X
is NULL
and may be found to be more
convenient and intuitive. If the starting value for Rho
is provided in matrix form it is (silently) converted internally
into the data frame form, by the (undocumented) function
cnvrtRho()
.
When argument X
is not NULL
, it is
difficult to specify a “reasonable” value for the
Rho
component of par0
. One might try to specify
par0$Rho
in the data frame form. The question of how
to specify the columns of par0$Rho
corresponding to the
auxiliary predictors (columns of X
or of the entries of
X
) is a thorny one.
It is permissible in these circumstances to specify
par0$Rho
as a matrix of probabilities, just as one
would do if X
were NULL
. In this setting the
(undocumented) function checkStartVal()
converts the
matrix of probabilities to data frame form and then appends
columns, all of whose entries are 0, corresponding to the
auxiliary predictors. When par0
is unspecified, the
(undocumented) function init.all()
performs a similar
construction to accommodate a non-NULL
value of X
.
Whether the resulting starting value for Rho
makes any
real sense, is questionable. However little else can be done.
Independent bivariate case: the emission probabilities are specified by a list of two matrices. In this setting \(\Pr((Y_1,Y_2) = (y_{i_1},y_{i_2}) | S = j) = \rho^{(1)}_{i_1,j} \rho^{(2)}_{i_2,j}\) where \(R^{(k)} = [\rho^{(k)}_{ij}]\) (\(k = 1,2\)) are the two emission probability matrices.
Dependent bivariate case: the emission probabilities are specified by a three dimensional array. In this setting \(\Pr((Y_1,Y_2) = (y_{i_1},y_{i_2}) | S = j) = \rho_{i_1,i_2,j}\) where \(R = [\rho_{ijk}]\) is the emission probability array.
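The two bivariate forms of the emission probabilities can be illustrated with small made-up objects. The following is a minimal sketch in base R; all of the numbers are invented for illustration and none come from a fitted model.

```r
# Independent case: a list of two matrices, columns indexed by state.
R1 <- matrix(c(0.2, 0.8,              # state 1
               0.6, 0.4), nrow = 2)   # state 2
R2 <- matrix(c(0.1, 0.3,  0.6,             # state 1
               0.5, 0.25, 0.25), nrow = 3) # state 2
Rho.indep <- list(R1, R2)

# Joint emission probabilities given state j: product of the marginals.
j <- 1
joint.indep <- outer(Rho.indep[[1]][, j], Rho.indep[[2]][, j])

# Dependent case: a 2 x 3 x K array; slice [, , j] is the joint
# distribution of the two variates given state j.
Rho.dep <- array(0, dim = c(2, 3, 2))
Rho.dep[, , 1] <- joint.indep
Rho.dep[, , 2] <- matrix(c(0.30, 0.10,
                           0.05, 0.25,
                           0.10, 0.20), nrow = 2)

print(joint.indep[2, 3])   # Pr(Y1 = 2nd value, Y2 = 3rd value | S = 1)
print(Rho.dep[2, 3, 2])    # the same event given S = 2, dependent form
print(apply(Rho.dep, 3, sum))
```

In both forms the probabilities for a given state sum to 1 over all pairs of values.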
The hard work of calculating the recursive probabilities used
to fit the model is done by a Fortran subroutine recurse
(actually coded in Ratfor) which is dynamically loaded. In the
univariate case, when X
is provided, the estimation of the
“linear predictor” vectors \(\boldsymbol{\beta}\)
is handled by the function multinom()
from the nnet
package. Note that this is a “Recommended” package
and is thereby automatically available (i.e. does not have to
be installed).
Rabiner, L. R., "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE vol. 77, pp. 257-286, 1989.
Zucchini, W. and Guttorp, P., "A hidden Markov model for space-time precipitation," Water Resources Research vol. 27, pp. 1917-1923, 1991.
MacDonald, I. L., and Zucchini, W., "Hidden Markov and Other Models for Discrete-valued Time Series", Chapman & Hall, London, 1997.
Liu, Limin, "Hidden Markov Models for Precipitation in a Region of Atlantic Canada", Master's Report, University of New Brunswick, 1997.
rhmm()
, mps()
,
viterbi()
# TO DO: Create one or more bivariate examples.
#
# The value of itmax in the following examples is so much
# too small as to be risible. This is just to speed up the
# R CMD check process.
# 1.
Yval <- LETTERS[1:10]
Tpm <- matrix(c(0.75,0.25,0.25,0.75),ncol=2,byrow=TRUE)
Rho <- cbind(c(rep(1,5),rep(0,5)),c(rep(0,5),rep(1,5)))/5
rownames(Rho) <- Yval
set.seed(42)
xxx <- rhmm(ylengths=rep(1000,5),nsim=1,tpm=Tpm,Rho=Rho,yval=Yval,drop=TRUE)
fit <- hmm(xxx,par0=list(tpm=Tpm,Rho=Rho),itmax=10)
print(fit$Rho) # A data frame
print(cnvrtRho(fit$Rho)) # A matrix of probabilities
# whose columns sum to 1.
# 2.
# See the help for logLikHmm() for how to generate y.num.
if (FALSE) {
fit.num <- hmm(y.num,K=2,verb=TRUE,itmax=10)
fit.num.mix <- hmm(y.num,K=2,verb=TRUE,mixture=TRUE,itmax=10)
print(fit.num[c("tpm","Rho")])
}
# Note that states 1 and 2 get swapped.
# 3.
xxx <- with(SydColDisc,split(y,f=list(locn,depth)))
Yval <- c("lo","mlo","m","mhi","hi")
# Two states: above and below the thermocline.
fitSydCol <- hmm(xxx,yval=Yval,K=2,verb=TRUE,itmax=10)
# 4.
X <- split(SydColDisc[,c("ma.com","nh.com","bo.com")],
f=with(SydColDisc,list(locn,depth)))
X <- lapply(X,function(x){
as.matrix(as.data.frame(lapply(x,as.numeric)))-1})
fit.wap <- hmm(xxx,yval=Yval,K=2,X=X,verb=TRUE,itmax=10)
# wap <--> with auxiliary predictors.
# 5.
if (FALSE) { # Takes too long.
fitlm <- hmm(xxx,yval=Yval,K=2,method="LM",verb=TRUE)
fitem <- hmm(xxx,yval=Yval,K=2,verb=TRUE)
# Algorithm terminates due to a decrease in the log likelihood
# at EM step 64.
newfitlm <- hmm(xxx,yval=Yval,par0=fitem,method="LM",verb=TRUE)
# The log likelihood improves from -1900.988 to -1820.314
}
# 6.
fitLesCount <- hmm(lesionCount,K=2,itmax=10) # Two states: relapse and remission.