Learn R Programming

dsmmR (version 1.0.7)

fit_dsmm: Estimation of a drifting semi-Markov chain

Description

Estimation of a drifting semi-Markov chain, given one sequence of states. This estimation can be parametric or non-parametric and is available for the three types of drifting semi-Markov models.

Usage

fit_dsmm(
  sequence,
  degree,
  f_is_drifting,
  p_is_drifting,
  states = NULL,
  initial_dist = "unif",
  estimation = "nonparametric",
  f_dist = NULL
)

Value

Returns an object of S3 class (dsmm_fit_nonparametric, dsmm)

or (dsmm_fit_parametric, dsmm). It has the following attributes:

  • dist : List. Contains 2 or 3 arrays.

    • If estimation = "nonparametric" we have 2 arrays:

      • p_drift or p_notdrift, corresponding to whether the defined \(p\) transition matrix is drifting or not.

      • f_drift or f_notdrift, corresponding to whether the defined \(f\) sojourn time distribution is drifting or not.

    • If estimation = "parametric" we have 3 arrays:

      • p_drift or p_notdrift, corresponding to whether the defined \(p\) transition matrix is drifting or not.

      • f_drift_parametric or f_notdrift_parametric, corresponding to whether the defined \(f\) sojourn time distribution is drifting or not.

      • f_drift_parameters or f_notdrift_parameters, which are the defined \(f\) sojourn time distribution parameters, depending on whether \(f\) is drifting or not.

  • emc : Character vector that contains the embedded Markov chain \((J_{t})_{t\in \{0,\dots,n\}}\) of the original sequence. It is this attribute of the object that describes the size of the model \(n\). Last state is also included, for a total length of \(n+1\), but it is not used for any calculation.

  • soj_times : Numerical vector that contains the sojourn times spent for each state in emc before the jump to the next state. Last state is also included, for a total length of \(n+1\), but it is not used for any calculation.

  • initial_dist : Numerical vector that contains an estimation for the initial distribution of the realized states in sequence. It always has values between \(0\) and \(1\).

  • states : Character vector. Passing down from the arguments. It contains the realized states given in the argument sequence.

  • s : Positive integer that contains the length of the character vector given in the attribute states, which is equal to \(s = |E|\).

  • degree : Positive integer. Passing down from the arguments. It contains the polynomial degree \(d\) considered for the drifting of the model.

  • k_max : Numerical value that contains the maximum sojourn time, which is the maximum value in soj_times, excluding the last state.

  • model_size : Positive integer that contains the size of the drifting semi-Markov model \(n\), which is equal to the length of the embedded Markov chain \((J_{t})_{t\in \{0,\dots,n\}}\), minus the last state. It has a value of length(emc) - 1, for emc as defined above.

  • f_is_drifting : Logical. Passing down from the arguments. Specifies if \(f\) is drifting or not.

  • p_is_drifting : Logical. Passing down from the arguments. Specifies if \(p\) is drifting or not.

  • Model : Character. Possible values:

    • "Model_1" : Both \(p\) and \(f\) are drifting.

    • "Model_2" : \(p\) is drifting and \(f\) is not drifting.

    • "Model_3" : \(f\) is drifting and \(p\) is not drifting.

  • estimation : Character. Specifies whether parametric or nonparametric estimation was used.

  • A_i : Numerical Matrix. Represents the polynomials \(A_i(t)\) with degree \(d\) that were used for solving the system \(MJ = P\). Used for the methods defined for the object. Not printed when viewing the object.

  • J_i : Numerical Array. Represents the estimated semi-Markov kernels of the first model \((\widehat{q}_{\frac{i}{d}}^{\ (1)}(u,v,l))_{i\in\{0,\dots,d\}}\) that were obtained after solving the system \(MJ = P\). Not printed when viewing the object.

Arguments

sequence

Character vector that represents a sequence of states from the state space \(E\).

degree

Positive integer that represents the polynomial degree \(d\) for the drifting semi-Markov model.

f_is_drifting

Logical. Specifies if \(f\) is drifting or not.

p_is_drifting

Logical. Specifies if \(p\) is drifting or not.

states

Character vector that represents the state space \(E\), with length equal to \(s = |E|\). Default value is set equal to the sorted, unique states present in the given sequence.

initial_dist

Optional. Character that represents the method to estimate the initial distribution.

  • "unif" : The initial distribution of each state is equal to \(1/s\) (default value).

  • "freq" : The initial distribution of each state is equal to the frequency that it has in the sequence.

estimation

Optional. Character. Represents whether the estimation will be nonparametric or parametric.

  • "nonparametric" : The estimation will be non-parametric (default value).

  • "parametric" : The estimation will be parametric.

f_dist

Optional. It can be defined in two ways:

  • If estimation = "nonparametric", it is equal to NULL (default value).

  • If estimation = "parametric", it is a character array that specifies the distributions of the sojourn times, for every state transition. The list of possible values is: ["unif", "geom", "pois", "dweibull", "nbinom", NA]. It can be defined in two ways:

    • If \(f\) is not drifting, it has dimensions of \(s \times s\).

    • If \(f\) is drifting, it has dimensions of \(s \times s \times (d+1)\) (see more in Details, Parametric Estimation.)

    It is defined similarly to the attribute f_dist in dsmm_parametric.

Details

This function estimates a drifting semi-Markov model in the parametric and non-parametric case. The parametric estimation can be achieved by the following steps:

  1. We obtain the non-parametric estimation of the sojourn time distributions.

  2. We estimate the parameters for the distributions defined in the attribute f_dist through the probabilities that were obtained in step 1.

Three different models are possible for to be estimated for each case. A normalization technique is used in order to correct estimation errors from small sequences. We will use the exponentials \((1), (2), (3)\) to distinguish between the drifting semi-Markov kernel \(\widehat{q}_{\frac{t}{n}}\) and the semi-Markov kernels \(\widehat{q}_\frac{i}{d}\) used in Model 1, Model 2, Model 3. More about the theory of drifting semi-Markov models in dsmmR.

Non-parametric Estimation

Model 1

When the transition matrix \(p\) of the embedded Markov chain \((J_{t})_{t\in \{0,\dots,n\}}\) and the conditional sojourn time distribution \(f\) are both drifting, the drifting semi-Markov kernel can be estimated as: $$\widehat{q}_{\frac{t}{n}}^{\ (1)}(u,v,l) = \sum_{i = 0}^{d}A_{i}(t)\ \widehat{q}_{\frac{i}{d}}^{\ (1)}(u,v,l),$$ \(\forall t \in \{0,\dots,n\}, \forall u,v\in E, \forall l \in \{1,\dots, k_{max} \} \), where \(k_{max}\) is the maximum sojourn time that was observed in the sequence and \(A_i, i = 0, \dots, d\) are \(d + 1\) polynomials with degree \(d\) (see dsmmR).

The semi-Markov kernels \(\widehat{q}_{\frac{i}{d}}^{\ (1)}(u,v,l), i = 0, \dots, d\), are estimated through Least Squares Estimation (LSE) and are obtained after solving the following system, \(\forall t \in \{0, \dots, n\}\), \(\forall u, v \in E\) and \(\forall l \in \{1, \dots, k_{max}\}\): $$MJ = P,$$ where the matrices are written as:

  • \(M = (M_{ij})_{i,j \in \{0, \dots, d\} } = \left(\sum_{t=1}^{n}1_{u}(t)A_{i}(t)A_{j}(t)\right)_{ i,j \in \{0, \dots, d\}}\)

  • \(J = (J_i)_{i \in \{0, \dots, d\} } = \left(\widehat{q}_{\frac{i}{d}}^{\ (1)}(u,v,l)\right)_{ i \in \{0, \dots, d\}} \)

  • \(P=(P_i)_{i\in \{0, \dots, d\} }= \left(\sum_{t=1}^{n}1_{uvl}(t)A_{i}(t)\right)_{ i \in \{0, \dots, d\}} \)

and we use the following indicator functions:

  • \(1_{u}(t) = 1_{ \{J_{t-1} = u \} } = 1\), if at \(t\) the previous state is \(u\), \(0\) otherwise.

  • \(1_{uvl}(t) = 1_{ \{J_{t-1} = u, J_{t} = v, X_{t} = l \} } = 1\), if at \(t\) the previous state is \(u\) with sojourn time \(l\) and next state \(v\), \(0\) otherwise.

In order to obtain the estimations of \(\widehat{p}_{\frac{i}{d}}(u,v)\) and \(\widehat{f}_{\frac{i}{d}}(u,v,l)\), we use the following formulas:

$$\widehat{p}_{\frac{i}{d}}(u,v) = \sum_{l = 1}^{k_{max}}\widehat{q}_{\frac{i}{d}}^{\ (1)}(u,v,l),$$ $$\widehat{f}_{\frac{i}{d}}(u,v,l) = \frac{\widehat{q}_{\frac{i}{d}}^{\ (1)}(u,v,l)}{ \sum_{l = 1}^{k_{max}}\widehat{q}_{\frac{i}{d}}^{\ (1)}(u,v,l)}.$$

Model 2

In this case, \(p\) is drifting and \(f\) is not drifting. Therefore, the estimated drifting semi-Markov kernel will be given by:

$$\widehat{q}_{\frac{t}{n}}^{\ (2)}(u,v,l) = \sum_{i=0}^{d} A_{i}(t)\ \widehat{q}_{\frac{i}{d}}^{\ (2)}(u,v,l),$$

\(\forall t \in \{0,\dots,n\}, \forall u,v\in E, \forall l\in \{1,\dots, k_{max} \}\), where \(k_{max}\) is the maximum sojourn time that was observed in the sequence and \(A_i, i = 0, \dots, d\) are \(d + 1\) polynomials with degree \(d\) (see dsmmR). In order to obtain the estimators \(\widehat{p}\) and \(\widehat{f}\), we use the estimated semi-Markov kernels \(\widehat{q}_{\frac{i}{d}}^{\ (1)}\) from Model 1. Since \(p\) is drifting, we define the estimation \(\widehat{p}\) the same way as we did in Model 1. In total, we have the following estimations, \(\forall u,v \in E, \forall l \in \{1,\dots, k_{max} \}\):

$$\widehat{p}_{\frac{i}{d}}(u,v) = \sum_{l = 1}^{k_{max}}\widehat{q}_{\frac{i}{d}}^{\ (1)}(u,v,l),$$ $$\widehat{f}(u,v,l) = \frac{\sum_{i=0}^{d}\widehat{q}_{\frac{i}{d}}^{\ (1)}(u,v,l)}{ \sum_{i=0}^{d}\sum_{l=1}^{k_{max}}\widehat{q}_{\frac{i}{d}}^{\ (1)}(u,v,l)}.$$

Thus, the estimated semi-Markov kernels for Model 2, \(\widehat{q}_{\frac{i}{d}}^{\ (2)}(u,v,l) = \widehat{p}_{\frac{i}{d}}(u,v)\widehat{f}(u,v,l)\), can be written with regards to the estimated semi-Markov kernels of Model 1, \(\widehat{q}_{\frac{i}{d}}^{\ (1)}\), as in the following:

$$\widehat{q}_{\frac{i}{d}}^{\ (2)}(u,v,l) = \frac{ (\sum_{l=1}^{k_{max}}\widehat{q}_{\frac{i}{d}}^{\ (1)}(u,v,l)) (\sum_{i = 0}^{d}\widehat{q}_{\frac{i}{d}}^{\ (1)}(u,v,l))}{ \sum_{i = 0}^{d}\sum_{l=1}^{k_{max}}\widehat{q}_{\frac{i}{d}}^{\ (1)}(u,v,l)}.$$

Model 3

In this case, \(f\) is drifting and \(p\) is not drifting. Therefore, the estimated drifting semi-Markov kernel will be given by: $$\widehat{q}_{\frac{t}{n}}^{\ (3)}(u,v,l) = \sum_{i=0}^{d} A_{i}(t)\ \widehat{q}_{\frac{i}{d}}^{\ (3)}(u,v,l),$$ \(\forall t \in \{0,\dots,n\}, \forall u,v\in E, \forall l\in \{1,\dots, k_{max} \}\), where \(k_{max}\) is the maximum sojourn time that was observed in the sequence and \(A_i, i = 0, \dots, d\) are \(d + 1\) polynomials with degree \(d\) (see dsmmR). In order to obtain the estimators \(\widehat{p}\) and \(\widehat{f}\), we use the estimated semi-Markov kernels estimated semi-Markov kernels \(\widehat{q}_{\frac{i}{d}}^{\ (1)}\) from Model 1. Since \(f\) is drifting, we define the estimation \(\widehat{f}\) the same way as we did in Model 1. In total, we have the following estimations, \(\forall u,v \in E, \forall l \in \{1,\dots, k_{max} \}\):

$$\widehat{p}\ (u,v) = \frac{\sum_{i=0}^{d}\sum_{l=1}^{k_{max}} \widehat{q}_{\frac{i}{d}}^{\ (1)}(u,v,l)}{d+1},$$ $$\widehat{f}_{\frac{i}{d}}(u,v,l) = \frac{\widehat{q}_{\frac{i}{d}}^{\ (1)}(u,v,l)}{ \sum_{l = 1}^{k_{max}}\widehat{q}_{\frac{i}{d}}^{\ (1)}(u,v,l)}.$$

Thus, the estimated semi-Markov kernels for Model 3, \(\widehat{q}_{\frac{i}{d}}^{\ (3)}(u,v,l) = \widehat{p}\ (u,v)\widehat{f}_{\frac{i}{d}}(u,v,l)\), can be written with regards to the estimated semi-Markov kernels of Model 1, \(\widehat{q}_{\frac{i}{d}}^{\ (1)}\), as in the following:

$$\widehat{q}_{\frac{i}{d}}^{\ (3)}(u,v,l) = \frac{ \widehat{q}_{\frac{i}{d}}^{\ (1)}(u,v,l) \sum_{i=0}^{d}\sum_{l=1}^{k_{max}}\widehat{q}_{\frac{i}{d}}^{\ (1)}(u,v,l)} {(d+1)\sum_{l=1}^{k_{max}}\widehat{q}_{\frac{i}{d}}^{\ (1)}(u,v,l)}.$$

Parametric Estimation

In this package, the parametric estimation of the sojourn time distributions defined in the attribute f_dist is achieved as follows:

  1. We obtain the non-parametric LSE of the sojourn time distributions \(f\).

  2. We estimate the parameters for the distributions defined in f_dist through the probabilities of \(f\), estimated in previously in \(1\).

The available distributions for the modelling of the conditional sojourn times of the drifting semi-Markov model, defined from the argument f_dist, have their parameters estimated through the following formulas:

  • Geometric \((p)\):

    \(f(x) = p (1-p)^{x-1}\), where \(x = 1, 2, \dots,k_{max}\). We estimate the probability of success \(\widehat{p}\) as such: $$\widehat{p} = \frac{1}{E(X)}$$

  • Poisson \((\lambda)\):

    \(f(x) = \frac{\lambda^{x-1} exp(-\lambda)}{(x-1)!}\), where \(x = 1, 2,\dots,k_{max}\). We estimate \(\widehat{\lambda} > 0\) as such: $$\widehat{\lambda} = E(X)$$

  • Negative binomial \((\alpha, p)\):

    \(f(x)=\frac{\Gamma(x + \alpha - 1)}{\Gamma(\alpha)(x-1)!} p^{\alpha}(1-p)^{x-1}\), where \(x = 1, 2,\ldots,k_{max}\). \(\Gamma\) is the Gamma function, \(p\) is the probability of success and \(\alpha \in (0, +\infty) \) is the parameter describing the number of successful trials, or the dispersion parameter (the shape parameter of the gamma mixing distribution). We estimate them as such:

    $$\widehat{p} = \frac{E(X)}{Var(X)},$$ $$\widehat{\alpha} = E(X)\frac{\widehat{p}}{1 - \widehat{p}} = \frac{E(X)^2}{Var(X) - E(X)}.$$

  • Discrete Weibull of type 1 \((q, \beta)\):

    \(f(x)=q^{(x-1)^{\beta}}-q^{x^{\beta}}\), where \(x= 1, 2, \dots,k_{max}\), \(q\) is the first parameter with \(0 < q < 1\) and \(\beta \in (0, +\infty)\) the second parameter. We estimate them as such:

    $$\widehat{q} = 1 - f(1),$$ $$\widehat{\beta} = \frac{\sum_{i = 2}^{k_{max}} \log_{i}(\log_{\widehat{q}}(\sum_{j = 1}^{i}f(j)))}{k_{max} - 1}. $$

    Note that we require \(k_{max} \geq 2\) for estimating \(\widehat{\beta}\).

  • Uniform \((n)\): \(f(x) = 1/n\) where \(x = 1, 2, \dots, n\), for \(n\) a positive integer. We use a numerical method to obtain an estimator for \(\widehat{n}\) in this case.

References

V. S. Barbu, N. Limnios. (2008). semi-Markov Chains and Hidden semi-Markov Models Toward Applications - Their Use in Reliability and DNA Analysis. New York: Lecture Notes in Statistics, vol. 191, Springer.

Vergne, N. (2008). Drifting Markov models with Polynomial Drift and Applications to DNA Sequences. Statistical Applications in Genetics Molecular Biology 7 (1).

Barbu V. S., Vergne, N. (2019). Reliability and survival analysis for Drifting Markov models: modelling and estimation. Methodology and Computing in Applied Probability, 21(4), 1407-1429.

T. Nakagawa and S. Osaki. (1975). The discrete Weibull distribution. IEEE Transactions on Reliability, R-24, 300-301.

See Also

For the theoretical background of drifting semi-Markov models: dsmmR.

For sequence simulation: simulate.dsmm and create_sequence.

For drifting semi-Markov model specification: parametric_dsmm, nonparametric_dsmm

For the retrieval of the drifting semi-Markov kernel: get_kernel.

Examples

Run this code
# Create a random sequence
sequence <- create_sequence("DNA", len = 2000, seed = 1)
## Alternatively, we could obtain a sequence as follows:
## > data("lambda", package = "dsmmR")
## > sequence <- c(lambda)
states <- sort(unique(sequence))
degree <- 3


# ===========================================================================
# Nonparametric Estimation.
# Fitting a random sequence under distributions of unknown shape.
# ===========================================================================


# ---------------------------------------------------------------------------
# Both p and f are drifting - Model 1.
# ---------------------------------------------------------------------------

obj_model_1 <- fit_dsmm(sequence = sequence,
                        degree = degree,
                        f_is_drifting = TRUE,
                        p_is_drifting = TRUE,
                        states = states,
                        initial_dist = "freq",
                        estimation = "nonparametric", # default value
                        f_dist = NULL # default value
                        )

cat(paste0("We fitted a sequence with ", obj_model_1$Model, ",\n",
           "model size: n = ", obj_model_1$model_size, ",\n",
           "length of state space: s = ", obj_model_1$s, ",\n",
           "maximum sojourn time: k_max = ", obj_model_1$k_max, " and\n",
           "polynomial (drifting) Degree: d = ", obj_model_1$degree, ".\n"))

# Get the drifting p and f arrays.
p_drift <- obj_model_1$dist$p_drift
f_drift <- obj_model_1$dist$f_drift

cat(paste0("Dimension of p_drift: (s, s, d + 1) = (",
           paste(dim(p_drift), collapse = ", "), ").\n",
           "Dimension of f_drift: (s, s, k_max, d + 1) = (",
           paste(dim(f_drift), collapse = ", "), ").\n"))

# We can even check the embedded Markov chain and the sojourn times
# directly from the returned object, if we wish to do so.
# This is achieved through the `base::rle()` function, used on `sequence`.
model_emc <- obj_model_1$emc
model_sojourn_times <- obj_model_1$soj_times

# ---------------------------------------------------------------------------
# Fitting the sequence when p is drifting and f is not drifting - Model 2.
# ---------------------------------------------------------------------------


obj_model_2 <- fit_dsmm(sequence = sequence,
                        degree = degree,
                        f_is_drifting = FALSE,
                        p_is_drifting = TRUE)


cat(paste0("We fitted a sequence with ", obj_model_2$Model, ".\n"))


# Get the drifting p and non-drifting f arrays.
p_drift_2 <- obj_model_2$dist$p_drift
f_notdrift <- obj_model_2$dist$f_notdrift


all.equal.numeric(p_drift, p_drift_2) # p is the same as in Model 1.


cat(paste0("Dimension of f_notdrift: (s, s, k_max) = (",
           paste(dim(f_notdrift), collapse = ", "), ").\n"))



# ---------------------------------------------------------------------------
# Fitting the sequence when f is drifting and p is not drifting - Model 3.
# ---------------------------------------------------------------------------

obj_model_3 <- fit_dsmm(sequence = sequence,
                        degree = degree,
                        f_is_drifting = TRUE,
                        p_is_drifting = FALSE)

cat(paste0("We fitted a sequence with ", obj_model_3$Model, ".\n"))
# Get the drifting f and non-drifting p arrays.
p_notdrift <- obj_model_3$dist$p_notdrift
f_drift_3 <- obj_model_3$dist$f_drift
all.equal.numeric(f_drift, f_drift_3) # f is the same as in Model 1.
cat(paste0("Dimension of f_notdrift: (s, s) = (",
           paste(dim(p_notdrift), collapse = ", "), ").\n"))

# ===========================================================================
# Parametric Estimation
# Fitting a random sequence under distributions of known shape.
# ===========================================================================
### Comments
### 1.  For the parametric estimation it is recommended to use a common set
###     of distributions while only the parameters (of the sojourn times)
###     are drifting. This results in (generally) higher accuracy.
### 2.  This process is similar to that used in `dsmm_parametric()`.


s <- length(states)
# Getting the distributions for the states.
# Rows correspond to previous state `u`.
# Columns correspond to next state `v`.
f_dist_1 <- matrix(c(NA,         "unif",     "dweibull", "nbinom",
                     "pois",      NA,        "pois",     "dweibull",
                     "geom",     "pois",      NA,        "geom",
                     "dweibull", 'geom',     "pois",      NA),
                   nrow = s, ncol = s, byrow = TRUE)
f_dist <- array(f_dist_1, dim = c(s, s, degree + 1))
dim(f_dist)

# ---------------------------------------------------------------------------
# Both p and f are drifting - Model 1.
# ---------------------------------------------------------------------------

obj_fit_parametric <- fit_dsmm(sequence = sequence,
                               degree = degree,
                               f_is_drifting = TRUE,
                               p_is_drifting = TRUE,
                               states = states,
                               initial_dist = 'unif',
                               estimation = 'parametric',
                               f_dist = f_dist)
cat("The class of `obj_fit_parametric` is : (",
    paste0(class(obj_fit_parametric), collapse = ', '), ").\n")

# Estimated parameters.
f_params <- obj_fit_parametric$dist$f_drift_parameters

# The drifting sojourn time distribution parameters.
f_0 <- f_params[,,,1]
f_1.3 <- f_params[,,,2]
f_2.3 <- f_params[,,,3]
f_1 <- f_params[,,,4]

params <- paste0('q = ', round(f_params["c", "t", 1, ], 3),
                 ', beta = ', round(f_params["c", "t", 2, ], 3))
f_names <- c("f_0", paste0("f_", 1:(degree-1), "/", degree), "f_1")
all_names <- paste(f_names, ":", params)
cat("The drifting of the parameters for passing from \n",
    "`u` = 'c' to `v` = 't' under a discrete Weibull distribution is:",
    "\n", all_names[1], "\n", all_names[2],
    "\n", all_names[3], "\n", all_names[4])

# ---------------------------------------------------------------------------
# f is not drifting, only p is drifting - Model 2.
# ---------------------------------------------------------------------------

obj_fit_parametric_2 <- fit_dsmm(sequence = sequence,
                                 degree = degree,
                                 f_is_drifting = FALSE,
                                 p_is_drifting = TRUE,
                                 initial_dist = 'unif',
                                 estimation = 'parametric',
                                 f_dist = f_dist_1)

cat("The class of `obj_fit_parametric_2` is : (",
    paste0(class(obj_fit_parametric_2), collapse = ', '), ").\n")
# Estimated parameters.
f_params_2 <- obj_fit_parametric_2$dist$f_notdrift_parameters

params_2 <- paste0('q = ', round(f_params_2["c", "t", 1], 3),
                   ', beta = ', round(f_params_2["c", "t", 2], 3))

cat("Not-drifting parameters for passing from ",
    "`u` = 'c' to `v` = 't' \n under a discrete Weibull distribution are:\n",
    paste("f :", params_2))


# ===========================================================================
# `simulate()` and `get_kernel()` can be used for the two objects,
# `dsmm_fit_nonparametric` and `dsmm_fit_parametric`.
# ===========================================================================


sim_seq_nonparametric <- simulate(obj_model_1, nsim = 10)
str(sim_seq_nonparametric)


kernel_drift_parametric <- get_kernel(obj_fit_parametric, klim = 10)
str(kernel_drift_parametric)

Run the code above in your browser using DataLab