fitdmm: Point by point estimates of a k-th order drifting Markov Model

Description

Estimates the d+1 points of support transition matrices and the initial law (of dimension \(|E|^{k}\)) of a k-th order drifting Markov Model from one or several sequences.

Usage

fitdmm(
  sequences,
  order,
  degree,
  states,
  init.estim = c("mle", "freq", "prod", "stationary", "unif"),
  fit.method = c("sum"),
  ncpu = 2
)

Value

An object of class dmm

Arguments

sequences: A list of character vector(s) representing one (several) sequence(s)
order: Order of the Markov chain
degree: Degree of the polynomials (e.g., linear drifting if degree=1, etc.)
states: Vector of the state space, of length s > 1
init.estim: Default="mle". Method used to estimate the initial law. If init.estim = "mle", the classical Maximum Likelihood Estimator is used. If init.estim = "freq", the initial distribution is estimated from the frequencies of the words of length k over all sequences. If init.estim = "prod", it is estimated by the product of the frequencies of each letter (over all the sequences) in the word of length k. If init.estim = "stationary", it is estimated by the stationary law of the point of support transition matrices. If init.estim = "unif", the probability of each letter is set to \(\frac{1}{s}\). Alternatively, `init.estim` can be a user-supplied vector of length \(|E|^k\). See Details for the formulas.
fit.method: If sequences is a list of several character vectors of the same length, the usual LSE (least squares estimator) over the sample paths is used when fit.method="sum" (a list with a single character vector is a special case).
ncpu: Default=2. Number of cores used to parallelize the computation. If ncpu=-1, all available cores are used.
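
As an illustration of the expected input format, the sketch below builds a `sequences` list and a `states` vector from raw strings using base R only. The strings in `raw` are made-up toy data, not part of the package.

```r
# Hypothetical raw sequences (toy data), both of the same length
raw <- c("acgtac", "ttgaca")

# Split each string into single letters: a list of character vectors
sequences <- strsplit(raw, split = "")

# State space: the distinct letters observed in the sequences
states <- sort(unique(unlist(sequences)))
```

Here `sequences` is a list of two character vectors of equal length, which is the layout described for fit.method="sum".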

Author

Geoffray Brelurut, Alexandre Seiller

Details

The fitdmm function creates a drifting Markov model object dmm.

Let \(E=\{1,\ldots, s\}\), \(s < \infty\), be the finite state space of a random system whose time evolution is governed by a discrete-time stochastic process with values in \(E\). A sequence \(X_0, X_1, \ldots, X_n\) with state space \(E\) is said to be a linear drifting Markov chain (of order 1) of length \(n\) between the Markov transition matrices \(\Pi_0\) and \(\Pi_1\) if the distribution of \(X_t\), \(t = 1, \ldots, n\), is defined by \(P(X_t=v \mid X_{t-1} = u, X_{t-2}, \ldots ) = \Pi_{\frac{t}{n}}(u, v), \; u, v \in E\), where \(\Pi_{\frac{t}{n}}(u, v) = ( 1 - \frac{t}{n}) \Pi_0(u, v) + \frac{t}{n} \Pi_1(u, v), \; u, v \in E\). The linear drifting Markov model of order \(1\) can be generalized to a polynomial drifting Markov model of order \(k\) and degree \(d\). Let \(\Pi_{\frac{i}{d}} = (\Pi_{\frac{i}{d}}(u_1, \dots, u_k, v))_{u_1, \dots, u_k,v \in E}\), \(i = 0, \ldots, d\), be the \(d+1\) points of support Markov transition matrices (of order \(k\)) over the state space \(E\).
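
The linear drift formula can be illustrated with a minimal base R sketch. The two-state matrices `Pi0` and `Pi1` and the helper `drift_matrix()` are hypothetical names introduced for this illustration only; they are not part of drimmR.

```r
# Hypothetical two-state example; Pi0 and Pi1 are made-up transition matrices
states <- c("a", "b")
Pi0 <- matrix(c(0.9, 0.1,
                0.2, 0.8), nrow = 2, byrow = TRUE,
              dimnames = list(states, states))
Pi1 <- matrix(c(0.3, 0.7,
                0.6, 0.4), nrow = 2, byrow = TRUE,
              dimnames = list(states, states))

# Transition matrix at position t of a chain of length n (linear drift):
# Pi_{t/n} = (1 - t/n) * Pi0 + (t/n) * Pi1
drift_matrix <- function(t, n, Pi0, Pi1) {
  (1 - t / n) * Pi0 + (t / n) * Pi1
}

n <- 10
Pi_mid <- drift_matrix(5, n, Pi0, Pi1)  # halfway between Pi0 and Pi1
rowSums(Pi_mid)                         # each row still sums to 1
```

Because \(\Pi_{\frac{t}{n}}\) is a convex combination of two stochastic matrices, it is itself a stochastic matrix for every \(t\) between 0 and \(n\).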

The estimation of DMMs is carried out for 4 different types of data:

One can observe one sample path:

It is denoted by \(H(m,n):= (X_0,X_1, \ldots,X_{m})\), where \(m\) denotes the length of the sample path and \(n\) the length of the drifting Markov chain. Two cases can be considered:

m=n (a complete sample path),
m < n (an incomplete sample path).

One can also observe \(H\) i.i.d. sample paths:

They are denoted by \(H_i(m_i,n_i), i=1, \ldots, H\). Two cases are considered:

\(m_i=n_i=n \; \forall i=1, \ldots, H\) (complete sample paths of drifting Markov chains of the same length),
\(n_i=n \; \forall i=1, \ldots, H\) (incomplete sample paths of drifting Markov chains of the same length). In this case, the usual LSE over the sample paths is used.

The initial distribution of a k-th order drifting Markov Model is defined as \(\mu_i = P(X_1 = i)\). The initial distribution of the first k letters can be freely customised by the user, but five methods are available to estimate it:

Estimation based on the Maximum Likelihood Estimator: The Maximum Likelihood Estimator of the initial distribution. The formula is \(\widehat{\mu_i} = \frac{Nstart_i}{L}\), where \(Nstart_i\) is the number of occurrences of the word \(i\) (of length \(k\)) at the beginning of each sequence and \(L\) is the number of sequences. This estimator is reliable when the number of sequences \(L\) is high.
Estimation based on the frequency: The initial distribution is estimated by taking the frequencies of the words of length \(k\) for all sequences. The formula is \(\widehat{\mu_i} = \frac{N_i}{N}\), where \(N_i\) is the number of occurrences of the word \(i\) (of length \(k\)) in the sequences and \(N\) is the sum of the lengths of the sequences.
Estimation based on the product of the frequencies of each state: The initial distribution is estimated by using the product of the frequencies of each state (for all the sequences) in the word of length \(k\).
Estimation based on the stationary law of the point of support transition matrix for a word of length \(k\): The initial distribution is estimated using \(\mu(\Pi_{\frac{k-1}{n}})\).
Estimation based on the uniform law: \(\frac{1}{s}\)
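
The first three estimators can be sketched in base R directly from the formulas above. This is an illustration on made-up toy sequences, not the package's internal code; the helper `words_of()` and all object names are hypothetical.

```r
# Toy data (hypothetical): three short sequences over the states a, c, g, t
seqs <- list(c("a", "c", "g", "t", "a", "c"),
             c("a", "c", "c", "g"),
             c("g", "t", "a", "c", "g"))
k <- 2  # word length (order of the chain)

# All words of length k, one per starting position of a sequence
words_of <- function(x, k) {
  starts <- seq_len(length(x) - k + 1)
  vapply(starts, function(i) paste(x[i:(i + k - 1)], collapse = ""),
         character(1))
}

# "mle": Nstart_i / L, with L the number of sequences
first_words <- vapply(seqs, function(x) paste(x[1:k], collapse = ""),
                      character(1))
mu_mle <- table(first_words) / length(seqs)

# "freq": N_i / N, with N the sum of the lengths of the sequences
all_words <- unlist(lapply(seqs, words_of, k = k))
mu_freq <- table(all_words) / sum(lengths(seqs))

# "prod": product of the single-letter frequencies of the letters in a word
letter_freq <- table(unlist(seqs)) / sum(lengths(seqs))
mu_prod_ac <- as.numeric(letter_freq["a"] * letter_freq["c"])
```

With this toy data, the word "ac" starts two of the three sequences, occurs four times overall, and the total sequence length is 15. Note that with \(N\) taken as the sum of the sequence lengths, as in the formula above, the "freq" values add up to slightly less than 1 when \(k > 1\), since each sequence contains fewer words than letters.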

References

BaVe2018drimmR Ver08drimmR

Examples

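library(drimmR)  # provides fitdmm()
# Example dataset shipped with drimmR
data(lambda, package = "drimmR")
states <- c("a", "c", "g", "t")
order <- 1   # first-order Markov chain
degree <- 1  # linear drift
fitdmm(lambda, order, degree, states, init.estim = "freq", fit.method = "sum")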
