fitmm: Maximum Likelihood Estimation (MLE) of a k-th order Markov chain

Description

Maximum Likelihood Estimation of the transition matrix and initial distribution of a k-th order Markov chain starting from one or several sequences.

Usage

fitmm(sequences, states, k = 1, init.estim = "mle")

Arguments

sequences

A list of vectors representing the sequences.

states

Vector of state space (of length s).

Order of the Markov chain.

init.estim

Optional. init.estim gives the method used to estimate the initial distribution. The following methods are proposed:

init.estim = "mle": the classical Maximum Likelihood Estimator is used to estimate the initial distribution init;
init.estim = "stationary": the initial distribution is replaced by the stationary distribution of the Markov chain (if the order of the Markov chain is more than one, the transition matrix is converted into a square block matrix in order to estimate the stationary distribution);
init.estim = "freq": the initial distribution is estimated by taking the frequencies of the words of length k for all sequences;
init.estim = "prod": init is estimated by using the product of the frequencies of each letter (for all the sequences) in the word of length k;
init.estim = "unif": the initial probability of each state is equal to \(1 / s\), with \(s\) the number of states.

Value

An object of class mm.

Details

Let \(X_1, X_2, ..., X_n\) be a trajectory of length \(n\) of the Markov chain \(X = (X_m)_{m \in N}\) of order \(k = 1\) with transition matrix \(p_{trans}(i,j) = P(X_{m+1} = j | X_m = i)\). The maximum likelihood estimation of the transition matrix is \(\widehat{p_{trans}}(i,j) = \frac{N_{ij}}{N_{i.}}\), where \(N_{ij}\) is the number of transitions from state \(i\) to state \(j\) and \(N_{i.}\) is the number of transition from state \(i\) to any state. For \(k > 1\) we have similar expressions.

The initial distribution of a k-th order Markov chain is defined as \(\mu_i = P(X_1 = i)\). Five methods are proposed for the estimation of the latter :

Maximum Likelihood Estimator:: The Maximum Likelihood Estimator for the initial distribution. The formula is: \(\widehat{\mu_i} = \frac{Nstart_i}{L}\), where \(Nstart_i\) is the number of occurences of the word \(i\) (of length \(k\)) at the beginning of each sequence and \(L\) is the number of sequences. This estimator is reliable when the number of sequences \(L\) is high.
Stationary distribution:: The stationary distribution is used as a surrogate of the initial distribution. If the order of the Markov chain is more than one, the transition matrix is converted into a square block matrix in order to estimate the stationary distribution. This method may take time if the order of the Markov chain is high (more than three (3)).
Frequencies of each word:: The initial distribution is estimated by taking the frequencies of the words of length k for all sequences. The formula is \(\widehat{\mu_i} = \frac{N_i}{N}\), where \(N_i\) is the number of occurences of the word \(i\) (of length \(k\)) in the sequences and \(N\) is the sum of the lengths of the sequences.
Product of the frequencies of each state:: The initial distribution is estimated by using the product of the frequencies of each state (for all the sequences) in the word of length \(k\).
Uniform distribution:: The initial probability of each state is equal to \(1 / s\), with \(s\), the number of states.

Examples

Run this code

# NOT RUN {
states <- c("a", "c", "g", "t")
s <- length(states)
k <- 2
vect.init <- rep.int(1 / s ^ k, s ^ k)
p <- matrix(0.25, nrow = s ^ k, ncol = s)

# Specify the Markov model
markov1 <- mm(states = states, init = vect.init, ptrans = p, k = k)

seq1 <- simulate(object = markov1, nsim = c(1000, 10000, 2000), seed = 150)

est <- fitmm(sequences = seq1, states = states, k = 2)

# }