Maximum Likelihood Estimation of the transition matrix and initial distribution of a k-th order Markov chain starting from one or several sequences.
fitmm(sequences, states, k = 1, init.estim = "mle")A list of vectors representing the sequences.
Vector of state space (of length s).
Order of the Markov chain.
Optional. init.estim gives the method used to estimate
the initial distribution. The following methods are proposed:
init.estim = "mle": the classical Maximum Likelihood Estimator
is used to estimate the initial distribution init;
init.estim = "stationary": the initial distribution is replaced by
the stationary distribution of the Markov chain (if the order of the
Markov chain is more than one, the transition matrix is converted
into a square block matrix in order to estimate the stationary
distribution);
init.estim = "freq": the initial distribution is estimated by
taking the frequencies of the words of length k for all sequences;
init.estim = "prod": init is estimated by using the product
of the frequencies of each letter (for all the sequences) in the word
of length k;
init.estim = "unif": the initial probability of each state is
equal to \(1 / s\), with \(s\) the number of states.
An object of class mm.
Let \(X_1, X_2, ..., X_n\) be a trajectory of length \(n\) of the Markov chain \(X = (X_m)_{m \in N}\) of order \(k = 1\) with transition matrix \(p_{trans}(i,j) = P(X_{m+1} = j | X_m = i)\). The maximum likelihood estimation of the transition matrix is \(\widehat{p_{trans}}(i,j) = \frac{N_{ij}}{N_{i.}}\), where \(N_{ij}\) is the number of transitions from state \(i\) to state \(j\) and \(N_{i.}\) is the number of transition from state \(i\) to any state. For \(k > 1\) we have similar expressions.
The initial distribution of a k-th order Markov chain is defined as \(\mu_i = P(X_1 = i)\). Five methods are proposed for the estimation of the latter :
The Maximum Likelihood Estimator for the initial distribution. The formula is: \(\widehat{\mu_i} = \frac{Nstart_i}{L}\), where \(Nstart_i\) is the number of occurences of the word \(i\) (of length \(k\)) at the beginning of each sequence and \(L\) is the number of sequences. This estimator is reliable when the number of sequences \(L\) is high.
The stationary distribution is used as a surrogate of the initial distribution. If the order of the Markov chain is more than one, the transition matrix is converted into a square block matrix in order to estimate the stationary distribution. This method may take time if the order of the Markov chain is high (more than three (3)).
The initial distribution is estimated
by taking the frequencies of the words of length k for all sequences.
The formula is \(\widehat{\mu_i} = \frac{N_i}{N}\), where \(N_i\)
is the number of occurences of the word \(i\) (of length \(k\)) in
the sequences and \(N\) is the sum of the lengths of the sequences.
The initial distribution is estimated by using the product of the frequencies of each state (for all the sequences) in the word of length \(k\).
The initial probability of each state is equal to \(1 / s\), with \(s\), the number of states.
# NOT RUN {
states <- c("a", "c", "g", "t")
s <- length(states)
k <- 2
vect.init <- rep.int(1 / s ^ k, s ^ k)
p <- matrix(0.25, nrow = s ^ k, ncol = s)
# Specify the Markov model
markov1 <- mm(states = states, init = vect.init, ptrans = p, k = k)
seq1 <- simulate(object = markov1, nsim = c(1000, 10000, 2000), seed = 150)
est <- fitmm(sequences = seq1, states = states, k = 2)
# }
Run the code above in your browser using DataLab