seqrep(seqdata, criterion="density", score=NULL,
decreasing=TRUE, trep=0.25, nrep=NULL,
tsim=0.1, dmax=NULL, dist.matrix=NULL, ...)
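seqdata is a state sequence object defined with the seqdef function.

criterion is the representativeness criterion used to sort the candidate sequences. Values include "freq" (sequence frequency), "density" (neighborhood density), "mscore" (mean state frequency) and "dist" (centrality).

nrep is the number of representative sequences. If NULL (default), the size of the representative set is controlled by trep.

dmax is the maximal theoretical distance between two sequences. If NULL, it is derived from the distance matrix.

dist.matrix is the matrix of pairwise dissimilarities between the sequences in seqdata. If NULL, the matrix is computed by calling the seqdist function. In that case, the optional arguments to seqdist should be supplied through the ... argument.

... passes optional arguments to the seqdist function, mainly dist.method specifying the metric for computing the distance matrix, norm for normalizing the distances, and indel and sm for the insertion/deletion and substitution costs.

The function returns an object of class stslist.rep. This is actually a state sequence object (containing a list of state sequences) with additional attributes.

The representative set can be plotted with the seqplot function using the type="r" argument, or the seqrplot alias.

For the neighborhood density criterion, the neighborhood diameter is set with tsim. We suggest setting it as a given proportion of the maximal theoretical distance between two sequences. Sequences are sorted in decreasing density order.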
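
The neighborhood density idea can be illustrated in base R without TraMineR. This is only a sketch under stated assumptions: the toy distance matrix `d` is made up, `dmax` is taken as the largest observed distance, a sequence counts as a neighbor when its distance is strictly below `tsim * dmax`, and each sequence counts itself. The package's internal rules may differ on these points.

```r
# Hypothetical symmetric dissimilarity matrix for four sequences.
d <- matrix(c(0, 2, 4, 6,
              2, 0, 2, 4,
              4, 2, 0, 2,
              6, 4, 2, 0), nrow = 4, byrow = TRUE)

dmax <- max(d)   # assumption: largest observed distance stands in for dmax
tsim <- 0.4      # illustrative proportion, chosen to suit the toy data
radius <- tsim * dmax

# Number of sequences (itself included) lying within the neighborhood.
density <- rowSums(d < radius)

# Candidates sorted in decreasing density order.
ranking <- order(density, decreasing = TRUE)
ranking
```

Sequences 2 and 3 sit in the middle of the toy set, so they have the densest neighborhoods and come first in the ranking.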
The mean state frequency criterion is the mean value of the transversal frequencies of the successive states.
Let $s=s_{1}s_{2}\cdots s_{\ell}$ be a sequence of length $\ell$ and $(f_{s_1}, f_{s_2}, \ldots, f_{s_\ell})$ the frequencies of its states at the (time-)positions $(t_1, t_2, \ldots, t_{\ell})$. The mean state frequency is the sum of the state frequencies divided by the sequence length:
$$MSF(s)=\frac{1}{\ell} \sum_{i=1}^{\ell} f_{s_{i}}$$
The lower and upper boundaries of $MSF$ are $0$ and $1$. $MSF$ is equal to $1$ when all the sequences in the set are the same, i.e. when there is a single distinct sequence. The most representative sequence is the one with the highest score.
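
As a rough illustration of the $MSF$ formula, the following base R sketch computes the score by hand on a made-up set of four sequences. The matrix `seqs` and the helper names are hypothetical and are not part of TraMineR.

```r
# Toy set of four sequences of length 3 (rows = sequences, columns = positions).
seqs <- matrix(c("A", "A", "B",
                 "A", "B", "B",
                 "A", "A", "B",
                 "B", "A", "B"),
               nrow = 4, byrow = TRUE)

# For each position, the transversal frequency of the state held by each sequence.
state_freq <- sapply(seq_len(ncol(seqs)), function(t) {
  col <- seqs[, t]
  as.numeric(table(col)[col]) / length(col)
})

# Mean state frequency: average of those frequencies along each sequence.
msf <- rowMeans(state_freq)
msf

# Index of the most representative sequence under this criterion.
which.max(msf)
```

Sequences 1 and 3 follow the most frequent state at every position and therefore obtain the highest score.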
The centrality criterion uses the sum of distances to all other sequences as a representativeness criterion. The
smaller the sum, the more representative the sequence.
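
A minimal base R sketch of the centrality criterion, assuming a made-up dissimilarity matrix `d` (in practice the matrix would come from seqdist):

```r
# Hypothetical symmetric dissimilarity matrix for four sequences.
d <- matrix(c(0, 2, 4, 6,
              2, 0, 2, 4,
              4, 2, 0, 2,
              6, 4, 2, 0), nrow = 4, byrow = TRUE)

# Sum of distances from each sequence to all the others
# (the zero diagonal does not affect the sum).
centrality <- rowSums(d)

# Ascending order: the smallest sum identifies the most central sequence.
ranking <- order(centrality)
ranking
```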
The sequence likelihood $P(s)$ is defined as the product of the probabilities with which each of its observed
successive states is supposed to occur at its position.
Let $s=s_{1}s_{2} \cdots s_{\ell}$ be a sequence of length $\ell$. Then
$$P(s)=P(s_{1},1) \cdot P(s_{2},2) \cdots P(s_{\ell},\ell)$$
where $P(s_{t},t)$ is the probability of observing state $s_t$ at position $t$.
The question is how to determine the state probabilities $P(s_{t},t)$. One commonly used method for
computing them is to postulate a Markov model, which can be of various orders. The implemented criterion uses the
probabilities derived from the first-order Markov model: each $P(s_{t},t)$, $t>1$, is set to the
transition rate $p(s_t|s_{t-1})$ estimated across sequences from the observations at positions $t$
and $t-1$. For $t=1$, we set $P(s_1,1)$ to the observed frequency of the state $s_1$ at position 1.
Since the likelihood $P(s)$ is generally very small, we use
$-\log P(s)$ as the sorting criterion. This quantity is minimal, and equal to 0,
when $P(s)$ is equal to 1, so the sequences are sorted in
ascending order of their score.
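
The computation of $-\log P(s)$ can be sketched in base R as follows. The toy sequences are made up, and the sketch assumes that the transition rates are pooled over all positions (a time-homogeneous first-order model); it is an illustration of the formula, not the package's code.

```r
# Toy set of four sequences of length 3 (rows = sequences, columns = positions).
seqs <- matrix(c("A", "A", "B",
                 "A", "B", "B",
                 "A", "A", "B",
                 "B", "B", "B"),
               nrow = 4, byrow = TRUE)
states <- sort(unique(as.vector(seqs)))
n_pos <- ncol(seqs)

# P(s_1, 1): observed frequency of each state at position 1.
p1 <- sapply(states, function(a) mean(seqs[, 1] == a))

# Transition rates p(s_t | s_{t-1}), pooled over all pairs of successive positions.
from <- factor(as.vector(seqs[, -n_pos]), levels = states)
to   <- factor(as.vector(seqs[, -1]), levels = states)
trans <- prop.table(table(from, to), margin = 1)

# -log P(s) for each sequence.
neg_log_lik <- apply(seqs, 1, function(s) {
  p <- p1[[s[1]]]
  for (t in 2:n_pos) p <- p * trans[s[t - 1], s[t]]
  -log(p)
})

# Ascending order: the smallest score is the most likely sequence.
order(neg_log_lik)
```

With these data the second sequence (A, B, B) has the largest likelihood and is ranked first.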
For more details, see Gabadinho et al., 2009.

See also: seqplot, plot.stslist.rep
## Defining a sequence object with the data in columns 10 to 25
## (family status from age 15 to 30) in the biofam data set
library(TraMineR)
data(biofam)
biofam.lab <- c("Parent", "Left", "Married", "Left+Marr",
"Child", "Left+Child", "Left+Marr+Child", "Divorced")
biofam.seq <- seqdef(biofam, 10:25, labels=biofam.lab)
## Computing the distance matrix
costs <- seqsubm(biofam.seq, method="TRATE")
biofam.om <- seqdist(biofam.seq, method="OM", sm=costs)
## Representative set using the neighborhood density criterion
biofam.rep <- seqrep(biofam.seq, dist.matrix=biofam.om, criterion="density")
biofam.rep
summary(biofam.rep)
plot(biofam.rep)