See plot.stslist.rep for the plot method and seqplot for other plot options.

seqrep(seqdata, criterion="density", score=NULL,
       decreasing=TRUE, trep=0.25, nrep=NULL,
       tsim=0.1, dmax=NULL, dist.matrix=NULL, weighted=TRUE, ...)
[…] seqdef function.

"freq" (sequence frequency), "density" (neighborhood density), "mscore" (mean state frequency), "dist" (centrality) a[…]

TRUE, i.e. descending. The first object in the candidate list is then s[…]

[…] tsim).

NULL (default), the size of the representative set is controlled by trep.

[…] dmax. Defaults to 0.1 (10%). Sequence $y$ is redundant to sequence $x$ when it is in the neighborhood of $x$, i.e., within a distance tsim*dmax from $x$ […]

[…] tsim*dmax. If NULL, the value of dmax is derived from the dissimilarity matrix.

[…] seqdata. If NULL, the matrix is computed by calling the seqdist function. In that case, optional arguments to be pass[…]

[…] seqdef.) Set as FALSE to ignore the weights.

[…] seqdist function, mainly dist.method specifying the metric for computing the distance matrix, norm for normalizing the distances, indel and sm f[…]

[…] stslist.rep. This is actually a state sequence object (containing a list of state sequences) with the following additional attributes: […]

[…] seqplot function using the type="r" argument, or the seqrplot
alias.

[…] score argument or by specifying one of the following as criterion argument: "freq" (sequence frequency), "density" (neighborhood density), "mscore" (mean state frequency), "dist" (centrality) and "prob" (sequence likelihood).
With the sequence frequency criterion, the more frequent a
sequence the more representative it is supposed to be. Therefore, sequences are sorted in decreasing frequency order.
The neighborhood density is the number (density) of sequences in the neighborhood of the
sequence. This requires setting the neighborhood radius tsim. Sequences are
sorted in decreasing density order.
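
The neighborhood density criterion can be illustrated with a small base R sketch. It is not the package's implementation: the distance matrix d is made up, dmax is assumed to be the largest observed distance, the neighborhood is assumed to include sequences at exactly the radius, and each sequence is counted in its own neighborhood.

```r
# Hypothetical pairwise distances between 4 sequences (symmetric, zero diagonal).
d <- matrix(c( 0, 1, 2, 10,
               1, 0, 1,  9,
               2, 1, 0,  8,
              10, 9, 8,  0), ncol = 4, byrow = TRUE)

tsim <- 0.15           # neighborhood radius as a share of dmax
dmax <- max(d)         # assumption: dmax taken as the largest observed distance
radius <- tsim * dmax  # neighborhood radius tsim*dmax

# Number of sequences within the radius of each sequence (itself included).
density <- rowSums(d <= radius)

# Candidate list: sequence indices sorted in decreasing density order.
candidates <- order(density, decreasing = TRUE)
```

With these toy distances the second sequence has the densest neighborhood and therefore comes first in the candidate list.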
The mean state frequency criterion is the mean value of the transversal frequencies of the successive states.
Let $s=s_{1}s_{2}\cdots s_{\ell}$ be a sequence of length $\ell$ and $(f_{s_1},
f_{s_2}, \ldots, f_{s_\ell})$ the frequencies of the states at (time-)positions $(t_1,
t_2,\ldots, t_{\ell})$. The mean state frequency is the sum of the state frequencies divided by the
sequence length
$$MSF(s)=\frac{1}{\ell} \sum_{i=1}^{\ell} f_{s_{i}}$$
The lower and upper boundaries of $MSF$ are $0$ and $1$. $MSF$ is equal to $1$ when all the sequences
in the set are identical, i.e. when there is a single sequence pattern. The most representative sequence is the one with
the highest score.
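
As an illustration of the $MSF$ formula, here is a minimal base R sketch on a toy set of four unweighted sequences of length 3. The matrix seqs and the variable names are made up for the example, and the code is not taken from the package.

```r
# Toy data: 4 sequences of length 3 (rows = sequences, columns = positions).
seqs <- matrix(c("A", "A", "B",
                 "A", "B", "B",
                 "A", "A", "B",
                 "B", "B", "B"),
               ncol = 3, byrow = TRUE)

# f_{s_i}: for each position i, the share of sequences that are in the
# same state as the sequence itself at that position.
state_freq <- sapply(seq_len(ncol(seqs)), function(i) {
  col <- seqs[, i]
  as.numeric(table(col)[col]) / nrow(seqs)
})

# MSF(s): mean of the position-wise state frequencies along each sequence.
msf <- rowMeans(state_freq)
```

Here the first three sequences obtain $MSF = 0.75$ and the last one about $0.58$, so the last sequence would be ranked as the least representative.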
The centrality criterion is the sum of distances to all other sequences. The
smaller the sum, the more representative the sequence.
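
The centrality score and the resulting medoid can be sketched in base R as follows. The distance matrix d is hypothetical and weights are ignored; this only illustrates the definition and is not the package's code.

```r
# Hypothetical pairwise distances between 3 sequences.
d <- matrix(c(0, 1, 4,
              1, 0, 2,
              4, 2, 0), ncol = 3, byrow = TRUE)

# Centrality score: sum of distances to all other sequences
# (the zero diagonal does not contribute).
centrality <- rowSums(d)

# The medoid is the sequence with the smallest sum of distances.
medoid <- which.min(centrality)
```

In this toy example the sums are 5, 3 and 6, so the second sequence is the medoid.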
The sequence likelihood $P(s)$ is defined as the product of the probabilities with which each of its observed
successive states is supposed to occur at its position.
Let $s=s_{1}s_{2} \cdots s_{\ell}$ be a sequence of length $\ell$. Then
$$P(s)=P(s_{1},1) \cdot P(s_{2},2) \cdots P(s_{\ell},\ell)$$
with $P(s_{t},t)$ the probability of observing state $s_t$ at position $t$.
The question is how to determine the state probabilities $P(s_{t},t)$. One commonly used method for
computing them is to postulate a Markov chain model, which can be of various orders. The implemented criterion considers the
probabilities derived from the first order Markov model, that is, each $P(s_{t},t)$, $t>1$, is set to the
transition rate $p(s_t|s_{t-1})$ estimated across sequences from the observations at positions $t$
and $t-1$. For $t=1$, we set $P(s_1,1)$ to the observed frequency of the state $s_1$ at position 1.
Since the likelihood $P(s)$ is generally very small, we use
$-\log P(s)$ as the sorting criterion. This quantity reaches its minimum, $0$, when
$P(s)$ equals 1, so the sequences are sorted in
ascending order of their score.
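
The $-\log P(s)$ score described above can be sketched in base R on a toy set of unweighted sequences. The data and variable names are made up, the transition rates are estimated separately for each pair of positions $t-1$ and $t$ as the text describes, and the code is an illustration, not the package's implementation.

```r
# Toy data: 4 sequences of length 3 (rows = sequences, columns = positions).
seqs <- matrix(c("A", "A", "B",
                 "A", "B", "B",
                 "A", "A", "B",
                 "B", "B", "B"),
               ncol = 3, byrow = TRUE)
n <- nrow(seqs)
len <- ncol(seqs)

# p[k, pos] will hold P(s_pos, pos) for sequence k.
p <- matrix(NA_real_, nrow = n, ncol = len)

# Position 1: observed frequency of each sequence's first state.
p[, 1] <- as.numeric(table(seqs[, 1])[seqs[, 1]]) / n

# Positions 2..len: transition rate p(s_t | s_{t-1}) estimated from the
# observations at positions t-1 and t across all sequences.
for (pos in 2:len) {
  prev <- seqs[, pos - 1]
  cur <- seqs[, pos]
  trans <- unclass(prop.table(table(prev, cur), margin = 1))
  p[, pos] <- trans[cbind(prev, cur)]
}

# Sorting score: -log P(s); smaller means more likely.
neg_log_lik <- -rowSums(log(p))
ranking <- order(neg_log_lik)
```

In this example $P(s)$ is 0.5 for the first and third sequences and 0.25 for the other two, so the first and third sequences come first in ascending order of $-\log P(s)$.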
Use criterion="dist" and nrep=1 to get the medoid and criterion="density" and nrep=1
to get the densest sequence pattern.
For more details, see Gabadinho & Ritschard, 2013.

seqplot, plot.stslist.rep, dissrep, disscenter
## Defining a sequence object with the data in columns 10 to 25
## (family status from age 15 to 30) in the biofam data set
library(TraMineR)
data(biofam)
biofam.lab <- c("Parent", "Left", "Married", "Left+Marr",
"Child", "Left+Child", "Left+Marr+Child", "Divorced")
biofam.seq <- seqdef(biofam, 10:25, labels=biofam.lab)
## Computing the distance matrix
costs <- seqsubm(biofam.seq, method="TRATE")
biofam.om <- seqdist(biofam.seq, method="OM", sm=costs)
## Representative set using the neighborhood density criterion
biofam.rep <- seqrep(biofam.seq, dist.matrix=biofam.om, criterion="density")
biofam.rep
summary(biofam.rep)
plot(biofam.rep)