seqrep: Extracting sets of representative sequences

Description

The function attempts to find an optimal set of representative sequences, i.e. a set that is as small as possible while ensuring a large coverage, that exhibits the key features of the whole sequence data set. The goal is an easy and sound interpretation of the latter.

Usage

seqrep(seqdata, criterion="density", score=NULL,
    decreasing=TRUE, trep=0.25, nrep=NULL,
    tsim=0.1, dmax=NULL, dist.matrix=NULL, ...)

Arguments

seqdata

a state sequence object as defined by the seqdef function.

criterion

the representativeness criterion for sorting the candidate list. One of "freq" (sequence frequency), "density" (neighborhood density), "mscore" (mean state frequency), "dist" (centrality) a

score

an optional vector containing the representativeness scores used to sort the sequences in the candidate list. The length of the vector must be equal to the number of sequences in the sequence object.

decreasing

if a score vector is provided, indicates whether the objects in the candidate list must be sorted in ascending or descending order of this score. Default is TRUE, i.e. descending. The first object in the candidate list is then supposed to be

trep

coverage threshold, i.e. minimum proportion of sequences that should have a representative in their neighborhood (neighborhood diameter is defined by tsim).

nrep

number of representative sequences. If NULL (default), the size of the representative set is controlled by trep.

tsim

threshold for setting the redundancy and neighborhood radius. Defined as a percentage of the maximum (theoretical) distance. Defaults to 0.1 (10%). Sequence $y$ is considered as redundant to/in the neighborhood of sequence $x$ if the distance from $y$

dmax

maximum theoretical distance. The neighborhood diameter is defined as a proportion of this maximum theoretical distance. If NULL, it is derived from the distance matrix.

dist.matrix

a matrix containing the pairwise distances between sequences in seqdata. If NULL, the matrix is computed by calling the seqdist function. In that case, optional arguments to

...

optional arguments to be passed to the seqdist function, mainly dist.method specifying the metric for computing the distance matrix, norm for normalizing the distances, indel and sm f

Value

An object of class stslist.rep. This is actually a state sequence object (containing a list of state sequences) with the following additional attributes:

Scores

a vector with the representativeness score of each sequence in the original set given the chosen criterion.

Distances

a matrix with the distance of each sequence to its nearest representative.

Statistics

contains several quality measures for each representative sequence in the set: number of sequences attributed to the representative, number of sequences in the representative's neighborhood, mean distance to the representative.

Quality

overall quality measure.

print, plot and summary methods are available. More elaborate plots are produced by the seqplot function using the type="r" argument, or the seqrplot alias.

Details

The representative set is obtained by a heuristic that first builds a sorted list of candidates using a representativeness score and then eliminates redundancy. The available criteria for sorting the candidate list are: sequence frequency, neighborhood density, mean state frequency, centrality and sequence likelihood.

The sequence frequency criterion uses the sequence frequencies as representativeness score. The more frequent a sequence, the more representative it is supposed to be. Hence, sequences are sorted in decreasing frequency order.

The neighborhood density criterion uses the number (density) of sequences in the neighborhood of each candidate sequence. This requires setting the neighborhood diameter tsim. We suggest setting it as a given proportion of the maximal theoretical distance between two sequences. Sequences are sorted in decreasing density order.

The mean state frequency criterion is the mean value of the transversal frequencies of the successive states. Let $s=s_{1}s_{2}\cdots s_{\ell}$ be a sequence of length $\ell$ and $(f_{s_1}, f_{s_2}, \ldots, f_{s_\ell})$ the frequencies of the states at (time-)positions $(t_1, t_2, \ldots, t_{\ell})$. The mean state frequency is the sum of the state frequencies divided by the sequence length:

$$MSF(s)=\frac{1}{\ell} \sum_{i=1}^{\ell} f_{s_{i}}$$

The lower and upper bounds of $MSF$ are $0$ and $1$. $MSF$ is equal to $1$ when all the sequences in the set are the same, i.e. when there is a single distinct sequence. The most representative sequence is the one with the highest score.

The centrality criterion uses the sum of distances to all other sequences as a representativeness criterion. The smaller the sum, the more representative the sequence.

The sequence likelihood $P(s)$ is defined as the product of the probabilities with which each of its observed successive states is supposed to occur at its position. Let $s=s_{1}s_{2} \cdots s_{\ell}$ be a sequence of length $\ell$. Then

$$P(s)=P(s_{1},1) \cdot P(s_{2},2) \cdots P(s_{\ell},\ell)$$

with $P(s_{t},t)$ the probability of observing state $s_t$ at position $t$. The question is how to determine the state probabilities $P(s_{t},t)$. One commonly used method for computing them is to postulate a Markov model, which can be of various orders. The implemented criterion considers the probabilities derived from the first-order Markov model, that is, each $P(s_{t},t)$, $t>1$, is set to the transition rate $p(s_t|s_{t-1})$ estimated across sequences from the observations at positions $t$ and $t-1$. For $t=1$, we set $P(s_1,1)$ to the observed frequency of the state $s_1$ at position 1.

The likelihood $P(s)$ being generally very small, we use $-\log P(s)$ as sorting criterion. This quantity reaches its minimum, $0$, when $P(s)$ is equal to $1$, so the sequences are sorted in ascending order of their score. For more details, see Gabadinho et al., 2009.
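The two formula-based scores above can be illustrated on a toy data set. The following is a minimal base R sketch, not the package's implementation: the matrix `seqs`, the helper names and the choice to pool transitions over all positions when estimating the transition rates are assumptions made for this illustration.

```r
# Toy data (hypothetical): 4 sequences (rows) of length 3 (columns).
seqs <- rbind(
  c("A", "A", "B"),
  c("A", "A", "B"),
  c("A", "B", "B"),
  c("B", "B", "B")
)
n <- nrow(seqs)
len <- ncol(seqs)
states <- sort(unique(as.vector(seqs)))

# Transversal state frequencies: freq[state, position].
freq <- vapply(seq_len(len), function(pos) {
  as.numeric(table(factor(seqs[, pos], levels = states))) / n
}, numeric(length(states)))
rownames(freq) <- states

# Mean state frequency: average, over positions, of the frequency
# of the state the sequence occupies at that position.
msf <- vapply(seq_len(n), function(i) {
  s <- match(seqs[i, ], states)
  mean(freq[cbind(s, seq_len(len))])
}, numeric(1))

# First-order transition rates p(to | from).
# Assumption: transitions are pooled over all positions.
from <- factor(as.vector(seqs[, -len]), levels = states)
to <- factor(as.vector(seqs[, -1]), levels = states)
counts <- table(from, to)
trate <- counts / rowSums(counts)

# -log P(s): frequency of the first state, then transition rates.
neglogp <- vapply(seq_len(n), function(i) {
  s <- match(seqs[i, ], states)
  p <- freq[s[1], 1]
  for (pos in 2:len) p <- p * trate[s[pos - 1], s[pos]]
  -log(as.numeric(p))
}, numeric(1))

msf      # higher is more representative
neglogp  # lower is more representative
```

In this toy set the first three sequences tie on the mean state frequency, while the likelihood criterion singles out the third one, which shows that the criteria need not agree on the ordering of candidates.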

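The overall heuristic (sort the candidates, then eliminate redundancy until the coverage threshold is reached) can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the code of seqrep: the function `select_reps`, the toy distance matrix, the strict inequality used for the neighborhood test and the stopping rule are all hypothetical choices.

```r
# Toy distance matrix between 5 sequences (hypothetical values).
d <- as.matrix(dist(c(0, 1, 2, 10, 11)))

# Simplified selection: walk through the candidates in decreasing
# score order, skip those lying in the neighborhood of an already
# selected representative, stop once the coverage reaches trep.
select_reps <- function(d, score, trep = 0.25, tsim = 0.1, dmax = max(d)) {
  radius <- tsim * dmax  # neighborhood size as a share of dmax
  candidates <- order(score, decreasing = TRUE)
  reps <- integer(0)
  for (cand in candidates) {
    redundant <- length(reps) > 0 && any(d[cand, reps] < radius)
    if (redundant) next
    reps <- c(reps, cand)
    # A sequence is covered if some representative is in its neighborhood.
    covered <- apply(d[, reps, drop = FALSE] < radius, 1, any)
    if (mean(covered) >= trep) break
  }
  reps
}

# Neighborhood density score (each sequence counts itself).
density <- rowSums(d < 0.1 * max(d))

reps <- select_reps(d, density, trep = 0.8)
reps
```

With a coverage threshold of 0.8, the densest sequence alone covers only three of the five sequences, so a second, non-redundant representative is added from the other group.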
References

Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2009). Summarizing Sets of Categorical Sequences. In International Conference on Knowledge Discovery and Information Retrieval, Madeira, 6-8 October, INSTICC.

Examples

## Defining a sequence object with the data in columns 10 to 25
## (family status from age 15 to 30) in the biofam data set
library(TraMineR)
data(biofam)
biofam.lab <- c("Parent", "Left", "Married", "Left+Marr",
    "Child", "Left+Child", "Left+Marr+Child", "Divorced")
biofam.seq <- seqdef(biofam, 10:25, labels=biofam.lab)

## Computing the distance matrix
costs <- seqsubm(biofam.seq, method="TRATE")
biofam.om <- seqdist(biofam.seq, method="OM", sm=costs)

## Representative set using the neighborhood density criterion
biofam.rep <- seqrep(biofam.seq, dist.matrix=biofam.om, criterion="density")
biofam.rep
summary(biofam.rep)
plot(biofam.rep)