seqemlt: Euclidean Coordinates for Longitudinal Timelines

Description

Computes the Euclidean coordinates of sequences from which we get the EMLT distance between sequences introduced in Rousset et al (2012).

Usage

seqemlt(seqdata, a = 1, b = 1, weighted = TRUE)

Arguments

seqdata

a state sequence object defined with the seqdef function.

optional argument for the weighting mechanism that controls the balancing between short term/long term transitions. The weighting function is \(1/(a*s+b)\) where \(s\) is the transition step.

see argument a.

weighted

Logical: Should weights in the sequence object seqdata be used?

Value

An object of class emlt with the following components:

coord

Matrix with in each row the EMLT numerical coordinates of the corresponding sequence.

states

list of states

situations

list of situations (timestamped states)

sit.freq

Situation frequencies

sit.transrate

matrix of transition probabilities from each situation to future situations

sit.profil

profiles of situations. Each profile is the normalized vector of transition probabilities to future situations adjusted to down weight transitions over longer periods.

sit.cor

Matrix of correlations between situations. Two situations are highly correlated when their profiles are similar (i.e., when their transitions towards future are similar).

Details

The EMLT distance is the sum of the dissimilarity between the pairs of states observed at the successive positions, where the dissimilarity between states is defined at each position as the Chi-squared distance between the normalized vectors of transition probabilities (profiles of situations) from the current state to the next observed states in the sequence. Transition probabilities are down-weighted with the time distance to avoid exaggerated importance of transitions over long periods. The adjustment weight is \(1/a*s+b\), where \(s\) is the period length over which the transition probability is measured.

The EMLT distance between two sequences is obtained as the Euclidean distance between the returned numerical sequence coordinates. So, providing coord as the data input to any clustering algorithm that uses the Euclidean metric is equivalent to cluster with the EMLT metric.

Each time-indexed state is called a situation, and the distance between two states at a position \(t\) is derived from the transition probabilities to other observed situations. The distance between any situation and a situation that does not occur is coded as NA. Such non-occurring situations have no influence on the distance between sequences.

The obtained numerical representations of sequences may be used as input to any Euclidean algorithm (clustering algorithms, ...).

References

Rousset, Patrick and Jean-Fran<U+00E7>ois Giret (2007), Classifying Qualitative Time Series with SOM: The Typology of Career Paths in France, in F. Sandoval, A. Prieto and M. Grana (Eds) Computational and Ambient Intelligence, Lecture Notes in Computer science, vol 4507, Berlin: Springer, pp 757-764.

Rousset, Patrick, Jean-Fran<U+00E7>ois Giret and Yvette Grelet (2012) Typologies De Parcours et Dynamique Longitudinale, Bulletin de m<U+00E9>thodologie sociologique, 114(1), 5-34.

Rousset, Patrick and Jean-Fran<U+00E7>ois Giret (2008) A longitudinal Analysis of Labour Market Data with SOM, in J. Rabu<U+00F1>al Dopico, J. Dorado, & A. Pazos (Eds.) Encyclopedia of Artificial Intelligence, Hershey, PA: Information Science Reference, pp 1029-1035.

Studer, Matthias and Gilbert Ritschard (2014) A comparative review of sequence dissimilarity measures. LIVES Working Paper, 33 http://dx.doi.org/10.12682/lives.2296-1658.2014.33

Examples

Run this code

# NOT RUN {
data(mvad)
mvad.seq <- seqdef(mvad[1:100, 17:41])
alphabet(mvad.seq)
head(labels(mvad.seq))
## Computing distance
mvad.emlt <- seqemlt(mvad.seq)

## typology1 with kmeans in 3 clusters
km <- kmeans(mvad.emlt$coord, 3)

##Plotting by clusters of typology1
seqdplot(mvad.seq, group=km$cluster)

## typology2: 3 clusters by applying hierarchical ward
##   on the centers of the 25 group kmeans solution
km<-kmeans(mvad.emlt$coord, 25)
hc<-hclust(dist(km$centers, method="euclidean"), method="ward")
zz<-cutree(hc, k=3)

##Plotting by clusters of typology2
seqdplot(mvad.seq, group=zz[km$cluster])

## Plotting the evolution of the correlation between states
plot(mvad.emlt, from="employment", to="joblessness", type="cor")
plot(mvad.emlt, from=c("employment","HE", "school", "FE"), to="joblessness", delay=0, leg=TRUE)
plot(mvad.emlt, from="joblessness", to="employment", delay=6)
plot(mvad.emlt, type="pca", cex=0.4, compx=1, compy=2)

# }

Run the code above in your browser using DataLab