seqdist: Distances (dissimilarities) between sequences

Description

Computes pairwise dissimilarities between sequences, or dissimilarities to a reference sequence. Several dissimilarity measures or metrics are available: optimal matching (OM), distances based on the longest common prefix (LCP), the longest common suffix (RLCP) and the longest common subsequence (LCS), the Hamming distance (HAM) and the Dynamic Hamming Distance (DHD).

Usage

seqdist(seqdata, method, refseq=NULL, norm=FALSE,
     indel=1, sm=NA, with.missing=FALSE, full.matrix=TRUE)

Arguments

seqdata

a state sequence object defined with the seqdef function.

method

a character string indicating the metric to be used. One of "OM" (Optimal Matching), "LCP" (Longest Common Prefix), "RLCP" (reversed LCP, i.e. Longest Common Suffix), "LCS" (Longest Common Subsequence), "HAM" (Hamming distance) or "DHD" (Dynamic Hamming Distance).

refseq

Optional baseline sequence to compute the distances from. Can be the index of a sequence in the state sequence object, 0 for the most frequent sequence, or an external sequence passed as a sequence object with 1 row.

norm

if TRUE, the computed OM, LCP, RLCP or LCS distances are normalized to account for differences in sequence lengths, and the normalization method is selected automatically. Default is FALSE. Can also be one of "none", "maxlength", "gmean" or "maxdist". See details.

indel

the insertion/deletion cost (OM method). Default is 1. Ignored with non OM metrics.

sm

substitution-cost matrix (OM, HAM and DHD methods). Can also be one of the seqsubm build methods "TRATE" or "CONSTANT". Default is NA. Ignored with LCP, RLCP and LCS metrics.

with.missing

must be set to TRUE when sequences contain non deleted gaps (missing values). See details.

full.matrix

If TRUE (default), the full distance matrix is returned. This is for compatibility with earlier versions of the seqdist function. If FALSE, an object of class dist is returned.
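
As a rough illustration of how a full matrix and an object of class dist relate, the sketch below uses base R only. The toy matrix `full` and its values are invented for the example and are not produced by seqdist.

```r
# Toy symmetric distance matrix for three sequences
# (made-up values, for illustration only).
labels <- paste0("s", 1:3)
full <- matrix(c(0, 2, 4,
                 2, 0, 6,
                 4, 6, 0),
               nrow = 3, byrow = TRUE,
               dimnames = list(labels, labels))

# Compact representation: only one triangle of the matrix is stored.
compact <- as.dist(full)

# n * (n - 1) / 2 values are stored instead of n * n.
n.stored <- length(compact)

# Converting back recovers the full symmetric matrix.
restored <- as.matrix(compact)
```

The compact form is the one that base R functions such as hclust expect as input.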

Value

When refseq is specified, a vector with distances between the sequences in the data sequence object and the reference sequence is returned. When refseq is NULL (default), the whole matrix of pairwise distances between sequences is returned.

Details

The seqdist function returns a matrix of distances between sequences or a vector of distances to a reference sequence. The available metrics (see the method argument) are optimal matching ("OM"), longest common prefix ("LCP"), longest common suffix ("RLCP"), longest common subsequence ("LCS"), Hamming distance ("HAM") and Dynamic Hamming Distance ("DHD").

The Hamming distance is OM without indels, and the Dynamic Hamming Distance is HAM with position-specific substitution costs, as proposed by Lesnard (2006). Note that HAM and DHD apply only to sequences of equal length.

For OM, HAM and DHD, a user-specified substitution-cost matrix can be provided with the sm argument. For DHD, this should be a series of matrices stacked in a 3-dimensional array, with the third index referring to the position in the sequence. When sm is not specified, a constant substitution cost of 1 is used with HAM, and the costs proposed by Lesnard (2006) are used with DHD.

Distances can optionally be normalized by means of the norm argument. If set to TRUE, Elzinga's normalization (similarity divided by the geometric mean of the two sequence lengths) is applied to LCP, RLCP and LCS distances, while Abbott's normalization (distance divided by the length of the longer sequence) is used for OM, HAM and DHD. Elzinga's method can be forced with "gmean" and Abbott's rule with "maxlength". With "maxdist", the distance is normalized by its maximal possible value. For more details, see Elzinga (2008) and Gabadinho et al. (2009).

When sequences contain gaps and the gaps=NA option was passed to seqdef, i.e. when there are non-deleted missing values, the with.missing argument should be set to TRUE. If left at FALSE, the function stops when it encounters a gap, so that the user is made aware that the sequences contain gaps. If the OM method is selected, seqdist expects a substitution-cost matrix with a row and a column entry for the missing state (the symbol defined with the nr option of seqdef). This will be the case for substitution-cost matrices returned by seqsubm. More details on how to compute distances with sequences containing gaps are given in Gabadinho et al. (2009).
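
The definitions above can be sketched in base R on toy data. This is an illustration only, not the package's implementation. The helper names (`hamming`, `dhd`, `lcp_length`), the toy sequences and the cost values are hypothetical. Two points are assumptions not stated on this page: the LCP distance is taken as the two lengths summed minus twice the common prefix length, and the "gmean" similarity is turned into a distance by subtracting it from 1.

```r
# Toy sequences as character vectors of states (invented data).
x <- c("A", "A", "B", "C")
y <- c("A", "A", "C", "C")
states <- c("A", "B", "C")

# Hamming distance with a constant substitution cost:
# cost times the number of positions where the sequences differ.
hamming <- function(x, y, cost = 1) {
  stopifnot(length(x) == length(y))
  cost * sum(x != y)
}

# DHD-style distance: sm is a 3-dimensional array indexed
# [state, state, position], one cost matrix per position.
dhd <- function(x, y, sm) {
  stopifnot(length(x) == length(y), dim(sm)[3] == length(x))
  sum(vapply(seq_along(x), function(t) sm[x[t], y[t], t], numeric(1)))
}

# Hypothetical costs: 1 everywhere off the diagonal,
# except position 3 where a substitution costs 2.
base.cost <- 1 - diag(3)
sm <- array(base.cost, dim = c(3, 3, 4),
            dimnames = list(states, states, NULL))
sm[, , 3] <- 2 * base.cost

# Length of the longest common prefix.
lcp_length <- function(x, y) {
  n <- min(length(x), length(y))
  same <- x[seq_len(n)] == y[seq_len(n)]
  if (all(same)) n else which(!same)[1] - 1
}

d.ham <- hamming(x, y)       # one differing position
d.dhd <- dhd(x, y, sm)       # same position, cost 2
a.lcp <- lcp_length(x, y)    # common prefix "A", "A"

# Assumed LCP distance: |x| + |y| - 2 * prefix length.
d.lcp <- length(x) + length(y) - 2 * a.lcp

# Abbott-style normalization: distance over the longer length.
d.ham.maxlength <- d.ham / max(length(x), length(y))

# Elzinga-style normalization: similarity over the geometric mean
# of the lengths, converted to a distance (assumption: 1 - ratio).
d.lcp.gmean <- 1 - a.lcp / sqrt(length(x) * length(y))
```

With these toy inputs the two sequences differ only at position 3, so the Hamming distance counts one substitution while the position-specific variant charges that substitution at the higher cost assigned to position 3.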

References

Elzinga, Cees H. (2008). Sequence analysis: Metric representations of categorical time series. Technical Report, Department of Social Science Research Methods, Vrije Universiteit, Amsterdam.

Gabadinho, A., G. Ritschard, N. S. Müller and M. Studer (2011). Analyzing and Visualizing State Sequences in R with TraMineR. Journal of Statistical Software 40(4), 1-37.

Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2009). Mining Sequence Data in R with the TraMineR package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.

Lesnard, L. (2006). Optimal Matching and Social Sciences. Série des Documents de Travail du CREST, Institut National de la Statistique et des Etudes Economiques, Paris.

Examples

## optimal matching distances with substitution cost matrix
## derived from transition rates
library(TraMineR)
data(biofam)
biofam.seq <- seqdef(biofam, 10:25)
costs <- seqsubm(biofam.seq, method="TRATE")
biofam.om <- seqdist(biofam.seq, method="OM", indel=3, sm=costs)

## normalized LCP distances
biofam.lcp <- seqdist(biofam.seq, method="LCP", norm=TRUE)

## normalized LCS distances to the most frequent sequence in the data set
biofam.lcs <- seqdist(biofam.seq, method="LCS", refseq=0, norm=TRUE)

## histogram of the normalized LCS distances
hist(biofam.lcs)

## =====================
## Example with missings
## =====================
data(ex1)
ex1.seq <- seqdef(ex1,1:13)

subm <- seqsubm(ex1.seq, method="TRATE", with.missing=TRUE)
ex1.om <- seqdist(ex1.seq, method="OM", sm=subm, with.missing=TRUE)
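
The three return shapes described under Value and full.matrix can be compared on a small subset. This is a sketch that assumes the TraMineR package is installed and behaves as documented on this page. The subset of 100 cases, the object names and the choice of average linkage are arbitrary choices made for the illustration.

```r
library(TraMineR)

data(biofam)
## keep the first 100 cases so the example runs quickly (arbitrary choice)
small.seq <- seqdef(biofam[1:100, ], 10:25)

## full matrix of pairwise LCS distances
d.full <- seqdist(small.seq, method = "LCS")

## vector of LCS distances to the first sequence of the subset
d.ref <- seqdist(small.seq, method = "LCS", refseq = 1)

## compact object of class dist, usable directly for clustering
d.dist <- seqdist(small.seq, method = "LCS", full.matrix = FALSE)
hc <- hclust(d.dist, method = "average")
```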
