seqdist: Distances between sequences

Description

Compute pairwise distances between sequences or distances to a reference sequence. Several metrics are available: optimal matching (OM) and other metrics such as the longest common prefix (LCP), the longest common suffix (RLCP), the longest common subsequence (LCS), the Hamming distance (HAM) and the Dynamic Hamming Distance (DHD).

Usage

seqdist(seqdata, method, refseq=NULL, norm=FALSE, 
	indel=1, sm, with.miss = FALSE, full.matrix = TRUE)

Arguments

seqdata

a state sequence object defined with the seqdef function.

method

a character string indicating the metric to be used. One of "OM" (Optimal Matching), "LCP" (Longest Common Prefix), "RLCP" (reversed LCP, i.e. Longest Common Suffix), "LCS" (Longest Common Subsequence), "HAM" (Hamming distance), "DHD" (Dynamic Hamming distance).

refseq

Optional reference sequence to compute the distances from. Can be the index of a sequence in the state sequence object or 0 for the most frequent sequence, or an external sequence passed as a sequence object with 1 row.

norm

if TRUE, the computed OM, LCP, RLCP or LCS distances are normalized to account for differences in sequence lengths. Default is FALSE. See Details.

indel

the insertion/deletion cost (OM method). Default is 1. Ignored with non-OM metrics.

sm

substitution-cost matrix (OM, HAM and DHD methods). Default is NA. Ignored with LCP, RLCP and LCS metrics.
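As a rough illustration of the shapes involved, the sketch below builds a constant-cost substitution matrix and, for DHD, a 3-dimensional array whose third index is the position in the sequence (see Details). It uses base R only. The state labels, the sequence length and the cost values are invented for the example; in practice such matrices are typically obtained from seqsubm.

```r
# Hypothetical alphabet and sequence length, chosen only for illustration
states <- c("A", "B", "C")
n_pos <- 4

# Constant substitution costs: 2 off the diagonal, 0 on the diagonal
sm_const <- matrix(2, nrow = length(states), ncol = length(states),
                   dimnames = list(states, states))
diag(sm_const) <- 0

# For DHD: one cost matrix per position, stacked along the third dimension
sm_dhd <- array(0, dim = c(length(states), length(states), n_pos),
                dimnames = list(states, states, NULL))
for (pos in seq_len(n_pos)) {
  # made-up rule for the example: substitutions get cheaper at later positions
  sm_dhd[, , pos] <- sm_const / pos
}

dim(sm_dhd)          # states x states x positions
sm_dhd["A", "B", 2]  # cost of substituting A by B at position 2
```

Naming the rows and columns after the states makes it easier to check that the matrix matches the alphabet of the sequence object.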

with.miss

must be set to TRUE when sequences contain non-deleted gaps (missing values). See Details.

full.matrix

If TRUE (default), the full distance matrix is returned. This is for compatibility with earlier versions of the seqdist function. If FALSE, an object of class dist is returned, that is, a vector

Value

When refseq is specified, a vector with distances between the sequences in the data sequence object and the reference sequence is returned. When refseq is NULL (default), the whole matrix of pairwise distances between sequences is returned.
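The relation between a full distance matrix and the more compact dist object selected with full.matrix=FALSE can be sketched in base R. The matrix below is toy data, not seqdist output, and the object names are hypothetical.

```r
# Toy symmetric distance matrix for three sequences (values are made up)
labels <- c("s1", "s2", "s3")
d_full <- matrix(c(0, 2, 4,
                   2, 0, 6,
                   4, 6, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(labels, labels))

# Compact form: a 'dist' object stores one triangle only
d_compact <- as.dist(d_full)
length(d_compact)   # 3 stored values instead of 9

# Convert back to a full matrix when one is needed
d_back <- as.matrix(d_compact)
```

Base R clustering functions such as hclust take a dist object as input, so the compact form can be handed to them without conversion.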

Details

The seqdist function returns a matrix of distances between sequences or a vector of distances to a reference sequence. The available metrics (see the method argument) are optimal matching ("OM"), longest common prefix ("LCP"), longest common suffix ("RLCP"), longest common subsequence ("LCS"), Hamming distance ("HAM") and Dynamic Hamming Distance ("DHD"). The Hamming distance is OM without indels, and the Dynamic Hamming Distance is HAM with position-specific substitution costs, as proposed by Lesnard (2006). Note that HAM and DHD apply only to sequences of equal length.

For OM, HAM and DHD, a user-specified substitution cost matrix can be provided with the sm argument. For DHD, this should be a series of matrices grouped in a 3-dimensional array, with the third index referring to the position in the sequence. When sm is not specified, a constant substitution cost of 1 is used with HAM, and Lesnard's (2006) proposal is used for DHD.

Distances can optionally be normalized by means of the norm argument. If set to TRUE, Elzinga's normalization (similarity divided by the geometric mean of the two sequence lengths) is applied to LCP, RLCP and LCS distances, while Abbott's normalization (distance divided by the length of the longer sequence) is used for OM, HAM and DHD. For more details, see Elzinga (2008) and Gabadinho et al. (2009).

When sequences contain gaps and the gaps=NA option was passed to seqdef, i.e. when there are non-deleted missing values, the with.miss argument should be set to TRUE. If left at FALSE, the function stops when it encounters a gap. This makes the user aware that the sequences contain gaps. If the "OM" method is selected, seqdist expects a substitution cost matrix with a row and a column entry for the missing state (symbol defined with the nr option of seqdef). This will be the case for substitution cost matrices returned by seqsubm. More details on how to compute distances with sequences containing gaps are given in Gabadinho et al. (2009).
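To make the Hamming distance and the two normalizations more concrete, here is a minimal base R sketch on two toy sequences. It is not the seqdist implementation: the helper names (hamming, lcp_length, abbott, elzinga) are hypothetical, and turning the Elzinga ratio into a distance as one minus the ratio is an assumption of this sketch.

```r
# Two toy sequences as character vectors (hypothetical data)
x <- c("A", "A", "B", "C")
y <- c("A", "B", "B", "C")

# Hamming distance with a constant substitution cost:
# cost times the number of positions where the states differ
hamming <- function(x, y, cost = 1) {
  stopifnot(length(x) == length(y))  # HAM needs equal lengths
  cost * sum(x != y)
}

# Length of the longest common prefix of two sequences
lcp_length <- function(x, y) {
  n <- min(length(x), length(y))
  if (n == 0) return(0)
  mismatch <- which(x[seq_len(n)] != y[seq_len(n)])
  if (length(mismatch) == 0) n else mismatch[1] - 1
}

# Abbott-style normalization: distance / length of the longer sequence
abbott <- function(d, x, y) d / max(length(x), length(y))

# Elzinga-style normalization: similarity / geometric mean of the lengths,
# converted to a distance as 1 - ratio (assumption of this sketch)
elzinga <- function(sim, x, y) 1 - sim / sqrt(length(x) * length(y))

hamming(x, y)                    # 1: the sequences differ at position 2 only
abbott(hamming(x, y), x, y)      # 0.25
lcp_length(x, y)                 # 1: only the first state is shared
elzinga(lcp_length(x, y), x, y)  # 0.75
```

With sequences of equal length, as here, the geometric mean of the lengths equals the common length, so both normalizations divide by 4.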

References

Elzinga, Cees H. (2008). Sequence analysis: Metric representations of categorical time series. Sociological Methods and Research, In revision.

Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2009). Mining Sequence Data in R with TraMineR: A user's guide for version 1.1. Department of Econometrics and Laboratory of Demography, University of Geneva.

Lesnard, L. (2006). Optimal Matching and Social Sciences. Série des Documents de Travail du CREST, Institut National de la Statistique et des Etudes Economiques, Paris.

Examples

## load the package that provides seqdist, seqdef and seqsubm
library(TraMineR)

## optimal matching distances with substitution cost matrix
## using transition rates
data(biofam)
biofam.seq <- seqdef(biofam, 10:25)
costs <- seqsubm(biofam.seq, method="TRATE")
biofam.om <- seqdist(biofam.seq, method="OM", indel=3, sm=costs)

## normalized LCP distances
biofam.lcp <- seqdist(biofam.seq, method="LCP", norm=TRUE)

## normalized LCS distances to the most frequent sequence in the data set
biofam.lcs <- seqdist(biofam.seq, method="LCS", refseq=0, norm=TRUE)

## histogram of the normalized LCS distances
hist(biofam.lcs)

## =====================
## Example with missings
## =====================
data(ex1)
ex1.seq <- seqdef(ex1,1:13)

subm <- seqsubm(ex1.seq, method="TRATE", with.miss=TRUE)
ex1.om <- seqdist(ex1.seq, method="OM", sm=subm, with.miss=TRUE)
