seqdist: Distances (dissimilarities) between sequences

Description

Computes pairwise dissimilarities between sequences or dissimilarity from a reference sequence. Several dissimilarity measures can be chosen, including optimal matching (OM) and many of its variants, distance based on the count of common attributes, and distances between sequence state distributions.

Usage

seqdist(seqdata, method, refseq = NULL, norm = "none", indel = 1.0, sm = NULL,
  with.missing = FALSE, full.matrix = TRUE, kweights = rep(1.0, ncol(seqdata)),
  tpow = 1.0, expcost = 0.5, context, link = "mean", h = 0.5, nu,
  transindel = "constant", otto, previous = FALSE, add.column = TRUE,
  breaks = NULL, step = 1, overlap = FALSE, weighted = TRUE, 
  global.pdotj = NULL, prox = NULL)

Arguments

seqdata

State Sequence Object. The sequence data to use. It can be created with the seqdef function.

method

String. The dissimilarity measure to use. It can be "OM", "OMloc", "OMslen", "OMspell", "OMstran", "HAM", "DHD", "CHI2", "EUCLID", "LCS", "LCP", "RLCP", "NMS", "NMSMST", "SVRspell", or "TWED". See the Details section.

refseq

NULL, Integer, or State Sequence Object. Default: NULL. The baseline sequence to compute the distances from.

The most frequent sequence (0) or a sequence in seqdata at a specified index (strictly greater than 0) when an integer and method is one "OM", "OMloc", "OMslen", "OMspell", "HAM", "DHD", "LCS", "LCP", "RLCP", "NMS", "NMSMST", "SVRspell", or "TWED".

An external sequence when a state sequence object and method is one of "OM", "HAM", "DHD", "LCS", "LCP", or "RLCP". It must have a single row and the same alphabet as seqdata.

norm

String. Default: "none". The normalization to use when method is one of "OM", "HAM", "DHD", "LCS", "LCP", "RLCP", "CHI2", "EUCLID". It can be "none", "auto", or, except for "CHI2" and "EUCLID", "maxlength", "gmean", "maxdist", or "YujianBo". "auto" is equivalent to "maxlength" when method is one of "OM", "HAM", or "DHD", to "gmean" when method is one of "LCS", "LCP", or "RLCP". See the Details section.

indel

Double or Vector of Doubles. Default: 1.0. Insertion/deletion cost(s).

The single state-independent insertion/deletion cost when a double and method is one of "OM", "OMslen", "OMspell", "OMstran", or "TWED".

The state-dependent insertion/deletion costs when a vector of doubles and method = "OM" or method = "OMstran". It contains an indel cost for each state in the same order as the alphabet.

NULL, Matrix, Array, or String. Substitution costs. Default: NULL.

The substitution-cost matrix when a matrix and method is one of "OM", "OMloc", "OMslen", "OMspell", "OMstran", "HAM", or "TWED".

The series of the substitution-cost matrices when an array and method = "DHD". They are grouped in a 3-dimensional array with the third index referring to the position in the sequence.

The name of a seqcost method when a string and method is one of "OM", "OMloc", "OMslen", "OMspell", "OMstran", "HAM", "DHD", or "TWED". The method is used to build sm. It can be "INDELS" or "INDELSLOG" for "OM", "OMloc", "OMslen", "OMspell", "OMstran", "HAM", and "TWED", "CONSTANT" for "OM" and "HAM", "TRATE" for "OM", "HAM", and "DHD".

sm is mandatory when method is one of "OM", "OMloc", "OMslen", "OMspell", "OMstran", or "TWED".

sm is autogenerated when method is one of "HAM" or "DHD" and sm = NULL. See the Details section.

Note: With method = "NMS" or method = "SVRspell", see prox instead.

with.missing

Logical. Default: FALSE. When method isn't "OMslen" or "OMstran", should the non-deleted gap (missing value) be added to the alphabet as an additional state? If FALSE and seqdata or refseq contains such gaps, an error is raised.

full.matrix

Logical. Default: TRUE. When refseq = NULL, if TRUE, the full distance matrix is returned, if FALSE, an object of class dist is returned, that is, a vector containing only values from the upper triangle of the distance matrix. Objects of class dist are smaller and can be passed directly as arguments to most clustering functions.

kweights

Vector of Doubles. Default: vector of 1.0. The weights applied to subsequences when method is one of "NMS", "NMSMST", or "SVRspell". It contains at position \(k\) the weight applied to the subsequences of length \(k\). It must be positive. Its length must be equal to the number of columns of seqdata.

tpow

Double. Default: 1.0. The exponential weight of spell length when method is one of "OMspell", "NMSMST", or "SVRspell".

expcost

Double. Default: 0.5. The cost of spell length transformation when method = "OMloc" or method = "OMspell". It must be positive. The exact interpretation is distance-dependent.

context

Double. Default: 1-2*expcost. The cost of local insertion when method = "OMloc". It must be positive.

link

String. Default: "mean". The function used to compute substitution costs when method = "OMslen". One of "mean" (arithmetic average) or "gmean" (geometric mean as in the original proposition of Halpin 2010).

Double. Default: 0.5. It must be greater than or equal to 0.

The exponential weight of spell length when method = "OMslen".

The gap penalty when method = "TWED". It corresponds to the lambda in Halpin (2014), p 88.

Double. Stiffness when method = "TWED". It must be strictly greater than 0. See Halpin (2014), p 88.

transindel

String. Default: "constant". Method for computing transition indel costs when method = "OMstran". One of "constant" (single indel of 1.0), "subcost" (based on substitution costs), or "prob" (based on transition probabilities).

otto

Double. The origin-transition trade-off weight when method = "OMstran". It must be in [0, 1].

Logical. Default: FALSE. When method = "OMstran", should we also account for the transition from the previous state?

add.column

Logical. Default: TRUE. When method = "OMstran", should the last column (and also the first column when previous = TRUE) be duplicated?

breaks

NULL, List of pairs Integers. Default: NULL. The list of the possibly overlapping intervals when method = "CHI2" or method = "EUCLID".

step

Integer. Default: 1. The length of the intervals when method = "CHI2" or method = "EUCLID" and breaks = NULL. It must be positive. It must also be even when overlap = TRUE.

overlap

Logical. Default: FALSE. When method = "CHI2" or method = "EUCLID" and breaks = NULL, should the intervals overlap?

weighted

Logical. Default: TRUE. When method is "CHI2", should the distributions of the states account for the sequence weights in seqdata? See seqdef.

global.pdotj

Numerical vector, "obs", or NULL. Default: NULL. Only for method = "CHI2". The vector of state proportions to be used as marginal distribution. When NULL, the state distribution on the corresponding interval is used. When "obs", the overall state distribution in seqdata is used for all intervals. When a vector of proportions, it is used as marginal distribution for all intervals.

prox

NULL or Matrix. Default: NULL. The matrix of state proximities when method = "NMS" or method = "SVRspell".

Value

When refseq is NULL (default), the whole matrix of pairwise distances between sequences or, if full.matrix = FALSE, the corresponding dist object of pairwise distances between sequences is returned. Otherwise, a vector with distances between the sequences in the state sequence object and the reference sequence specified with refseq is returned.

Details

The seqdist function returns a matrix of distances between sequences or a vector of distances from the reference sequence when refseq is set. The available metrics (see method option) include:

Edit distances: optimal matching ("OM"), localized OM ("OMloc"), spell-length-sensitive OM ("OMslen"), OM of spell sequences ("OMspell"), OM of transition sequences ("OMstran"), Hamming ("HAM"), dynamic Hamming ("DHD"), and the time warp edit distance ("TWED").
Metrics based on counts of common attributes: distance based on the longest common subsequence ("LCS"), on the longest common prefix ("LCP"), on the longest common suffix ("RLCP"), on the number of matching subsequences ("NMS"), on the number of matching subsequences weighted by the minimum shared time ("NMSMST") and, the subsequence vectorial representation distance ("SVRspell").
Distances between state distributions: Euclidean ("EUCLID"), Chi-squared ("CHI2").

See Studer and Ritschard (2014, 2016) for a description and the comparison of the above dissimilarity measures except "TWED" for which we refer to Marteau (2009) and Halpin (2014).

Each method can be controlled with the following parameters:

method	parameters
------------------	---------------------------------
`OM`	`sm, indel, norm, refseq`
`OMloc`	`sm, expcost, context, refseq`
`OMslen`	`sm, indel, link, h, refseq`
`OMspell`	`sm, indel, tpow, expcost, refseq`
`OMstran`	`sm, indel, transindel, otto, previous, add.column`
`HAM, DHD`	`sm, norm, refseq`
`CHI2`	`breaks, step, overlap, norm, weighted, global.pdotj`
`EUCLID`	`breaks, step, overlap, norm`
`LCS, LCP, RLCP`	`norm, refseq`
`NMS`	`prox, kweights, refseq`
`NMSMST`	`kweights, tpow, refseq`
`SVRspell`	`prox, kweights, tpow, refseq`
`TWED`	`sm, indel, h, nu, refseq`

"LCS" is "OM" with a substitution cost of 2 (sm = "CONSTANT", cval = 2) and an indel of 1.0. "HAM" is "OM" without indels. "DHD" is "HAM" with specific substitution costs at each position.

"HAM" and "DHD" apply only to sequences of equal length.

When sm = NULL, the substitution-cost matrix is automatically created for "HAM" with a single substitution cost of 1 and for "DHD" with the costs derived from the transition rates at the successive positions.

Some distances can optionally be normalized by means of the norm argument. If set to "auto", Elzinga's normalization (similarity divided by geometrical mean of the two sequence lengths) is applied to "LCS", "LCP" and "RLCP" distances, while Abbott's normalization (distance divided by length of the longer sequence) is used for "OM", "HAM" and "DHD". Elzinga's method can be forced with "gmean" and Abbott's rule with "maxlength". With "maxdist" the distance is normalized by its maximal possible value. For more details, see Gabadinho et al. (2009, 2011). Finally, "YujianBo" is the normalization proposed by Yujian and Bo (2007) that preserves the triangle inequality. The square of the "CHI2" and "EUCLID" distances are normalized by the number of intervals and by the maximal distance on each interval. Note that for 'CHI2' the maximal distance on each interval depends on the state distribution on the interval.

When sequences contain gaps and the gaps = NA option was passed to seqdef (i.e. when there are non deleted missing values), the with.missing argument should be set as TRUE. If left as FALSE the function stops when it encounters a gap. This is to make the user aware that there are gaps in the sequences. For methods that need an sm value, seqdist expects a substitution-cost matrix with a row and a column entry for the missing state (symbol defined with the nr option of seqdef). Substitution-cost matrices returned by seqcost (and so seqsubm) include these additional entries when the function is called with with.missing = TRUE. More details on how to compute distances with sequences containing gaps can be found in Gabadinho et al. (2009).

References

Studer, M. and G. Ritschard (2016), "What matters in differences between life trajectories: A comparative review of sequence dissimilarity measures", Journal of the Royal Statistical Society, Series A. 179(2), 481-511. DOI: 10.1111/rssa.12125

Studer, M. and G. Ritschard (2014). "A Comparative Review of Sequence Dissimilarity Measures". LIVES Working Papers, 33. NCCR LIVES, Switzerland. DOI: 10.12682/lives.2296-1658.2014.33

Gabadinho, A., G. Ritschard, N. S. M<U+00FC>ller and M. Studer (2011). Analyzing and Visualizing State Sequences in R with TraMineR. Journal of Statistical Software 40(4), 1--37.

Gabadinho, A., G. Ritschard, M. Studer and N. S. M<U+00FC>ller (2009). Mining Sequence Data in R with the TraMineR package: A user's guide Department of Econometrics and Laboratory of Demography, University of Geneva

Halpin, B. (2014). Three Narratives of Sequence Analysis, in Blanchard, P., B<U+00FC>hlmann, F. and Gauthier, J.-A. (Eds.) Advances in Sequence Analysis: Theory, Method, Applications, Vol 2 of Series Life Course Research and Social Policies, pages 75--103, Heidelberg: Springer. DOI: 10.1007/978-3-319-04969-4_5

Marteau, P.-F. (2009). Time Warp Edit Distances with Stiffness Adjustment for Time Series Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 306--318. DOI: 10.1109/TPAMI.2008.76

Yujian, L. and Bo, L. (2007). A normalized Levenshtein distance metric. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6), 1091--1095. DOI: 10.1109/TPAMI.2007.1078

See also all references in Studer and Ritschard (2014, 2016)

Examples

Run this code

# NOT RUN {
## ========================
## Example without missings
## ========================

## Defining a sequence object with columns 10 to 25 of a
## subset of the 'biofam' data set
data(biofam)
biofam.seq <- seqdef(biofam[501:600, 10:25])

## OM distances with a substitution-cost matrix derived
## from transition rates
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3,
  sm = "TRATE")

## OM distances using the vector of estimated indels and
## substitution costs derived from the estimated indels
costs <- seqcost(biofam.seq, method = "INDELSLOG")
biofam.om <- seqdist(biofam.seq, method = "OM",
  indel = costs$indel, sm = costs$sm)

## Normalized LCP distances
biofam.lcp.n <- seqdist(biofam.seq, method = "LCP",
  norm = "auto")

## Normalized LCS distances to the most frequent sequence
biofam.dref1 <- seqdist(biofam.seq, method = "LCS",
  refseq = 0, norm = "auto")

## LCS distances to an external sequence
ref <- seqdef(as.matrix("(0,5)-(3,5)-(4,6)"), informat = "SPS",
  alphabet = alphabet(biofam.seq))
biofam.dref2 <- seqdist(biofam.seq, method = "LCS",
  refseq = ref)

## Chi-squared distance over the full observed timeframe
biofam.chi.full <- seqdist(biofam.seq, method = "CHI2",
  step = max(seqlength(biofam.seq)))

## Chi-squared distance over successive overlaping
## intervals of length 4
biofam.chi.ostep <- seqdist(biofam.seq, method = "CHI2",
  step = 4, overlap = TRUE)


## =====================
## Example with missings
## =====================
data(ex1)
ex1.seq <- seqdef(ex1[, 1:13])

## OM with substitution costs based on transition
## probabilities and indel set as half the maximum
## substitution cost
costs.tr <- seqcost(ex1.seq, method = "TRATE",
  with.missing = TRUE)
ex1.om <- seqdist(ex1.seq, method = "OM",
  indel = costs.tr$indel, sm = costs.tr$sm,
  with.missing = TRUE)

## Localized OM
ex1.omloc <- seqdist(ex1.seq, method = "OMloc",
  indel = costs.tr$indel, sm = costs.tr$sm,
  with.missing = TRUE)

## OM of spells
ex1.omspell <- seqdist(ex1.seq, method = "OMspell",
  sm = costs.tr$sm, indel = costs.tr$indel,
  with.missing = TRUE)

## Distance based on number of matching subsequences
ex1.nms <- seqdist(ex1.seq, method = "NMS",
  with.missing = TRUE)

## Using the sequence vetorial representation metric
costs.fut <- seqcost(ex1.seq, method = "FUTURE", lag = 4,
  proximities = TRUE, with.missing = TRUE)
ex1.svr <- seqdist(ex1.seq, method = "SVRspell",
  prox = costs.fut$prox, with.missing = TRUE)
# }

Run the code above in your browser using DataLab