stringdist (version 0.9.12)

# seq_dist: Compute distance metrics between integer sequences

## Description

`seq_dist` computes pairwise string distances between elements of `a` and `b`, where the argument with fewer elements is recycled. `seq_distmatrix` computes the distance matrix with rows according to `a` and columns according to `b`.

## Usage

```r
seq_dist(
  a,
  b,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw"),
  weight = c(d = 1, i = 1, s = 1, t = 1),
  q = 1,
  p = 0,
  bt = 0,
  nthread = getOption("sd_num_thread")
)

seq_distmatrix(
  a,
  b,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw"),
  weight = c(d = 1, i = 1, s = 1, t = 1),
  q = 1,
  p = 0,
  bt = 0,
  useNames = c("names", "none"),
  nthread = getOption("sd_num_thread")
)
```

## Value

`seq_dist` returns a numeric vector with pairwise distances between `a` and `b` of length `max(length(a), length(b))`.

For `seq_distmatrix` there are two options. If `b` is missing, the `dist` object corresponding to the `length(a) X length(a)` distance matrix is returned. If `b` is specified, the `length(a) X length(b)` distance matrix is returned.

If any element of `a` or `b` is `NA_integer_`, the distance to any matched integer vector is `NA`. Missing values inside a sequence are treated as an ordinary number and receive no special handling (see also the examples).
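The return shapes described above can be checked with a short sketch. It assumes the `stringdist` package is installed; the input sequences are made up for illustration.

```r
library(stringdist)  # assumption: the stringdist package is installed

# Hypothetical inputs: one target sequence and three source sequences.
a <- list(c(1L, 2L, 3L))
b <- list(c(1L, 2L, 3L), c(1L, 2L, 4L), c(1L, 2L))

# `a` is recycled over `b`, so the result has
# max(length(a), length(b)) entries.
d <- seq_dist(a, b)
length(d)

# With `b` missing, a `dist` object is returned.
m1 <- seq_distmatrix(b)
class(m1)

# With `b` given, a length(a) x length(b) matrix is returned.
m2 <- seq_distmatrix(a, b)
dim(m2)
```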

## Arguments

a

(`list` of) `integer` or `numeric` vector(s). Will be converted with `as.integer` (target).

b

(`list` of) `integer` or `numeric` vector(s). Will be converted with `as.integer` (source). Optional for `seq_distmatrix`.

method

Distance metric. See `stringdist-metrics`.

weight

For `method='osa'` or `'dl'`, the penalty for deletion, insertion, substitution and transposition, in that order. When `method='lv'`, the penalty for transposition is ignored. When `method='jw'`, the weights associated with characters of `a`, characters of `b` and the transposition weight, in that order. Weights must be positive and not exceed 1. `weight` is ignored completely when `method='hamming'`, `'qgram'`, `'cosine'`, `'jaccard'`, or `'lcs'`.

q

Size of the \(q\)-gram; must be nonnegative. Only applies to `method='qgram'`, `'jaccard'` or `'cosine'`.

p

Prefix factor for Jaro-Winkler distance. The valid range for `p` is `0 <= p <= 0.25`. If `p=0` (default), the Jaro distance is returned. Applies only to `method='jw'`.

bt

Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than `bt`. Applies only to `method='jw'` and `p>0`.

nthread

Maximum number of threads to use. By default, a sensible number of threads is chosen, see `stringdist-parallelization`.

useNames

Label the output matrix with `names(a)` and `names(b)`?
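As a rough illustration of how these arguments interact, the sketch below compares a few methods on one pair of sequences. It assumes the `stringdist` package is installed; the sequences, the weight values and the list names are invented for the example.

```r
library(stringdist)  # assumption: the stringdist package is installed

# Hypothetical sequences: y is x with two adjacent elements swapped.
x <- list(c(1L, 2L, 3L, 4L))
y <- list(c(1L, 3L, 2L, 4L))

# 'osa' counts the swap as a single transposition.
d_osa <- seq_dist(x, y, method = "osa")

# 'lv' has no transposition, so the swap costs two substitutions.
d_lv <- seq_dist(x, y, method = "lv")

# Lower the transposition penalty (order: d, i, s, t).
d_w <- seq_dist(x, y, method = "osa",
                weight = c(d = 1, i = 1, s = 1, t = 0.5))

# `q` only matters for the q-gram based methods.
d_q <- seq_dist(x, y, method = "qgram", q = 2)

# `p` (and `bt`) only matter for Jaro-Winkler.
d_jw <- seq_dist(x, y, method = "jw", p = 0.1)

# Names on the input lists can be used to label the distance matrix.
named <- list(first = c(1L, 2L, 3L), second = c(1L, 2L, 4L))
m <- seq_distmatrix(named, named, useNames = "names")
```

Here `d_osa` is 1 and `d_lv` is 2, which follows directly from the definitions of the two metrics; the remaining values are best inspected interactively.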

## Notes

Input vectors are converted with `as.integer`. This causes truncation for numeric vectors (e.g. `pi` will be treated as `3L`).
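The effect of this conversion can be seen with base R alone. The sketch below does not call `stringdist`; it only shows what `as.integer` does to numeric input, using made-up vectors `u` and `v`.

```r
# Base R only: this demonstrates the conversion itself.
as.integer(pi)            # 3
as.integer(c(1.9, -1.9))  # 1 -1 (truncation toward zero, not rounding)

# Two hypothetical numeric sequences that differ only in their
# fractional parts become identical after conversion.
u <- c(1.2, 2.7, 3.5)
v <- c(1.9, 2.1, 3.0)
identical(as.integer(u), as.integer(v))  # TRUE
```

Since the inputs are converted before any distance is computed, sequences such as `u` and `v` would be compared as the same integer sequence. Round or rescale numeric data explicitly beforehand if the fractional part matters.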

## See Also

`seq_sim`, `seq_amatch`, `seq_qgrams`

## Examples

```r
# Distances between lists of integer vectors. Note the postfix 'L' to force
# integer storage. The shorter argument (a) is recycled.
a <- list(c(102L, 107L))                        # fk
b <- list(c(102L,111L,111L),c(102L,111L,111L))  # foo, foo
seq_dist(a,b)

# translate strings to a list of integer sequences
a <- lapply(c("foo","bar","baz"),utf8ToInt)
seq_distmatrix(a)

# Note how missing values are treated. NA's as part of the sequence are treated
# as an integer (the representation of NA_integer_).
a <- list(NA_integer_,c(102L, 107L))
b <- list(c(102L,111L,111L),c(102L,111L,NA_integer_))
seq_dist(a,b)

if (FALSE) {
# Distance between sentences based on word order. Note: words must match exactly or they
# are treated as completely different.
#
# For this example you need to have the 'hashr' package installed.
a <- "Mary had a little lamb"
a.words <- strsplit(a,"[[:blank:]]+")
a.int <- hashr::hash(a.words)
b <- c("a little lamb had Mary",