stringdist: Compute distance between strings

Description

Compute distance between strings

Usage

stringdist(a, b,
    method = c("osa", "lv", "dl", "h", "lcs", "qgram"),
    weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = 0,
    q = 1)
  stringdistmatrix(a, b,
    method = c("osa", "lv", "dl", "h", "lcs", "qgram"),
    weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = 0,
    q = 1, ncores = 1, cluster = NULL)

Arguments

R object (target); will be converted by as.character.

R object (source); will be converted by as.character.

method

Method for distance calculation (see details)

weight

The penalty for deletion, insertion, substitution and transposition, in that order. Weights must be positive and not exceed 1. weight[4] is ignored when method='lv' and weight is ignored completely when

maxDist

Maximum string distance before calculation is stopped, maxDist=0 means calculation goes on untill the distance is computed. Ignored for method='qgram'.

size of the $q$-gram, must be nonnegative. Ignored for all but method='qgram'.

ncores

number of cores to use. If ncores>1, a local cluster is created using makeCluster. Parallelisation is over b, so the speed gain by parallelisation is highest wh

cluster

(optional) a custom cluster, created with makeCluster. If cluster is not NULL, ncores is ignored.

Value

For stringdist, a vector with string distances of size max(length(a),length(b)). For stringdistmatrix, a length(a)xlength(b) matrix. The returned distance is nonnegative if it can be computed, NA if any of the two argument strings is NA and -1 when it cannot be computed. See details for the meaning of -1 for the various algorithms.

Details

stringdist computes pairwise string distances between elements of character vectors a and b, where the vector with less elements is recycled. stringdistmatrix computes the string distance matrix with rows according to a and columns according to b.

Currently, the following distance metrics are supported: ll{ osa Optimal string aligment, (restricted Damerau-Levenshtein distance). lv Levenshtein distance. dl Full Damerau-Levenshtein distance. h Hamming distance (a and b must have same nr of characters). lcs Longest common substring. qgram $q$-gram distance. } The Hamming distance counts the number of character substitutions that turns b into a. If a and b have different number of characters -1 is returned.

The Levenshtein distance (ld) counts the number of deletions, insertions and substitutions necessary to turn b into a. This method is equivalent to R's native adist function. The computation is aborted when maxDist is exceeded, in which case -1 is returned.

The Optimal String Alignment distance (osa) is like the Levenshtein distance but also allows transposition of adjacent characters. Here, each substring may be edited only once so a character cannot be transposed twice. The computation is aborted when maxDist is exceeded, in which case -1 is returned.

The full Damerau-Levensthein distance (dl) allows for multiple transpositions. The computation is aborted when maxDist is exceeded, in which case -1 is returned.

The longest common substring is defined as the longest string that can be obtained by pairing characters from a and b while keeping the order of characters intact. The lcs-distance is defined as the number of unpaired characters. The distance is equivalent to the edit distance allowing only deletions and insertions, each with weight one. The computation is aborted when maxDist is exceeded, in which case -1 is returned.

A $q$-gram is a subsequence of $q$ consecutive characters of a string. If $x$ ($y$) is the vector of counts of $q$-gram occurrences in a (b), the $q$-gram distance is given by the sum over the absolute differences $|x_i-y_i|$. The computation is aborted when q is is larger than the length of any of the strings. In that case -1 is returned.

Encoding issues

Input strings are re-encoded to utf8 an then to integer vectors prior to the distance calculation (since the underlying C-code expects unsigned ints). This double conversion is necessary as it seems the only way to reliably convert (possibly multibyte) characters to integers on all systems supported by R. (R's native adist function does this as well). See Encoding for further details.

Paralellization

The stringdistmatrix function uses makeCluster to generate a cluster and compute the distance matrix in parallel. As the cluster is local, the ncores parameter should not be larger than the number of cores on your machine. Use detectCores to check the number of cores available. Alternatively, you can create a cluster by yourself, using makeCluster and pass that to stringdistmatrix.

References

R.W. Hamming (1950). Error detecting and Error Correcting codes, The Bell System Technical Journal 29, 147-160

V.I. Levenshtein. (1960). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10 707-711. F.J. Damerau (1964) A technique for computer detection and correction of spelling errors. Communications of the ACM 7 171-176. Many algorithms are available in pseudocode from wikipedia: http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance. The code for the full Damerau-Levenshtein distance was adapted from Nick Logan's public github repository: https://github.com/ugexe/Text--Levenshtein--Damerau--XS/blob/master/damerau-int.c.

A good reference for qgram distances is E. Ukkonen (1992), Approximate string matching with q-grams and maximal matches. Theoretical Computer Science, 92, 191-211.

Examples

Run this code

# Simple example using optimal string alignment
stringdist("ca","abc")

# The same example using Damerau-Levenshtein distance (multiple editing of substrings allowed)
stringdist("ca","abc",method="dl")

# string distance matching is case sensitive:
stringdist("ABC","abc")

# so you may want to normalize a bit:
stringdist(tolower("ABC"),"abc")

# stringdist recycles the shortest argument:
stringdist(c('a','b','c'),c('a','c'))

# stringdistmatrix gives the distance matrix (by default for optimal string alignment):
stringdist(c('a','b','c'),c('a','c'))

# different edit operations may be weighted; e.g. weighted substitution:
stringdist('ab','ba',weight=c(1,1,1,0.5))

# Non-unit weights for insertion and deletion makes the distance metric asymetric
stringdist('ca','abc')
stringdist('abc','ca')
stringdist('ca','abc',weight=c(0.5,1,1,1))
stringdist('abc','ca',weight=c(0.5,1,1,1))

# Hamming distance is undefined (or sometimes called infinite) for 
# strings of unequal lengths  (stringdist returns -1)
stringdist("ab","abc",method="h")
# For strings of eqal length it counts the number of unequal characters as they occur
# in the strings from beginning to end
stringdist("hello","HeLl0",method="h")

# The lcm (longest common substring) distance returns the number of 
# characters that are not part of the lcs.
#
# Here, the lcs is either 'a' or 'b' and one character cannot be paired:
stringdist('ab','ba',method="lcs")
# Here the lcs is 'surey' and 'v', 'g' and one 'r' of 'surgery' are not paired
stringdist('survey','surgery',method="lcs")


# q-grams are based on the difference between occurrences of q consecutive characters.
#
# since each character abc occurs in 'abc' and 'cba', the q=1 distance equals 0:
stringdist('abc','cba',method='qgram',q=1)

# since the first string consists of 'ab','bc' and the second 
# of 'cb' and 'ba', the q=2 distance equals 4 (they have no q=2 grams in common):
stringdist('abc','cba',method='qgram',q=2)

Run the code above in your browser using DataLab