stringdist: Compute distance metrics between strings

Description

Compute distance metrics between strings

Usage

stringdist(a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram",
  "cosine", "jaccard", "jw"), useBytes = FALSE, weight = c(d = 1, i = 1, s =
  1, t = 1), maxDist = Inf, q = 1, p = 0)
stringdistmatrix(a, b, method = c("osa", "lv", "dl", "hamming", "lcs",
  "qgram", "cosine", "jaccard", "jw"), useBytes = FALSE, weight = c(d = 1, i
  = 1, s = 1, t = 1), maxDist = Inf, q = 1, p = 0, ncores = 1,
  cluster = NULL)

Arguments

R object (target); will be converted by as.character.

R object (source); will be converted by as.character.

method

Method for distance calculation. The default is "osa" (see details).

useBytes

Perform byte-wise comparison. useBytes=TRUE is faster but may yield different results depending on character encoding. See also below, under ``encoding issues''.

weight

For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. When method='lv', the penalty for transposition is ignored. When method='jw', the we

maxDist

[DEPRECATED AND WILL BE REMOVED] Maximum string distance for edit-like distances, in some cases computation is stopped when maxDist is reached. maxDist=Inf means calculation goes on untill the distance is computed. Does n

Size of the $q$-gram; must be nonnegative. Only applies to method='qgram', 'jaccard' or 'cosine'.

Penalty factor for Jaro-Winkler distance. The valid range for p is

0 <= p="" <="0.25.  If
  p=0 (default), the Jaro-distance is returned.
  Applies only to method='jw'.

ncores

Number of cores to use. If ncores>1, a local cluster is created using makeCluster. Parallelisation is over b, so the speed gain by parallelisation is highest wh

cluster

(Optional) a custom cluster, created with makeCluster. If cluster is not NULL, ncores is ignored.

Value

For stringdist, a vector with string distances of size max(length(a),length(b)). For stringdistmatrix, a length(a)xlength(b) matrix. The returned distance is nonnegative if it can be computed, NA if any of the two argument strings is NA and Inf when it cannot be computed or maxDist is exceeded. See details for the meaning of Inf for the various algorithms.

Details

stringdist computes pairwise string distances between elements of character vectors a and b, where the vector with less elements is recycled. stringdistmatrix computes the string distance matrix with rows according to a and columns according to b.

Currently, the following distance metrics are supported: ll{ osa Optimal string aligment, (restricted Damerau-Levenshtein distance). lv Levenshtein distance (as in R's native adist). dl Full Damerau-Levenshtein distance. hamming Hamming distance (a and b must have same nr of characters). lcs Longest common substring distance. qgram $q$-gram distance. cosine cosine distance between $q$-gram profiles jaccard Jaccard distance between $q$-gram profiles jw Jaro, or Jaro-Winker distance. } The Hamming distance (hamming) counts the number of character substitutions that turns b into a. If a and b have different number of characters or if maxDist is exceeded, Inf is returned.

The Levenshtein distance (lv) counts the number of deletions, insertions and substitutions necessary to turn b into a. This method is equivalent to R's native adist function. If maxDist is exceeded Inf is returned.

The Optimal String Alignment distance (osa) is like the Levenshtein distance but also allows transposition of adjacent characters. Here, each substring may be edited only once so a character cannot be transposed twice. If maxDist is exceeded Inf is returned.

The full Damerau-Levensthein distance (dl) allows for multiple transpositions. If maxDist is exceeded Inf is returned.

The longest common substring is defined as the longest string that can be obtained by pairing characters from a and b while keeping the order of characters intact. The lcs-distance is defined as the number of unpaired characters. The distance is equivalent to the edit distance allowing only deletions and insertions, each with weight one. If maxDist is exceeded Inf is returned.

A $q$-gram is a subsequence of $q$ consecutive characters of a string. If $x$ ($y$) is the vector of counts of $q$-gram occurrences in a (b), the $q$-gram distance is given by the sum over the absolute differences $|x_i-y_i|$. The computation is aborted when q is is larger than the length of any of the strings. In that case Inf is returned.

The cosine distance is computed as $1-x\cdot y/(\|x\|\|y\|)$, where $x$ and $y$ were defined above.

Let $X$ be the set of unique $q$-grams in a and $Y$ the set of unique $q$-grams in b. The Jaccard distance is given by $1-|X\cap Y|/|X\cup Y|$.

The Jaro distance (method='jw', p=0), is a number between 0 (exact match) and 1 (completely dissimilar) measuring dissimilarity between strings. It is defined to be 0 when both strings have length 0, and 1 when there are no character matches between a and b. Otherwise, the Jaro distance is defined as $1-(1/3)(w_1m/|a| + w_2m/|b| + w_3(m-t)/m)$. Here,$|a|$ indicates the number of characters in a, $m$ is the number of character matches and $t$ the number of transpositions of matching characters. The $w_i$ are weights associated with the characters in a, characters in b and with transpositions. A character $c$ of a matches a character from b when $c$ occurs in b, and the index of $c$ in a differs less than $\max(|a|,|b|)/2 -1$ (where we use integer division) from the index of $c$ in b. Two matching characters are transposed when they are matched but they occur in different order in string a and b.

The Jaro-Winkler distance (method=jw, 0) adds a correction term to the Jaro-distance. It is defined as $d - l*p*d$, where $d$ is the Jaro-distance. Here, $l$ is obtained by counting, from the start of the input strings, after how many characters the first character mismatch between the two strings occurs, with a maximum of four. The factor $p$ is a penalty factor, which in the work of Winkler is often chosen $0.1$.

Encoding issues

If bytes=FALSE, input strings are re-encoded to utf8 an then to integer vectors prior to the distance calculation (since the underlying C-code expects unsigned ints). This double conversion is necessary as it seems the only way to reliably convert (possibly multibyte) characters to integers on all systems supported by R. R's native adist function does this as well.

If bytes=TRUE, the input strings are treated as if each byte was a single character. This may be significantly faster since it avoids conversion of utf8 to integer with utf8ToInt (up to a factor of 3, for strings of 5-25 characters). However, results may depend on the (possibly multibyte) character encoding scheme and note that R's internal encoding scheme is OS-dependent. If you're sure that all your input is ASCII, you can safely set useBytes=TRUE to profit from the speed gain on any platform.

See base R's Encoding and iconv documentation for details on how R handles character encoding.

Paralellization

The stringdistmatrix function uses makeCluster to create a local cluster and compute the distance matrix in parallel when ncores>1. The cluster is terminated after the matrix has been computed. As the cluster is local, the ncores parameter should not be larger than the number of cores on your machine. Use detectCores to check the number of cores available. Alternatively, you can create a cluster using makeCluster and pass that to stringdistmatrix (through the cluster argument. This allows you to reuse the cluster setup for other calculations. There is overhead in creating clusters, so creating the cluster yourself is a good choice if you want to call stringdistmatrix multiple times, for example in a loop.

References

R.W. Hamming (1950). Error detecting and Error Correcting codes, The Bell System Technical Journal 29, 147-160

V.I. Levenshtein. (1960). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10 707-711. F.J. Damerau (1964) A technique for computer detection and correction of spelling errors. Communications of the ACM 7 171-176. An extensive overview of (online) string matching algorithms is given by G. Navarro (2001). A guided tour to approximate string matching, ACM Computing Surveys 33 31-88. Many algorithms are available in pseudocode from wikipedia: http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance. The code for the full Damerau-Levenshtein distance was adapted from Nick Logan's https://github.com/ugexe/Text--Levenshtein--Damerau--XS/blob/master/damerau-int.c{public github repository}.

A good reference for qgram distances is E. Ukkonen (1992), Approximate string matching with q-grams and maximal matches. Theoretical Computer Science, 92, 191-211.

http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance{Wikipedia} describes the Jaro-Winker distance used in this package. Unfortunately, there seems to be no single definition for the Jaro distance in literature. For example Cohen, Ravikumar and Fienberg (Proceeedings of IIWEB03, Vol 47, 2003) report a different matching window for characters in strings a and b.

Raffael Vogler wrote a nice http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/{blog} comparing different string distances in this package.

Examples

Run this code

# Simple example using optimal string alignment
stringdist("ca","abc")

# The same example using Damerau-Levenshtein distance (multiple editing of substrings allowed)
stringdist("ca","abc",method="dl")

# string distance matching is case sensitive:
stringdist("ABC","abc")

# so you may want to normalize a bit:
stringdist(tolower("ABC"),"abc")

# stringdist recycles the shortest argument:
stringdist(c('a','b','c'),c('a','c'))

# stringdistmatrix gives the distance matrix (by default for optimal string alignment):
stringdist(c('a','b','c'),c('a','c'))

# different edit operations may be weighted; e.g. weighted substitution:
stringdist('ab','ba',weight=c(1,1,1,0.5))

# Non-unit weights for insertion and deletion makes the distance metric asymetric
stringdist('ca','abc')
stringdist('abc','ca')
stringdist('ca','abc',weight=c(0.5,1,1,1))
stringdist('abc','ca',weight=c(0.5,1,1,1))

# Hamming distance is undefined for 
# strings of unequal lengths so stringdist returns Inf
stringdist("ab","abc",method="h")
# For strings of eqal length it counts the number of unequal characters as they occur
# in the strings from beginning to end
stringdist("hello","HeLl0",method="h")

# The lcm (longest common substring) distance returns the number of 
# characters that are not part of the lcs.
#
# Here, the lcs is either 'a' or 'b' and one character cannot be paired:
stringdist('ab','ba',method="lcs")
# Here the lcs is 'surey' and 'v', 'g' and one 'r' of 'surgery' are not paired
stringdist('survey','surgery',method="lcs")


# q-grams are based on the difference between occurrences of q consecutive characters
# in string a and string b.
# Since each character abc occurs in 'abc' and 'cba', the q=1 distance equals 0:
stringdist('abc','cba',method='qgram',q=1)

# since the first string consists of 'ab','bc' and the second 
# of 'cb' and 'ba', the q=2 distance equals 4 (they have no q=2 grams in common):
stringdist('abc','cba',method='qgram',q=2)

# Wikipedia has the following example of the Jaro-distance. 
stringdist('MARTHA','MATHRA',method='jw')
# Note that stringdist gives a  _distance_ where wikipedia gives the corresponding 
# _similarity measure_. To get the wikipedia result:
1 - stringdist('MARTHA','MATHRA',method='jw')

# The corresponding Jaro-Winkler distance can be computed by setting p=0.1
stringdist('MARTHA','MATHRA',method='jw',p=0.1)
# or, as a similarity measure
1 - stringdist('MARTHA','MATHRA',method='jw',p=0.1)

Run the code above in your browser using DataLab