stringdist: Compute distance metrics between strings

Description

Compute distance metrics between strings

Usage

stringdist(a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram",
  "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1,
  i = 1, s = 1, t = 1), maxDist = Inf, q = 1, p = 0)
stringdistmatrix(a, b, method = c("osa", "lv", "dl", "hamming", "lcs",
  "qgram", "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE,
  weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = Inf, q = 1, p = 0,
  useNames = FALSE, ncores = 1, cluster = NULL)

Arguments

R object (target); will be converted by as.character.

R object (source); will be converted by as.character.

method

Method for distance calculation. The default is "osa" (see details).

useBytes

Perform byte-wise comparison. useBytes=TRUE is faster but may yield different results depending on character encoding. See also below, under ``encoding issues''.

weight

For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. When method='lv', the penalty for transposition is ignored. When method='jw', the weights ass

maxDist

[DEPRECATED AND WILL BE REMOVED] Currently kept for backward compatibility. It does not offer any speed gain. (In fact, it currently slows things down when set to anything different from Inf).

Size of the $q$-gram; must be nonnegative. Only applies to method='qgram', 'jaccard' or 'cosine'.

Penalty factor for Jaro-Winkler distance. The valid range for p is

0 <= p="" <="0.25.
 If p=0 (default), the Jaro-distance is returned. Applies only to method='jw'.

useNames

Use input vectors as row and column names?

ncores

Number of cores to use. If ncores>1, a local cluster is created using makeCluster. Parallelisation is over b, so the speed gain by parallelisation is highest when

cluster

(Optional) a custom cluster, created with makeCluster. If cluster is not NULL, ncores is ignored.

Value

For stringdist, a vector with string distances of size max(length(a),length(b)). For stringdistmatrix, a length(a)xlength(b) matrix. The returned distance is nonnegative if it can be computed, NA if any of the two argument strings is NA and Inf when maxDist is exceeded or, in case of the hamming distance, when the two compared strings have different length.

Details

stringdist computes pairwise string distances between elements of character vectors a and b, where the vector with less elements is recycled.

stringdistmatrix computes the string distance matrix with rows according to a and columns according to b.

Currently, the following distance metrics are supported: ll{ osa Optimal string aligment, (restricted Damerau-Levenshtein distance). lv Levenshtein distance (as in R's native adist). dl Full Damerau-Levenshtein distance. hamming Hamming distance (a and b must have same nr of characters). lcs Longest common substring distance. qgram $q$-gram distance. cosine cosine distance between $q$-gram profiles jaccard Jaccard distance between $q$-gram profiles jw Jaro, or Jaro-Winker distance. soundex Distance based on soundex encoding (see below) }

A short description of these algorithms is proveded here, or see the http://journal.r-project.org/archive/2014-1/loo.pdf{R Journal Paper} (external link) for more formal descriptions.

Encoding issues

If bytes=FALSE, input strings are re-encoded to utf8 an then to integer vectors prior to the distance calculation (since the underlying C-code expects unsigned ints). This double conversion is necessary as it seems the only way to reliably convert (possibly multibyte) characters to integers on all systems supported by R. R's native adist function does this as well.

If bytes=TRUE, the input strings are treated as if each byte was a single character. This may be significantly faster since it avoids conversion of utf8 to integer with utf8ToInt (up to a factor of 3, for strings of 5-25 characters). However, results may depend on the (possibly multibyte) character encoding scheme and note that R's internal encoding scheme is OS-dependent. If you're sure that all your input is ASCII, you can safely set useBytes=TRUE to profit from the speed gain on any platform.

See base R's Encoding and iconv documentation for details on how R handles character encoding.

Unicode normalisation

In utf-8, the same (accented) character may be represented as several byte sequences. For example, an u-umlaut can be represented with a single byte code or as a byte code representing 'u' followed by a modifier byte code that adds the umlaut. The http://cran.r-project.org/web/packages/stringi/{stringi} package of Gagolevski and Tartanus offers unicode normalisation tools.

Paralellization

The stringdistmatrix function uses makeCluster to create a local cluster and compute the distance matrix in parallel when ncores>1. The cluster is terminated after the matrix has been computed. As the cluster is local, the ncores parameter should not be larger than the number of cores on your machine. Use detectCores to check the number of cores available. Alternatively, you can create a cluster using makeCluster and pass that to stringdistmatrix (through the cluster argument. This allows you to reuse the cluster setup for other calculations. There is overhead in creating clusters, so creating the cluster yourself is a good choice if you want to call stringdistmatrix multiple times, for example in a loop.

Acknowledgement

The code for the full Damerau-Levenshtein distance was adapted from Nick Logan's https://github.com/ugexe/Text--Levenshtein--Damerau--XS/blob/master/damerau-int.c{public github repository}.

Citation

If you would like to cite this package, please cite the http://journal.r-project.org/archive/2014-1/loo.pdf{R Journal Paper}:

M.P.J. van der Loo (2014). Thestringdistpackage for approximate string matching. R Journal 6(1) pp 111-122

code

citation('stringdist')

other

Raffael Vogler wrote a nicehttp://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/{blog} comparing different string distances in this package.

Examples

Run this code

# Simple example using optimal string alignment
stringdist("ca","abc")

# The same example using Damerau-Levenshtein distance (multiple editing of substrings allowed)
stringdist("ca","abc",method="dl")

# string distance matching is case sensitive:
stringdist("ABC","abc")

# so you may want to normalize a bit:
stringdist(tolower("ABC"),"abc")

# stringdist recycles the shortest argument:
stringdist(c('a','b','c'),c('a','c'))

# stringdistmatrix gives the distance matrix (by default for optimal string alignment):
stringdist(c('a','b','c'),c('a','c'))

# different edit operations may be weighted; e.g. weighted substitution:
stringdist('ab','ba',weight=c(1,1,1,0.5))

# Non-unit weights for insertion and deletion makes the distance metric asymetric
stringdist('ca','abc')
stringdist('abc','ca')
stringdist('ca','abc',weight=c(0.5,1,1,1))
stringdist('abc','ca',weight=c(0.5,1,1,1))

# Hamming distance is undefined for 
# strings of unequal lengths so stringdist returns Inf
stringdist("ab","abc",method="h")
# For strings of eqal length it counts the number of unequal characters as they occur
# in the strings from beginning to end
stringdist("hello","HeLl0",method="h")

# The lcm (longest common substring) distance returns the number of 
# characters that are not part of the lcs.
#
# Here, the lcs is either 'a' or 'b' and one character cannot be paired:
stringdist('ab','ba',method="lcs")
# Here the lcs is 'surey' and 'v', 'g' and one 'r' of 'surgery' are not paired
stringdist('survey','surgery',method="lcs")


# q-grams are based on the difference between occurrences of q consecutive characters
# in string a and string b.
# Since each character abc occurs in 'abc' and 'cba', the q=1 distance equals 0:
stringdist('abc','cba',method='qgram',q=1)

# since the first string consists of 'ab','bc' and the second 
# of 'cb' and 'ba', the q=2 distance equals 4 (they have no q=2 grams in common):
stringdist('abc','cba',method='qgram',q=2)

# Wikipedia has the following example of the Jaro-distance. 
stringdist('MARTHA','MATHRA',method='jw')
# Note that stringdist gives a  _distance_ where wikipedia gives the corresponding 
# _similarity measure_. To get the wikipedia result:
1 - stringdist('MARTHA','MATHRA',method='jw')

# The corresponding Jaro-Winkler distance can be computed by setting p=0.1
stringdist('MARTHA','MATHRA',method='jw',p=0.1)
# or, as a similarity measure
1 - stringdist('MARTHA','MATHRA',method='jw',p=0.1)

# This gives distance 1 since Euler and Gauss translate to different soundex codes.
stringdist('Euler','Gauss',method='soundex')
# Euler and Ellery translate to the same code and have distance 0
stringdist('Euler','Ellery',method='soundex')

Run the code above in your browser using DataLab