Learn R Programming

stringdist (version 0.4-0)

stringdist: Compute distance between strings

Description

Compute distance between strings

Usage

stringdist(a, b, method = c("osa", "lv", "dl", "h"),
    weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = 0)

stringdistmatrix(a, b, method = c("osa", "lv", "dl", "h"), weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = 0, ncores = 1)

Arguments

a
R object (target); will be converted by as.character.
b
R object (source); will be converted by as.character.
method
Method for distance calculation (see details)
weight
The penalty for deletion, insertion, substitution and transposition, in that order. Weights must be positive and not exceed 1. weight[4] is ignored when method='lv' and weight is ignored completely when
maxDist
Maximum string distance before calculation is stopped, maxDist=0 means calculation goes on untill the distance is computed.
ncores
number of cores to use. Parallelisation is over b, so the speed gain by parallelisation is highest when b is shorter than a.

Value

  • For stringdist, a vector with string distances of size max(length(a),length(b)). For stringdistmatrix, a length(a)xlength(b) matrix. The returned distance is -1 when maxDist is exceeded and NA if any of a or b is NA.

Details

stringdist computes pairwise string distances between elements of character vectors a and b, where the shorter argument is recycled. stringdistmatrix computes the string distance matrix with rows according to a and columns according to b.

The string distance metrics in this package are based on counting the (weighted) number of edit operations it takes to turn character b into character a. Currently, the following distance metrics are supported: ll{ osa Optimal string aligment, (restricted Damerau-Levenshtein distance). lv Levenshtein distance. dl Full Damerau-Levenshtein distance. h Hamming distance (a and b must have same nr of characters). } The Hamming distance counts the number of character substitutions that turns b into a so a and b must have the same number of characters. The Levenshtein distance allows deletions, insertions and substitutions. The Optimal String Alignment distance also allows transpositions, but each substring may be edited only once, so a character cannot be transposed twice. The Damerau-Levensthein distance alows multiple transpositions.

Encoding issues

Input strings are re-encoded to utf8 an then to integer vectors prior to the distance calculation (which works on unsigned ints). This double conversion is necessary as it seems the only way to reliably convert (possibly multibyte) characters to integers on all systems supported by R. (R's native adist function does this as well). See Encoding for further details.

Paralellization

The stringdistmatrix function uses makeCluster to generate a cluster and compute the distance matrix in parallel. As the cluster is local, the ncores parameter should not be larger than the number of cores on your machine. Use detectCores to check the number of cores available.

References

  • R.W. Hamming (1950). Error detecting and Error Correcting codes, The Bell System Technical Journal 29, 147-160
V.I. Levenshtein. (1960). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10 707-711. F.J. Damerau (1964) A technique for computer detection and correction of spelling errors. Communications of the ACM 7 171-176. Many algorithms are available in pseudocode from wikipedia: http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance.

Examples

Run this code
# Simple example using optimal string alignment
stringdist("ca","abc")

# The same example using Damerau-Levenshtein distance (multiple editing of substrings allowed)
stringdist("ca","abc",method="dl")

# string distance matching is case sensitive:
stringdist("ABC","abc")

# so you may want to normalize a bit:
stringdist(tolower("ABC"),"abc")

# stringdist recycles the shortest argument:
stringdist(c('a','b','c'),c('a','c'))

# stringdistmatrix gives the distance matrix:
stringdist(c('a','b','c'),c('a','c'))

# different edit operations may be weighted; e.g. weighted substitution:
stringdist('ab','ba',weight=c(1,1,1,0.5))

# Non-unit weights for insertion and deletion makes the distance metric asymetric
stringdist('ca','abc')
stringdist('abc','ca')
stringdist('ca','abc',weight=c(0.5,1,1,1))
stringdist('abc','ca',weight=c(0.5,1,1,1))

Run the code above in your browser using DataLab