stringdist (version 0.9.0)

stringdist: Compute distance metrics between strings

Description

stringdist computes pairwise string distances between elements of character vectors a and b, where the vector with fewer elements is recycled. stringdistmatrix computes the string distance matrix with rows according to a and columns according to b.

Usage

stringdist(a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram",
  "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1,
  i = 1, s = 1, t = 1), maxDist = Inf, q = 1, p = 0,
  nthread = getOption("sd_num_thread"))

stringdistmatrix(a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram",
  "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1,
  i = 1, s = 1, t = 1), maxDist = Inf, q = 1, p = 0, useNames = FALSE,
  ncores = 1, cluster = NULL, nthread = getOption("sd_num_thread"))

Arguments

a
R object (target); will be converted by as.character.
b
R object (source); will be converted by as.character.
method
Method for distance calculation. The default is "osa", see stringdist-metrics.
useBytes
Perform byte-wise comparison, see stringdist-encoding.
weight
For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. When method='lv', the penalty for transposition is ignored. When method='jw', the weights ass
maxDist
[DEPRECATED AND WILL BE REMOVED] Currently kept for backward compatibility. It does not offer any speed gain. (In fact, it currently slows things down when set to anything different from Inf).
q
Size of the q-gram; must be nonnegative. Only applies to method='qgram', 'jaccard' or 'cosine'.
p
Penalty factor for Jaro-Winkler distance. The valid range for p is 0 <= p <= 0.25. If p=0 (default), the Jaro-distance is returned. Applies only to method='jw'.
nthread
Maximum number of threads to use. By default, a sensible number of threads is chosen, see stringdist-parallelization.
useNames
Use input vectors as row and column names?
ncores
[DEPRECATED AND WILL BE REMOVED] Use nthread instead. See the note on parallelization of stringdistmatrix below.
cluster
(Optional) a custom cluster, created with makeCluster.
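
As a small illustration of the arguments above, the following sketch labels a distance matrix with its input strings through useNames and compares the Jaro and Jaro-Winkler distances through p. It assumes the stringdist package is installed; the input strings are made up for the example.

```r
library(stringdist)

a <- c("foo", "bar")
b <- c("fu", "baz", "boo")

# useNames = TRUE labels rows with the strings in a and columns with those in b
m <- stringdistmatrix(a, b, useNames = TRUE)
dimnames(m)

# p = 0 gives the Jaro distance; 0 < p <= 0.25 gives Jaro-Winkler,
# which is smaller for strings that share a common prefix
d_jaro <- stringdist("MARTHA", "MATHRA", method = "jw", p = 0)
d_jw   <- stringdist("MARTHA", "MATHRA", method = "jw", p = 0.1)
d_jw < d_jaro
```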

Value

  • For stringdist, a vector with string distances of size max(length(a),length(b)).

    For stringdistmatrix, a length(a)xlength(b) matrix.

    Distances are nonnegative if they can be computed, NA if either of the two argument strings is NA, and Inf when maxDist is exceeded or, in the case of the hamming distance, when the two compared strings have different lengths.
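
The return shapes and special values described above can be checked with a short sketch. It assumes the stringdist package is installed; the input vectors are made up for the example.

```r
library(stringdist)

a <- c("foo", "bar", "baz", NA)
b <- c("fu", "bar")

# b is recycled, so the result has max(length(a), length(b)) elements
d <- stringdist(a, b)
length(d)

# An NA input string gives an NA distance
is.na(d[4])

# stringdistmatrix returns a length(a) x length(b) matrix
m <- stringdistmatrix(a, b)
dim(m)

# Hamming distance between strings of unequal length is Inf
h <- stringdist("ab", "abc", method = "hamming")
is.infinite(h)
```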

Note on parallelization of stringdistmatrix

In older versions (<0.9) of stringdist, the cluster and ncores arguments were the only parallelization options, and only for stringdistmatrix. These options are based on the parallel package, which starts multiple R-sessions to run R code in parallel. If you're running R on a single machine, it is both faster and easier to use the default multithreading, so do not specify ncores or cluster in that case.

As of the introduction of the nthread argument, the ncores argument is mostly useless, although it still works. If ncores>0, a local cluster of R-sessions is set up automatically. Each R-session will use nthread threads.

The cluster argument is only of interest when the cluster is set up over different physical nodes, for example a network of nodes across physically different machines. On each node, nthread threads will be used.
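
On a single machine, the multithreaded route described above might look like the sketch below. It assumes the stringdist package is installed; the thread counts are arbitrary, and the check at the end only reflects the expectation that the number of threads affects speed, not results.

```r
library(stringdist)

a <- c("survey", "surgery", "martha")
b <- c("surveys", "marhta")

# Set the thread count per call; no ncores or cluster needed
m1 <- stringdistmatrix(a, b, nthread = 1)
m2 <- stringdistmatrix(a, b, nthread = 2)

# The distances should not depend on the number of threads
identical(m1, m2)

# The default of nthread is read from this option, so it can be set once per session
options(sd_num_thread = 2)
```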

Examples

# Simple example using optimal string alignment
stringdist("ca","abc")

# The same example using Damerau-Levenshtein distance (multiple editing of substrings allowed)
stringdist("ca","abc",method="dl")

# string distance matching is case sensitive:
stringdist("ABC","abc")

# so you may want to normalize a bit:
stringdist(tolower("ABC"),"abc")

# stringdist recycles the shortest argument:
stringdist(c('a','b','c'),c('a','c'))

# stringdistmatrix gives the distance matrix (by default for optimal string alignment):
stringdistmatrix(c('a','b','c'),c('a','c'))

# different edit operations may be weighted; e.g. weighted transposition (the fourth weight):
stringdist('ab','ba',weight=c(1,1,1,0.5))

# Non-unit weights for insertion and deletion make the distance metric asymmetric
stringdist('ca','abc')
stringdist('abc','ca')
stringdist('ca','abc',weight=c(0.5,1,1,1))
stringdist('abc','ca',weight=c(0.5,1,1,1))

# Hamming distance is undefined for 
# strings of unequal lengths so stringdist returns Inf
stringdist("ab","abc",method="h")
# For strings of equal length it counts the number of unequal characters as they occur
# in the strings from beginning to end
stringdist("hello","HeLl0",method="h")

# The lcs (longest common substring) distance returns the number of 
# characters that are not part of the lcs.
#
# Here, the lcs is either 'a' or 'b' and one character cannot be paired:
stringdist('ab','ba',method="lcs")
# Here the lcs is 'surey' and 'v', 'g' and one 'r' of 'surgery' are not paired
stringdist('survey','surgery',method="lcs")


# q-grams are based on the difference between occurrences of q consecutive characters
# in string a and string b.
# Since each of the characters a, b and c occurs once in both 'abc' and 'cba', the q=1 distance equals 0:
stringdist('abc','cba',method='qgram',q=1)

# since the first string consists of 'ab','bc' and the second 
# of 'cb' and 'ba', the q=2 distance equals 4 (they have no q=2 grams in common):
stringdist('abc','cba',method='qgram',q=2)

# Wikipedia has the following example of the Jaro-distance. 
stringdist('MARTHA','MATHRA',method='jw')
# Note that stringdist gives a _distance_ where Wikipedia gives the corresponding
# _similarity measure_. To get the Wikipedia result:
1 - stringdist('MARTHA','MATHRA',method='jw')

# The corresponding Jaro-Winkler distance can be computed by setting p=0.1
stringdist('MARTHA','MATHRA',method='jw',p=0.1)
# or, as a similarity measure
1 - stringdist('MARTHA','MATHRA',method='jw',p=0.1)

# This gives distance 1 since Euler and Gauss translate to different soundex codes.
stringdist('Euler','Gauss',method='soundex')
# Euler and Ellery translate to the same code and have distance 0
stringdist('Euler','Ellery',method='soundex')
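
The q-gram comments in the examples above can be reproduced by hand. The following is a minimal base-R sketch, not the package's implementation: qgrams and qgram_dist are hypothetical helpers introduced only for illustration, and they assume q >= 1 and single strings as input.

```r
# Hypothetical helper (not part of stringdist): all substrings of length q
qgrams <- function(s, q) {
  n <- nchar(s)
  if (n < q) return(character(0))
  substring(s, seq_len(n - q + 1), seq(q, n))
}

# Hypothetical helper: sum of absolute differences between q-gram counts
qgram_dist <- function(a, b, q) {
  ga <- qgrams(a, q)
  gb <- qgrams(b, q)
  keys <- union(ga, gb)
  sum(vapply(keys,
             function(k) as.numeric(abs(sum(ga == k) - sum(gb == k))),
             numeric(1)))
}

qgrams("abc", 2)             # "ab" "bc"
qgram_dist("abc", "cba", 1)  # 0: same single characters in both strings
qgram_dist("abc", "cba", 2)  # 4: no 2-grams in common
```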
