stringdist-package: A package for string distance calculation

Description

A package for string distance calculation

Supported distances

The Hamming distance (hamming) counts the number of character substitutions that turn b into a. If a and b have different numbers of characters, or if maxDist is exceeded, Inf is returned.
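
The definition above can be illustrated with a minimal Python sketch. It is an independent illustration, not the package's implementation; the function name hamming_distance and the parameter max_dist (mirroring maxDist) are hypothetical.

```python
import math

def hamming_distance(a: str, b: str, max_dist: float = math.inf) -> float:
    """Count positions where a and b differ (illustrative sketch only)."""
    # Strings of unequal length have no Hamming distance: return infinity.
    if len(a) != len(b):
        return math.inf
    d = sum(ca != cb for ca, cb in zip(a, b))
    # Mirror the maxDist behaviour described above.
    return math.inf if d > max_dist else d
```

For example, hamming_distance("karolin", "kathrin") counts the three positions at which the two strings differ.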

The Levenshtein distance (lv) counts the number of deletions, insertions and substitutions necessary to turn b into a. This method is equivalent to R's native adist function. If maxDist is exceeded Inf is returned.
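
As a sketch of the counting described above, the following Python function uses the usual two-row dynamic program with unit weights. It is not the package's code, and the name levenshtein_distance is hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Unit-cost edit distance with deletions, insertions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]
```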

The Optimal String Alignment distance (osa) is like the Levenshtein distance but also allows transposition of adjacent characters. Here, each substring may be edited only once: for example, a character cannot be transposed twice to move it forward in the string. If maxDist is exceeded, Inf is returned.
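
The restriction that each substring is edited only once shows up as a single extra case in the Levenshtein recursion. The Python sketch below, with the hypothetical name osa_distance, assumes unit weights and is meant only as an illustration.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal String Alignment distance with unit weights (sketch)."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Transposition of two adjacent characters, applied at most once
            # to any given pair.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]
```

With this sketch, osa_distance("ca", "abc") is 3, because the transposed pair cannot be edited again by inserting a character between its members.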

The full Damerau-Levenshtein distance (dl) allows for multiple edits on substrings. If maxDist is exceeded, Inf is returned.
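
To show how multiple edits on a substring change the result, here is a Python sketch of the textbook unrestricted Damerau-Levenshtein recursion with unit weights. The name damerau_levenshtein_distance is hypothetical and the code is not taken from the package.

```python
def damerau_levenshtein_distance(a: str, b: str) -> int:
    """Unrestricted Damerau-Levenshtein distance with unit weights (sketch)."""
    n, m = len(a), len(b)
    big = n + m  # sentinel larger than any possible distance
    # The table is shifted by one row and one column to hold a sentinel border.
    d = [[big] * (m + 2) for _ in range(n + 2)]
    for i in range(n + 1):
        d[i + 1][1] = i
    for j in range(m + 1):
        d[1][j + 1] = j
    last_row = {}  # last 1-based position in a at which each character occurred
    for i in range(1, n + 1):
        last_match_col = 0  # last column in this row where the characters matched
        for j in range(1, m + 1):
            k = last_row.get(b[j - 1], 0)
            l = last_match_col
            cost = 1
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_match_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                            # substitution or match
                d[i + 1][j] + 1,                           # insertion
                d[i][j + 1] + 1,                           # deletion
                d[k][l] + (i - k - 1) + 1 + (j - l - 1),   # transposition with edits in between
            )
        last_row[a[i - 1]] = i
    return d[n + 1][m + 1]
```

Here damerau_levenshtein_distance("ca", "abc") is 2 (transpose, then insert between the transposed characters), where the Optimal String Alignment distance of the same pair is 3.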

The longest common substring (more commonly called the longest common subsequence) is defined as the longest string that can be obtained by pairing characters from a and b while keeping the order of characters intact. The lcs-distance is defined as the number of unpaired characters. The distance is equivalent to the edit distance allowing only deletions and insertions, each with weight one. If maxDist is exceeded, Inf is returned.
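
The number of unpaired characters equals |a| + |b| minus twice the number of paired characters. A small Python sketch, with the hypothetical name lcs_distance, computes it this way.

```python
def lcs_distance(a: str, b: str) -> int:
    """Number of unpaired characters after an order-preserving pairing (sketch)."""
    # prev[j] is the length of the longest pairing between the processed
    # prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    paired = prev[-1]
    return len(a) + len(b) - 2 * paired
```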

A $q$-gram is a subsequence of $q$ consecutive characters of a string. If $x$ ($y$) is the vector of counts of $q$-gram occurrences in a (b), the $q$-gram distance is given by the sum over the absolute differences $|x_i-y_i|$. The computation is aborted when q is larger than the length of any of the strings. In that case Inf is returned.
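
The count vectors $x$ and $y$ can be represented as dictionaries keyed by $q$-gram. The Python sketch below uses the hypothetical names qgram_counts and qgram_distance. It assumes a default of q = 2 and reads "any of the strings" as "either string"; both are assumptions made for illustration.

```python
import math
from collections import Counter

def qgram_counts(s: str, q: int) -> Counter:
    """Count every run of q consecutive characters in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def qgram_distance(a: str, b: str, q: int = 2) -> float:
    """Sum of absolute differences between q-gram counts (sketch)."""
    # Assumption: abort with infinity when q exceeds the length of either string.
    if q > min(len(a), len(b)):
        return math.inf
    x, y = qgram_counts(a, q), qgram_counts(b, q)
    # A Counter returns 0 for q-grams it has not seen.
    return sum(abs(x[g] - y[g]) for g in x.keys() | y.keys())
```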

The cosine distance is computed as $1-x\cdot y/(\|x\|\|y\|)$, where $x$ and $y$ were defined above.
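
A sketch of this formula in Python, using $q$-gram count dictionaries for $x$ and $y$. The name cosine_distance, the default q = 2, and the choice to return infinity when either norm is zero are assumptions for illustration, not documented behaviour of the package.

```python
import math
from collections import Counter

def cosine_distance(a: str, b: str, q: int = 2) -> float:
    """One minus the cosine of the angle between q-gram count vectors (sketch)."""
    x = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    y = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    # Only q-grams present in both strings contribute to the inner product.
    dot = sum(x[g] * y[g] for g in x.keys() & y.keys())
    norm = math.sqrt(sum(v * v for v in x.values())) * math.sqrt(sum(v * v for v in y.values()))
    if norm == 0:
        return math.inf  # assumption: undefined when a string has no q-grams
    return 1 - dot / norm
```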

Let $X$ be the set of unique $q$-grams in a and $Y$ the set of unique $q$-grams in b. The Jaccard distance is given by $1-|X\cap Y|/|X\cup Y|$.
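
In Python, sets of $q$-grams map directly onto the built-in set type. The sketch below uses the hypothetical name jaccard_distance, a default of q = 2, and returns 0 when neither string has any $q$-grams; all three are assumptions for illustration.

```python
def jaccard_distance(a: str, b: str, q: int = 2) -> float:
    """One minus the ratio of shared to total unique q-grams (sketch)."""
    X = {a[i:i + q] for i in range(len(a) - q + 1)}
    Y = {b[i:i + q] for i in range(len(b) - q + 1)}
    union = X | Y
    if not union:
        return 0.0  # assumption: two strings without q-grams are treated as identical
    return 1 - len(X & Y) / len(union)
```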

The Jaro distance (method='jw', p=0) is a number between 0 (exact match) and 1 (completely dissimilar) measuring dissimilarity between strings. It is defined to be 0 when both strings have length 0, and 1 when there are no character matches between a and b. Otherwise, the Jaro distance is defined as $1-(1/3)(w_1m/|a| + w_2m/|b| + w_3(m-t)/m)$. Here, $|a|$ indicates the number of characters in a, $m$ is the number of character matches and $t$ the number of transpositions of matching characters. The $w_i$ are weights associated with the characters in a, characters in b and with transpositions. A character $c$ of a matches a character from b when $c$ occurs in b, and the index of $c$ in a differs by less than $\max(|a|,|b|)/2 -1$ (where we use integer division) from the index of $c$ in b. Two matching characters are transposed when they are matched but occur in a different order in strings a and b.
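
The following Python sketch works through the formula above. Several details are assumptions rather than statements about the package: the matching window is taken inclusively (index difference at most $\max(|a|,|b|)/2 - 1$, floored at 0), each character of b is matched at most once, and $t$ is counted as half the number of matched characters that appear in a different order, which is the common convention. The name jaro_distance and the weights argument are hypothetical.

```python
def jaro_distance(a: str, b: str, weights=(1.0, 1.0, 1.0)) -> float:
    """Jaro distance following the formula above (illustrative sketch)."""
    if not a and not b:
        return 0.0
    # Assumption: inclusive matching window, never negative.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    b_used = [False] * len(b)
    a_matches = []
    for i, ca in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_used[j] and b[j] == ca:
                b_used[j] = True
                a_matches.append(ca)
                break
    m = len(a_matches)
    if m == 0:
        return 1.0
    b_matches = [cb for cb, used in zip(b, b_used) if used]
    # Assumption: t is half the number of matched characters that are out of order.
    t = sum(ca != cb for ca, cb in zip(a_matches, b_matches)) / 2
    w1, w2, w3 = weights
    return 1 - (w1 * m / len(a) + w2 * m / len(b) + w3 * (m - t) / m) / 3
```

For "martha" and "marhta" this gives $m = 6$ and $t = 1$, so the distance is $1 - (1/3)(1 + 1 + 5/6)$, roughly 0.056.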

The Jaro-Winkler distance (method='jw', 0<p<=0.25) adds a correction term to the Jaro distance. It is defined as $d - l\cdot p\cdot d$, where $d$ is the Jaro distance. Here, $l$ is the length of the common prefix of the two input strings, that is, the number of characters before the first mismatch, with a maximum of four. The factor $p$ is a penalty factor, which in the work of Winkler is often chosen to be $0.1$.
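
The correction can be sketched in Python as a function that takes an already computed Jaro distance $d$. The names common_prefix_length and apply_winkler_correction are hypothetical, and the code only illustrates the formula above.

```python
def common_prefix_length(a: str, b: str, max_len: int = 4) -> int:
    """Number of leading characters shared by a and b, capped at max_len."""
    l = 0
    for ca, cb in zip(a, b):
        if ca != cb or l == max_len:
            break
        l += 1
    return l

def apply_winkler_correction(d: float, a: str, b: str, p: float = 0.1) -> float:
    """Return d - l*p*d, where d is the Jaro distance between a and b."""
    l = common_prefix_length(a, b)
    return d - l * p * d
```

For "martha" and "marhta" the common prefix "mar" gives $l = 3$, so with $p = 0.1$ the Jaro distance is reduced by 30 percent.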

For the soundex method, strings are translated to a soundex code (see phonetic for a specification). The distance between strings is 0 when they have the same soundex code, otherwise 1. Note that soundex recoding is only meaningful for characters in the ranges a-z and A-Z. A warning is emitted when non-printable or non-ASCII characters are encountered. Also see printable_ascii.

References

Mark P.J. van der Loo (2014) Approximate text matching with the stringdist package. The R Journal 6(1) pp 111-122.

An extensive overview of offline string matching algorithms is given by L. Boytsov (2011). Indexing methods for approximate dictionary searching: comparative analyses. ACM Journal of Experimental Algorithmics 16, 1-88. An extensive overview of (online) string matching algorithms is given by G. Navarro (2001). A guided tour to approximate string matching. ACM Computing Surveys 33, 31-88. Many algorithms are available in pseudocode from Wikipedia: http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance.

A good reference for qgram distances is E. Ukkonen (1992), Approximate string matching with q-grams and maximal matches. Theoretical Computer Science, 92, 191-211.

Wikipedia (http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) describes the Jaro-Winkler distance used in this package. Unfortunately, there seems to be no single definition for the Jaro distance in the literature. For example, Cohen, Ravikumar and Fienberg (Proceedings of IIWEB03, Vol 47, 2003) report a different matching window for characters in strings a and b.