stringdist (version 0.9.1)

stringdist-package: A package for string distance calculation and approximate string matching.

Description

A package for string distance calculation and approximate string matching.

Introduction

The stringdist package offers fast and platform-independent string metrics. Its main purpose is to compute various string distances and to do approximate text matching between character vectors. A typical use is to match strings that are not precisely the same. For example,

amatch(c("hello","g'day"),c("hi","hallo","ola"),maxDist=2)

returns c(2,NA): "hello" matches closest with "hallo", and their (optimal string alignment) distance of 1 lies within the maximum allowed distance of 2. The second element, "g'day", matches closest with "ola", but since that distance equals 4, which exceeds maxDist, no match is reported.
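To illustrate how maxDist controls the outcome, here is a minimal sketch that runs the same match with a strict and a loose threshold. It assumes the stringdist package is installed; the variable names (x, candidates, m_strict, m_loose) are only illustrative.

```r
library(stringdist)  # assumption: the stringdist package is installed

x          <- c("hello", "g'day")
candidates <- c("hi", "hallo", "ola")

# Strict threshold: "g'day" has no candidate within distance 2.
m_strict <- amatch(x, candidates, maxDist = 2)

# Loose threshold: the closest candidate for "g'day" ("ola", distance 4)
# now falls within the allowed distance.
m_loose <- amatch(x, candidates, maxDist = 5)

# amatch() returns positions in the lookup table; index back into it
# to see the matched strings (NA means no match).
candidates[m_strict]
candidates[m_loose]
```

Because amatch returns positions rather than strings, indexing the lookup table with the result is a convenient way to inspect what was matched.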

A second typical use is to compute string distances. For example

stringdist(c("g'day"),c("hi","hallo","ola"))

returns c(5,5,4), since these are the distances between "g'day" and "hi", "hallo", and "ola", respectively.
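The distance that is reported depends on the chosen metric. As a rough sketch (assuming the stringdist package is installed; the variable names are illustrative), the method argument switches between metrics while the call stays the same:

```r
library(stringdist)  # assumption: the stringdist package is installed

b <- c("hi", "hallo", "ola")

# Default metric: optimal string alignment ("osa").
d_osa <- stringdist("g'day", b)

# Levenshtein distance: like "osa", but without transpositions.
d_lv <- stringdist("g'day", b, method = "lv")

# Jaro distance (Jaro-Winkler with the default p = 0): scaled to [0, 1].
d_jw <- stringdist("g'day", b, method = "jw")

print(d_osa)
print(d_lv)
print(d_jw)
```

Note that the single string "g'day" is recycled against all three elements of b, so each call returns a vector of length three.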

A third typical use is to compute a dist object that can be used to cluster text strings. For example

stringdistmatrix(c("foo","bar","boo","baz"))

returns an object of class dist that can be used by clustering algorithms such as hclust (from the stats package) or those in the cluster package.
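As a small sketch of the clustering use case, the dist object can be passed to hclust and the resulting tree cut into groups. This assumes the stringdist package is installed; the choice of two clusters and the variable names are illustrative, not part of the package documentation.

```r
library(stringdist)  # assumption: the stringdist package is installed

words <- c("foo", "bar", "boo", "baz")

# Pairwise distances between all strings, as an object of class "dist".
d <- stringdistmatrix(words)

# Hierarchical clustering with hclust() from the stats package
# (complete linkage is its default).
hc <- hclust(d)

# Cut the tree into two groups and list the strings in each group.
groups <- unname(cutree(hc, k = 2))
split(words, groups)
```

With these four strings, "foo" and "boo" differ by one character, as do "bar" and "baz", so a two-group cut separates the pairs.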

Besides documentation for each function, the main topics documented are:

  • stringdist-encoding -- how encoding is handled by the package
  • stringdist-parallelization -- on multithreading

Acknowledgements

  • The code for the full Damerau-Levenshtein distance was adapted from Nick Logan's public GitHub repository (https://github.com/ugexe/Text--Levenshtein--Damerau--XS/blob/master/damerau-int.c).
  • C code for converting UTF-8 to integer was copied from the R core for performance reasons.
  • The code for soundex conversion was kindly contributed by Jan van der Laan.

Citation

If you would like to cite this package, please cite the R Journal paper (http://journal.r-project.org/archive/2014-1/loo.pdf):
  • M.P.J. van der Loo (2014). The stringdist package for approximate string matching. R Journal 6(1), pp. 111-122.

The citation can also be retrieved from within R:

citation('stringdist')