- Approximate matching and string distance calculations for R.
- All distance and matching operations are system- and encoding-independent.
- Built for speed, using openMP for parallel computing.
The package offers the following main functions:
- `stringdist` computes pairwise distances between two input character vectors (shorter one is recycled)
- `stringdistmatrix` computes the distance matrix for one or two vectors
- `stringsim` computes a string similarity between 0 and 1, based on `stringdist`
- `amatch` is a fuzzy matching equivalent of R's native `match` function
- `ain` is a fuzzy matching equivalent of R's native `%in%` operator
- `seq_dist`, `seq_distmatrix`, `seq_amatch` and `seq_ain` for distances between, and matching of, integer sequences
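
A minimal sketch of how the main functions fit together, assuming the package is installed from CRAN. The example strings and variable names are illustrative only.

```r
library(stringdist)

# Pairwise distances; the shorter argument ("baz") is recycled.
# The default method is "osa" (Optimal String Alignment).
d <- stringdist(c("foo", "bar"), "baz")

# Distance matrix between two vectors: one row per element of the first
# vector, one column per element of the second.
m <- stringdistmatrix(c("foo", "bar"), c("baz", "boo", "far"))

# Similarity scaled to [0, 1]; 1 means the strings are identical.
s <- stringsim("kitten", "sitting", method = "lv")

# Fuzzy counterparts of match() and %in%: a candidate only counts as a
# match when its distance is at most maxDist.
idx <- amatch("leia", c("uhura", "leela"), maxDist = 2)
hit <- ain("leia", c("uhura", "leela"), maxDist = 2)

# The seq_* variants apply the same idea to integer sequences.
sq <- seq_dist(c(1L, 2L, 3L), c(1L, 2L, 4L))
```

Here `amatch` returns the position of the closest entry in the lookup table (or `NA` when nothing is within `maxDist`), while `ain` only reports whether such an entry exists.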
These functions are built upon C-code that re-implements some common (weighted) string distance functions. Distance functions include:
- Hamming distance
- Levenshtein distance (weighted)
- Restricted Damerau-Levenshtein distance (weighted, a.k.a. Optimal String Alignment)
- Full Damerau-Levenshtein distance
- Longest Common Substring distance
- Q-gram distance
- Cosine distance for q-gram count vectors (= 1 - cosine similarity)
- Jaccard distance for q-gram count vectors (= 1 - Jaccard similarity)
- Jaro and Jaro-Winkler distance
- Soundex-based string distance
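
The distances listed above are selected through the `method` argument of `stringdist`. The following is a rough sketch with made-up example strings. It assumes the default settings of the package for every argument not shown.

```r
library(stringdist)

# Hamming distance: number of differing positions (equal-length strings).
ham <- stringdist("abc", "abd", method = "hamming")

# Restricted vs. full Damerau-Levenshtein: the restricted variant (OSA)
# may not edit a substring twice, so the two can disagree.
osa <- stringdist("ca", "abc", method = "osa")
dl  <- stringdist("ca", "abc", method = "dl")

# Longest common substring distance: characters left unmatched in both strings.
lcs <- stringdist("kitten", "sitting", method = "lcs")

# q-gram based distances; q sets the q-gram length.
qg <- stringdist("abc", "bcd", method = "qgram",   q = 2)
cs <- stringdist("abc", "bcd", method = "cosine",  q = 2)
jc <- stringdist("abc", "bcd", method = "jaccard", q = 2)

# Jaro distance (p = 0) and Jaro-Winkler distance (p > 0 rewards a shared prefix).
jaro <- stringdist("MARTHA", "MATHRA", method = "jw")
jw   <- stringdist("MARTHA", "MATHRA", method = "jw", p = 0.1)

# Soundex-based distance: 0 when the phonetic codes agree, 1 otherwise.
sx <- stringdist("Robert", "Rupert", method = "soundex")
```

The `"ca"` versus `"abc"` pair is a common illustration of the difference between the restricted and the full Damerau-Levenshtein distance.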
Also, there are some utility functions:
- `qgrams()` tabulates the q-grams in one or more character vectors
- `seq_qgrams()` tabulates the q-grams (sometimes called n-grams) in one or more integer vectors
- `phonetic()` computes phonetic codes of strings (currently only soundex)
- `printable_ascii()` detects non-printable ASCII or non-ASCII characters
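
The utility functions can be tried out as follows. This is only a sketch: the inputs are invented for illustration, and the exact layout of the returned tables is best checked in the help pages.

```r
library(stringdist)

# Tabulate 2-grams of a string; the result is a table with one column
# per q-gram encountered.
tab <- qgrams("banana", q = 2)

# The integer-sequence counterpart.
seq_tab <- seq_qgrams(c(1L, 2L, 3L, 1L, 2L), q = 2L)

# Soundex codes: similar-sounding names map to the same code.
codes <- phonetic(c("Robert", "Rupert"))

# Flag strings that consist solely of printable ASCII characters.
ok <- printable_ascii(c("abc", "caf\u00e9"))
```

Checking input with `printable_ascii()` before calling `phonetic()` is a reasonable precaution, since soundex is defined for ASCII letters.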
C functions can be called directly from C code in other packages. The description of the API can be found by either typing `?stringdist_api` in the R console or opening the vignette directly as follows: