stringdist v0.9.6

Approximate String Matching, Fuzzy Text Search, and String Distance Functions

Implements an approximate string matching version of R's native 'match' function. Also offers fuzzy text search based on various string distance measures. Can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors while taking proper care of encoding or between integer vectors representing generic sequences. This package is built for speed and runs in parallel by using 'openMP'. An API for C or C++ is exposed as well.

Readme

Mentioned in Awesome Official Statistics

stringdist

  • Approximate matching and string distance calculations for R.
  • All distance and matching operations are system- and encoding-independent.
  • Built for speed, using openMP for parallel computing.

The package offers the following main functions:

  • stringdist computes pairwise distances between two input character vectors (shorter one is recycled)
  • stringdistmatrix computes the distance matrix for one or two vectors
  • stringsim computes a string similarity between 0 and 1, based on stringdist
  • amatch is a fuzzy matching equivalent of R's native match function
  • ain is a fuzzy matching equivalent of R's native %in% operator
  • seq_dist, seq_distmatrix, seq_amatch and seq_ain compute distances between, and perform matching of, integer sequences.
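
To make the idea behind the matching functions concrete, here is a minimal Python sketch of distance-based fuzzy matching. It is not the package's implementation (which is C code called from R): the names `levenshtein`, `similarity`, `fuzzy_match` and `fuzzy_in` are hypothetical, indices are 0-based with `None` for "no match", and the similarity normalization (dividing by the longer length) is an assumption chosen for illustration.

```python
from typing import List, Optional, Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Unit-cost edit distance between two sequences (strings or integer lists)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: Sequence, b: Sequence) -> float:
    """Map the distance to [0, 1]; 1 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def fuzzy_match(x: List[str], table: List[str], max_dist: int = 1) -> List[Optional[int]]:
    """For each element of x, index of the closest table entry within max_dist."""
    result: List[Optional[int]] = []
    for s in x:
        best_idx, best = None, max_dist + 1
        for idx, t in enumerate(table):
            d = levenshtein(s, t)
            if d < best:  # strict '<' keeps the first of equally close entries
                best_idx, best = idx, d
        result.append(best_idx)
    return result


def fuzzy_in(x: List[str], table: List[str], max_dist: int = 1) -> List[bool]:
    """True where an element of x has an approximate match in table."""
    return [i is not None for i in fuzzy_match(x, table, max_dist)]


print(fuzzy_match(["leia", "spock"], ["uhura", "leela"], max_dist=2))
print(fuzzy_in(["leia", "spock"], ["uhura", "leela"], max_dist=2))
```

Because `levenshtein` only compares elements for equality, the same sketch also works on lists of integers, which is the idea behind the seq_* family.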

These functions are built upon C-code that re-implements some common (weighted) string distance functions. Distance functions include:

  • Hamming distance
  • Levenshtein distance (weighted)
  • Restricted Damerau-Levenshtein distance (weighted, a.k.a. Optimal String Alignment)
  • Full Damerau-Levenshtein distance
  • Longest Common Substring distance
  • Q-gram distance
  • cosine distance for q-gram count vectors (= 1-cosine similarity)
  • Jaccard distance for q-gram count vectors (= 1-Jaccard similarity)
  • Jaro, and Jaro-Winkler distance
  • Soundex-based string distance
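
As an illustration of how two of the edit-based measures in this list differ, the following Python sketch implements an unweighted Hamming distance and an unweighted optimal string alignment (restricted Damerau-Levenshtein) distance. It is a textbook formulation under stated assumptions, not the package's C code: the function names are hypothetical, all edit weights are 1, and returning infinity for strings of unequal length in `hamming` is a convention chosen here.

```python
def hamming(a: str, b: str) -> float:
    """Number of differing positions; infinite if the lengths differ."""
    if len(a) != len(b):
        return float("inf")
    return sum(x != y for x, y in zip(a, b))


def osa(a: str, b: str) -> int:
    """Optimal string alignment distance with unit weights.

    Like Levenshtein, plus transposition of two adjacent characters,
    with the restriction that no substring is edited more than once.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


print(hamming("karolin", "kathrin"))  # 3
print(osa("ab", "ba"))                # 1 (one transposition)
print(osa("ca", "abc"))               # 3
```

The last call shows the restriction at work: the full Damerau-Levenshtein distance between "ca" and "abc" is 2 (transpose to "ac", then insert "b" between the swapped characters), but optimal string alignment does not allow editing a transposed pair again, so it reports 3.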

Also, there are some utility functions:

  • qgrams() tabulates the qgrams in one or more character vectors.
  • seq_qgrams() tabulates the qgrams (sometimes called ngrams) in one or more integer vectors.
  • phonetic() computes phonetic codes of strings (currently only soundex)
  • printable_ascii() is a utility function that detects non-printable ascii or non-ascii characters.
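
The q-gram tabulation that qgrams() performs, and the distances derived from such tables, can be sketched in a few lines of Python. This is an illustrative sketch rather than the package's code: the function names are hypothetical, q defaults to 2 here by choice, the Jaccard variant below compares the sets of distinct q-grams (an assumption about the definition), and returning 0 when neither string has any q-grams is an arbitrary choice for the degenerate case.

```python
from collections import Counter
from math import sqrt


def qgram_counts(s: str, q: int = 2) -> Counter:
    """Count every substring of length q (empty if the string is shorter than q)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(a: str, b: str, q: int = 2) -> int:
    """Sum of absolute differences between the two q-gram count tables."""
    ca, cb = qgram_counts(a, q), qgram_counts(b, q)
    return sum(abs(ca[g] - cb[g]) for g in ca.keys() | cb.keys())


def cosine_distance(a: str, b: str, q: int = 2) -> float:
    """1 minus the cosine similarity of the q-gram count vectors."""
    ca, cb = qgram_counts(a, q), qgram_counts(b, q)
    norm = (sqrt(sum(v * v for v in ca.values()))
            * sqrt(sum(v * v for v in cb.values())))
    if norm == 0:
        return 0.0 if not ca and not cb else 1.0
    dot = sum(ca[g] * cb[g] for g in ca)
    return 1.0 - dot / norm


def jaccard_distance(a: str, b: str, q: int = 2) -> float:
    """1 minus |A ∩ B| / |A ∪ B| over the sets of distinct q-grams."""
    sa, sb = set(qgram_counts(a, q)), set(qgram_counts(b, q))
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


print(qgram_counts("banana"))         # counts: ba=1, an=2, na=2
print(qgram_distance("abc", "abd"))   # 2
print(jaccard_distance("abc", "abd"))
```

For "abc" and "abd" with q = 2 the tables are {ab, bc} and {ab, bd}: one shared q-gram out of three distinct ones, so the Jaccard distance is 1 - 1/3.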

C API

Some of stringdist's underlying C functions can be called directly from C code in other packages. The description of the API can be found by either typing ?stringdist_api in the R console or opening the vignette directly as follows:

vignette("stringdist_C-Cpp_api", package="stringdist")

Examples of packages that link to stringdist can be found here and here.

Resources

  • A paper on stringdist has been published in the R-journal
  • Slides of a talk given at the useR! 2014 conference.

Functions in stringdist

Name                        Description
phonetic                    Phonetic algorithms
stringdist_api              Calling stringdist from C or C++
printable_ascii             Detect the presence of non-printable or non-ascii characters
seq_sim                     Compute similarity scores between sequences of integers
seq_dist                    Compute distance metrics between integer sequences
seq_amatch                  Approximate matching for integer sequences.
afind                       Stringdist-based fuzzy text search
qgrams                      Get a table of qgram counts from one or more character vectors.
stringdist                  Compute distance metrics between strings
stringdist-encoding         String metrics in stringdist
stringdist-package          A package for string distance calculation and approximate string matching.
stringdist-metrics          String metrics in stringdist
seq_qgrams                  Get a table of qgram counts for integer sequences
stringdist-parallelization  Multithreading and parallelization in stringdist
stringsim                   Compute similarity scores between strings
amatch                      Approximate string matching

Vignettes of stringdist

Name
RJournal_6_111-122-2014.Rnw
loo2014stringdist.pdf
stringdist_C-Cpp_api.Rnw
stringdist_api.pdf

Details

License GPL-3
LazyData no
Type Package
LazyLoad yes
URL https://github.com/markvanderloo/stringdist
BugReports https://github.com/markvanderloo/stringdist/issues
Encoding UTF-8
RoxygenNote 7.1.0
NeedsCompilation yes
Packaged 2020-07-16 13:32:28 UTC; mark
Repository CRAN
Date/Publication 2020-07-16 14:00:02 UTC
