RecordLinkage (version 0.4-12.4)

strcmp: String Metrics

Description

Functions for computation of the similarity between two strings.

Usage

jarowinkler(str1, str2, W_1=1/3, W_2=1/3, W_3=1/3, r=0.5)
levenshteinSim(str1, str2)
levenshteinDist(str1, str2)

Value

A numeric vector with similarity values in the interval \([0,1]\). For levenshteinDist, the edit distance as an integer vector.

Arguments

str1,str2

Two character vectors to compare.

W_1,W_2,W_3

Adjustable weights used by jarowinkler (see Details).

r

Maximum transposition radius for jarowinkler, given as a fraction of the length of the shorter string.

Author

Andreas Borg, Murat Sariyar

Details

String metrics compute a similarity value in the range \([0,1]\) for two strings, with 1 denoting the highest (usually equality) and 0 denoting the lowest degree of similarity. In the context of Record Linkage, string similarities can improve the discernibility between matches and non-matches.

jarowinkler is an implementation of the algorithm by Jaro and Winkler (see references). For the meaning of W_1, W_2, W_3 and r see the referenced article. For most applications, the default values are reasonable.
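To give a rough idea of how three weights can enter a Jaro-type comparator, the following is a minimal sketch of the textbook Jaro similarity. It is not the code behind jarowinkler: the helper name jaro_sketch is hypothetical, the matching window uses the common textbook rule rather than the r argument, the Winkler prefix adjustment is omitted, and only a single pair of strings is handled.

```r
# Hypothetical helper: textbook Jaro similarity with three weights.
# NOT RecordLinkage's jarowinkler(); no prefix adjustment, no r argument.
jaro_sketch <- function(s1, s2, W_1 = 1/3, W_2 = 1/3, W_3 = 1/3) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  la <- length(a)
  lb <- length(b)
  if (la == 0 || lb == 0) return(0)

  # Assumption: textbook matching window, floor(max length / 2) - 1
  win <- max(floor(max(la, lb) / 2) - 1, 0)
  a_hit <- rep(FALSE, la)
  b_hit <- rep(FALSE, lb)

  # Greedily pair each character of s1 with an unused equal character
  # of s2 that lies inside the window
  for (i in seq_len(la)) {
    lo <- max(1, i - win)
    hi <- min(lb, i + win)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!b_hit[j] && a[i] == b[j]) {
        a_hit[i] <- TRUE
        b_hit[j] <- TRUE
        break
      }
    }
  }

  m <- sum(a_hit)
  if (m == 0) return(0)

  # Transpositions: half the number of matched characters out of order
  t <- sum(a[a_hit] != b[b_hit]) / 2

  W_1 * m / la + W_2 * m / lb + W_3 * (m - t) / m
}

jaro_sketch("MARTHA", "MARHTA")  # 17/18, about 0.944
```

With equal weights, identical strings score 1 and strings without any matching character score 0. Results from jarowinkler can differ from this sketch because of the prefix adjustment and the r argument.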

levenshteinDist returns the Levenshtein distance. Because this is a count of edit operations rather than a similarity value in \([0,1]\), it cannot be used directly as a string comparator.

levenshteinSim is a similarity function based on the Levenshtein distance, calculated by \(1-\frac{\mathrm{d}(\mathit{str}_{1},\mathit{str}_{2})}{\max(A,B)}\), where \(\mathrm{d}\) is the Levenshtein distance function and \(A\) and \(B\) are the lengths of the strings.
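The formula above can be reproduced with base R alone, which may help when checking results by hand. The sketch below is only an illustration: lev_sim_sketch is a hypothetical helper, not a function of this package, and it relies on utils::adist (with its default unit costs) for the edit distance.

```r
# Hypothetical helper: 1 - d(str1, str2) / max(A, B) using base R only.
# NOT RecordLinkage's levenshteinSim().
lev_sim_sketch <- function(str1, str2) {
  # Edit distance for each pair; mapply recycles the shorter vector
  d <- mapply(function(a, b) utils::adist(a, b)[1, 1],
              str1, str2, USE.NAMES = FALSE)
  # Length of the longer string in each pair
  longest <- pmax(nchar(str1), nchar(str2))
  1 - d / longest
}

# One deletion, longer string has 7 characters: 1 - 1/7
lev_sim_sketch("Andreas", "Anreas")
```

Note that the sketch does not special-case a pair of empty strings, for which the denominator is zero.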

Arguments str1 and str2 are expected to be of type "character". Non-alphabetical characters can be processed. Valid format combinations for the arguments are:

  • Two arrays with the same dimensions.

  • Two vectors. The shorter one is recycled as necessary.

References

Winkler, W.E.: String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. In: Proceedings of the Section on Survey Research Methods, American Statistical Association (1990), pp. 354-369.

Examples

# compare two strings:
jarowinkler("Andreas","Anreas")
# compare one string with several others:
levenshteinSim("Andreas",c("Anreas","Andeas"))
# compare two vectors of strings:
jarowinkler(c("Andreas","Borg"),c("Andreas","Bork"))
