String metrics compute a similarity value in the range \([0,1]\) for two strings, with 1 denoting the highest (usually equality) and 0 denoting the lowest degree of similarity. In the context of Record Linkage, string similarities can improve the discernibility between matches and non-matches.
jarowinkler is an implementation of the algorithm by Jaro and Winkler (see references). For the meaning of W_1, W_2, W_3 and r, see the referenced article. For most applications, the default values are reasonable.
levenshteinDist returns the Levenshtein distance. Because this is an edit count rather than a value in \([0,1]\), it cannot be used directly as a string comparator.
levenshteinSim is a similarity function based on the Levenshtein distance, calculated by
\(1-\frac{\mathrm{d}(\mathit{str}_{1},\mathit{str}_{2})}{\max(A,B)}\), where \(\mathrm{d}\) is the Levenshtein distance
function and \(A\) and \(B\) are the lengths of the strings.
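The normalization can be illustrated with a minimal sketch in base R. It is not the package's implementation: the helper name lev_sim_sketch is hypothetical, and utils::adist stands in for the distance function \(\mathrm{d}\). The sketch handles a single pair of strings and assumes at least one of them is non-empty.

```r
# Hypothetical helper illustrating the formula 1 - d(str1, str2) / max(A, B).
# utils::adist (base R) is used here as a stand-in for the Levenshtein distance.
lev_sim_sketch <- function(str1, str2) {
  stopifnot(is.character(str1), is.character(str2),
            length(str1) == 1, length(str2) == 1)
  d <- as.numeric(utils::adist(str1, str2))      # edit distance of the pair
  longest <- max(nchar(str1), nchar(str2))       # max(A, B)
  1 - d / longest
}

lev_sim_sketch("kitten", "kitten")   # identical strings give 1
lev_sim_sketch("kitten", "sitting")  # distance 3, longest length 7
lev_sim_sketch("abc", "xyz")         # every character differs, giving 0
```

Dividing by the length of the longer string bounds the result to \([0,1]\), since the Levenshtein distance can never exceed that length.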
Arguments str1 and str2 are expected to be of type "character".
Non-alphabetical characters can be processed. Valid format combinations for
the arguments are: