Jaro: Jaro String/Sequence Comparator

Description

Compares a pair of strings/sequences x and y based on the number of greedily-aligned characters/sequence elements and the number of transpositions. It was developed for comparing names at the U.S. Census Bureau.

Usage

Jaro(similarity = TRUE, ignore_case = FALSE, use_bytes = FALSE)

Value

A Jaro instance is returned. Jaro is an S4 class inheriting from StringComparator.

Arguments

similarity: a logical. If TRUE (default), similarity scores are returned; otherwise distances are returned (see definition under Details).
ignore_case: a logical. If TRUE, case is ignored when comparing strings.
use_bytes: a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character.

Details

For simplicity, we assume x and y are strings in this section; however, the comparator is also implemented for more general sequences.

When similarity = TRUE (default), the Jaro similarity is computed as $$\mathrm{sim}(x, y) = \frac{1}{3}\left(\frac{m}{|x|} + \frac{m}{|y|} + \frac{m - \lfloor \frac{t}{2} \rfloor}{m}\right)$$ where $m$ is the number of "matching" characters (defined below), $t$ is the number of "transpositions", and $|x|,|y|$ are the lengths of the strings $x$ and $y$. The similarity takes on values in the range $[0, 1]$, where 1 corresponds to a perfect match.

The number of "matching" characters $m$ is computed using a greedy alignment algorithm. The algorithm iterates over the characters in $x$, attempting to align the $i$-th character $x_i$ with the first matching character in $y$. When looking for matching characters in $y$, the algorithm only considers previously un-matched characters within a window $[\max(0, i - w), \min(|y|, i + w)]$ where $w = \left\lfloor \frac{\max(|x|, |y|)}{2} \right\rfloor - 1$. The alignment process yields a subsequence of matching characters from $x$ and $y$. The number of "transpositions" $t$ is defined to be the number of positions in the subsequence of $x$ which are misaligned with the corresponding position in $y$.
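As a worked illustration, computed by hand from the definitions above rather than taken from package output, consider $x = \text{Martha}$ and $y = \text{Mathra}$. Both strings have length 6, so $w = \lfloor 6/2 \rfloor - 1 = 2$. The greedy alignment matches every character of $x$ to a character of $y$ within the window, giving $m = 6$. The matched subsequences are "Martha" and "Mathra", which differ at three positions (r/t, t/h and h/r), so $t = 3$ and $\lfloor t/2 \rfloor = 1$. The similarity is then $$\mathrm{sim}(x, y) = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6 - 1}{6}\right) = \frac{17}{18} \approx 0.944.$$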

When similarity = FALSE, the Jaro distance is computed as $$\mathrm{dist}(x,y) = 1 - \mathrm{sim}(x,y).$$
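The definitions in this section can be sketched in plain R as follows. This is a minimal illustration, not the package's implementation: the function name jaro_sketch is hypothetical, the window is shifted to R's 1-based indexing, and two edge-case choices are assumptions that the text above does not specify (the window half-width $w$ is clamped at 0, and the similarity is taken to be 0 when no characters match).

```r
# Illustrative sketch of the Jaro similarity (hypothetical helper, not part
# of the package). Works on single strings, character by character.
jaro_sketch <- function(x, y) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  nx <- length(xs)
  ny <- length(ys)

  # Window half-width; clamping at 0 is an assumption for very short strings
  w <- max(floor(max(nx, ny) / 2) - 1, 0)

  x_matched <- logical(nx)
  y_matched <- logical(ny)

  # Greedy alignment: match each x[i] to the first unmatched equal
  # character of y inside the window around position i
  for (i in seq_len(nx)) {
    lo <- max(1, i - w)
    hi <- min(ny, i + w)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!y_matched[j] && xs[i] == ys[j]) {
        x_matched[i] <- TRUE
        y_matched[j] <- TRUE
        break
      }
    }
  }

  m <- sum(x_matched)
  if (m == 0) return(0)  # assumption: no matches means similarity 0

  # Transpositions: positions where the two matched subsequences disagree
  t <- sum(xs[x_matched] != ys[y_matched])

  (m / nx + m / ny + (m - floor(t / 2)) / m) / 3
}

jaro_sketch("Martha", "Mathra")
jaro_sketch("Eileen", "Phyllis")
```

Under this sketch, the corresponding distance is simply 1 - jaro_sketch(x, y).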

References

Jaro, M. A. (1989), "Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida", Journal of the American Statistical Association 84(406), 414-420.

Examples


## Compare names
Jaro()("Martha", "Mathra")
Jaro()("Eileen", "Phyllis")
