comparator (version 0.1.2)

Jaro: Jaro String/Sequence Comparator

Description

Compares a pair of strings/sequences x and y based on the number of greedily-aligned characters/sequence elements and the number of transpositions. It was developed for comparing names at the U.S. Census Bureau.

Usage

Jaro(similarity = TRUE, ignore_case = FALSE, use_bytes = FALSE)

Arguments

similarity

a logical. If TRUE (default), similarity scores are returned; otherwise, distances are returned (see the definition under Details).

ignore_case

a logical. If TRUE, case is ignored when comparing strings.

use_bytes

a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character.
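
The arguments above can be combined as in the following sketch. It assumes the comparator package is installed (the block is skipped otherwise) and uses only the constructor signature shown under Usage; the variable names are illustrative.

```r
# Assumption: the comparator package is available; skipped otherwise.
if (requireNamespace("comparator", quietly = TRUE)) {
  library(comparator)

  # Default settings: case-sensitive similarity score
  sim_default <- Jaro()("Martha", "MARTHA")

  # Ignore case, so the two strings are treated as identical
  sim_nocase <- Jaro(ignore_case = TRUE)("Martha", "MARTHA")

  # Return a distance instead of a similarity
  dist_nocase <- Jaro(similarity = FALSE, ignore_case = TRUE)("Martha", "MARTHA")

  print(c(sim_default, sim_nocase, dist_nocase))
}
```

With ignore_case = TRUE the two strings compare as equal, so the similarity should be 1 and the corresponding distance 0.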

Value

A Jaro instance is returned. Jaro is an S4 class inheriting from StringComparator.

Details

For simplicity, we assume x and y are strings in this section; however, the comparator is also implemented for more general sequences.

When similarity = TRUE (default), the Jaro similarity is computed as $$\mathrm{sim}(x, y) = \frac{1}{3}\left(\frac{m}{|x|} + \frac{m}{|y|} + \frac{m - \lfloor \frac{t}{2} \rfloor}{m}\right)$$ where \(m\) is the number of "matching" characters (defined below), \(t\) is the number of "transpositions", and \(|x|,|y|\) are the lengths of the strings \(x\) and \(y\). The similarity takes on values in the range \([0, 1]\), where 1 corresponds to a perfect match.

The number of "matching" characters \(m\) is computed using a greedy alignment algorithm. The algorithm iterates over the characters in \(x\), attempting to align the \(i\)-th character \(x_i\) with the first matching character in \(y\). When looking for matching characters in \(y\), the algorithm only considers previously un-matched characters within a window \([\max(0, i - w), \min(|y|, i + w)]\) where \(w = \left\lfloor \frac{\max(|x|, |y|)}{2} \right\rfloor - 1\). The alignment process yields a subsequence of matching characters from \(x\) and \(y\). The number of "transpositions" \(t\) is defined to be the number of positions in the subsequence of \(x\) which are misaligned with the corresponding position in \(y\).
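
The alignment and scoring described above can be sketched in base R as follows. This is an illustrative reimplementation, not the package's own code: the function name jaro_sketch is hypothetical, positions are 1-based (so the window is taken as \([\max(1, i - w), \min(|y|, i + w)]\)), the window half-width \(w\) is clamped at 0 for very short strings, and the handling of empty strings and of \(m = 0\) is an assumption.

```r
# Hypothetical helper: Jaro similarity following the definitions in Details.
jaro_sketch <- function(x, y) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  nx <- length(xs)
  ny <- length(ys)
  # Assumption: two empty strings are a perfect match, one empty string is not
  if (nx == 0 || ny == 0) return(as.numeric(nx == ny))

  # Window half-width; clamping at 0 is an assumption for very short strings
  w <- max(0, floor(max(nx, ny) / 2) - 1)

  x_matched <- logical(nx)
  y_matched <- logical(ny)
  for (i in seq_len(nx)) {
    lo <- max(1, i - w)
    hi <- min(ny, i + w)
    if (lo > hi) next
    for (j in lo:hi) {
      # Greedy step: take the first unmatched, equal character in the window
      if (!y_matched[j] && xs[i] == ys[j]) {
        x_matched[i] <- TRUE
        y_matched[j] <- TRUE
        break
      }
    }
  }

  m <- sum(x_matched)
  if (m == 0) return(0)  # assumption: no matching characters gives similarity 0

  # Positions where the two matched subsequences disagree
  n_transp <- sum(xs[x_matched] != ys[y_matched])
  (m / nx + m / ny + (m - floor(n_transp / 2)) / m) / 3
}

jaro_sketch("Martha", "Mathra")   # 17/18, about 0.944
jaro_sketch("Eileen", "Phyllis")  # 55/126, about 0.437
```

For "Eileen" and "Phyllis" only one character (the "l") is aligned within the window, which is why the score is low despite the similar lengths.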

When similarity = FALSE, the Jaro distance is computed as $$\mathrm{dist}(x,y) = 1 - \mathrm{sim}(x,y).$$
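
As a worked instance of these formulas (a hand calculation from the definitions, not output copied from the package), take \(x\) = "Martha" and \(y\) = "Mathra". Here \(|x| = |y| = 6\) and \(w = \lfloor 6/2 \rfloor - 1 = 2\). The greedy alignment matches all six characters, so \(m = 6\); the matched subsequences are M, a, r, t, h, a and M, a, t, h, r, a, which disagree in three positions, so \(t = 3\). Then $$\mathrm{sim}(x, y) = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6 - \lfloor \frac{3}{2} \rfloor}{6}\right) = \frac{17}{18} \approx 0.944, \qquad \mathrm{dist}(x, y) = 1 - \frac{17}{18} = \frac{1}{18} \approx 0.056.$$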

References

Jaro, M. A. (1989), "Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida", Journal of the American Statistical Association 84(406), 414-420.

See Also

The JaroWinkler comparator modifies the Jaro comparator by boosting the similarity score for strings/sequences that have matching prefixes.

Examples

# NOT RUN {
## Compare names
Jaro()("Martha", "Mathra")
Jaro()("Eileen", "Phyllis")

# }
