comparator (version 0.1.2)

JaroWinkler: Jaro-Winkler String/Sequence Comparator

Description

The Jaro-Winkler comparator is a variant of the Jaro comparator which boosts the similarity score for strings/sequences with matching prefixes. It was developed for comparing names at the U.S. Census Bureau.

Usage

JaroWinkler(
  p = 0.1,
  threshold = 0.7,
  max_prefix = 4L,
  similarity = TRUE,
  ignore_case = FALSE,
  use_bytes = FALSE
)

Arguments

p

a non-negative numeric scalar no larger than 1/max_prefix. The boost applied to eligible similarity scores is proportional to this factor (see Details). Defaults to 0.1.

threshold

a numeric scalar on the unit interval. Jaro similarities greater than this value are boosted based on matching characters in the prefixes of both strings. Jaro similarities at or below this value are returned unadjusted. Defaults to 0.7.

max_prefix

a non-negative integer scalar, specifying the size of the prefix to consider for boosting. Defaults to 4 (characters).

similarity

a logical. If TRUE, similarity scores are returned (default), otherwise distances are returned (see definition under Details).

ignore_case

a logical. If TRUE, case is ignored when comparing strings.

use_bytes

a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character.
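The difference between byte-wise and character-wise comparison matters for strings containing multi-byte characters. The following base R sketch only illustrates that distinction; it does not use the comparator package and makes no claim about how the package handles encodings internally.

```r
# A string containing one multi-byte character (e with acute accent, U+00E9)
x <- "caf\u00e9"

# Character-by-character view: 4 characters
nchar(x, type = "chars")

# Byte-by-byte view: 5 bytes, since U+00E9 occupies two bytes in UTF-8
nchar(x, type = "bytes")

# The underlying bytes, as a raw vector
charToRaw(x)
```

Two strings that differ in a single accented character therefore differ in one position when compared by character, but may differ in more than one position when compared by byte.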

Value

A JaroWinkler instance is returned, which is an S4 class inheriting from StringComparator.

Details

For simplicity, this section assumes that x and y are strings; however, the comparator is also implemented for more general sequences.

The Jaro-Winkler similarity (computed when similarity = TRUE) is defined in terms of the Jaro similarity. If the Jaro similarity \(sim_J(x,y)\) between strings \(x\) and \(y\) exceeds a user-specified threshold \(0 \leq \tau \leq 1\), the similarity score is boosted in proportion to the number of matching characters in the prefixes of \(x\) and \(y\). More precisely, the Jaro-Winkler similarity is defined as: $$\mathrm{sim}_{JW}(x, y) = \mathrm{sim}_J(x, y) + \min(c(x, y), l) p (1 - \mathrm{sim}_J(x, y)),$$ where \(c(x,y)\) is the length of the common prefix, \(l \geq 0\) is a user-specified upper bound on the prefix size, and \(0 \leq p \leq 1/l\) is a scaling factor.

The Jaro-Winkler distance is computed when similarity = FALSE and is defined as $$\mathrm{dist}_{JW}(x, y) = 1 - \mathrm{sim}_{JW}(x, y).$$
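The definitions above can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: `common_prefix_len` and `jw_from_jaro` are hypothetical helper names, and the Jaro similarity is taken as a given input (an assumed value of 0.9 or 0.6) rather than computed.

```r
# Hypothetical helpers for illustration; not part of the comparator package.

# Length of the common prefix of two strings, capped at max_prefix.
common_prefix_len <- function(x, y, max_prefix = 4L) {
  xc <- strsplit(x, "")[[1]]
  yc <- strsplit(y, "")[[1]]
  n <- min(length(xc), length(yc), max_prefix)
  idx <- seq_len(n)
  mismatch <- which(xc[idx] != yc[idx])
  if (length(mismatch) > 0) mismatch[1] - 1L else n
}

# Jaro-Winkler score from a precomputed Jaro similarity sim_j.
jw_from_jaro <- function(sim_j, x, y, p = 0.1, threshold = 0.7,
                         max_prefix = 4L, similarity = TRUE) {
  sim <- sim_j
  if (sim_j > threshold) {
    # Boost in proportion to the (capped) common prefix length
    sim <- sim_j + common_prefix_len(x, y, max_prefix) * p * (1 - sim_j)
  }
  if (similarity) sim else 1 - sim
}

# Assumed Jaro similarities (illustrative inputs, not computed here)
jw_from_jaro(0.9, "Martha", "Mathra")                      # 0.9 + 2 * 0.1 * 0.1
jw_from_jaro(0.9, "Martha", "Mathra", similarity = FALSE)  # one minus the above
jw_from_jaro(0.6, "Martha", "Mathra")                      # not above threshold: unchanged
```

Here "Martha" and "Mathra" share the prefix "Ma", so the boost uses a prefix length of 2. Because p is at most 1/max_prefix, the boosted similarity cannot exceed 1.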

References

Jaro, M. A. (1989), "Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida", Journal of the American Statistical Association 84(406), 414-420.

Winkler, W. E. (2006), "Overview of Record Linkage and Current Research Directions", Tech. report. Statistics #2006-2. Statistical Research Division, U.S. Census Bureau.

Winkler, W., McLaughlin, G., Jaro, M. and Lynch, M. (1994), strcmp95.c, Version 2. United States Census Bureau.

See Also

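This comparator reduces to the Jaro comparator when max_prefix = 0L or p = 0 (the boost term vanishes), or when threshold = 1.0 (no Jaro similarity exceeds the threshold, so nothing is boosted).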

Examples

# NOT RUN {
## Compare names
JaroWinkler()("Martha", "Mathra")
JaroWinkler()("Eileen", "Phyllis")

## Reduce the threshold for boosting
x <- "Matthew"
y <- "Martin"
JaroWinkler()(x, y) < JaroWinkler(threshold = 0.5)(x, y)

# }
