FuzzyTokenSet: Fuzzy Token Set Comparator

Description

Compares a pair of token sets \(x\) and \(y\) by computing the minimum cost of transforming \(x\) into \(y\) using single-token operations (insertions, deletions and substitutions). The cost of each single-token operation is determined at the character level using an internal string comparator.

Usage

FuzzyTokenSet(
  inner_comparator = Levenshtein(normalize = TRUE),
  agg_function = base::mean,
  deletion = 1,
  insertion = 1,
  substitution = 1
)

Arguments

inner_comparator

inner string distance comparator of class StringComparator. Defaults to normalized Levenshtein distance.

agg_function

function used to aggregate the costs of the optimal operations. Defaults to base::mean.

deletion

positive weight associated with deletion of a token. Defaults to unit cost.

insertion

positive weight associated with insertion of a token. Defaults to unit cost.

substitution

positive weight associated with substitution of a token. Defaults to unit cost.

Details

A token set is an unordered enumeration of tokens, which may include duplicates. Given two token sets \(x\) and \(y\), this comparator computes the minimum cost of transforming \(x\) into \(y\) using the following single-token operations:

deleting a token \(a\) from \(x\) at cost \(w_d \times \mathrm{inner}(a, "")\)
inserting a token \(b\) in \(y\) at cost \(w_i \times \mathrm{inner}("", b)\)
substituting a token \(a\) in \(x\) for a token \(b\) in \(y\) at cost \(w_s \times \mathrm{inner}(a, b)\)

where \(\mathrm{inner}\) is an internal string distance comparator and \(w_d, w_i, w_s\) are positive weights, referred to as deletion, insertion and substitution in the parameter list. By default, the mean cost of the optimal (cost-minimizing) set of operations is returned. Other methods of aggregating the costs of the optimal operations are supported by specifying a non-default agg_function.
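
The three operation costs can be combined into a single objective. The notation below is introduced only for illustration and is not part of the package: \(M\) is the set of substituted pairs, \(D\) the tokens of \(x\) that are deleted and \(I\) the tokens of \(y\) that are inserted.

\[
\mathrm{cost}(M, D, I) = w_s \sum_{(a, b) \in M} \mathrm{inner}(a, b) + w_d \sum_{a \in D} \mathrm{inner}(a, "") + w_i \sum_{b \in I} \mathrm{inner}("", b)
\]

Each token of \(x\) appears in exactly one of \(M\) or \(D\), and each token of \(y\) in exactly one of \(M\) or \(I\). The comparator minimizes this total, then applies agg_function to the individual terms of the minimizing solution.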

The optimization problem (minimizing the total cost under the allowed operations) is solved exactly using a linear sum assignment solver.
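
To make the reduction to an assignment problem concrete, here is a minimal base-R sketch. It is not the package's implementation and rests on several assumptions: the helper names permutations and fuzzy_token_cost are hypothetical, raw Levenshtein distance from utils::adist stands in for the inner comparator, each token set is padded with empty slots so that deletions and insertions become ordinary assignments, and brute-force enumeration replaces a real linear sum assignment solver (so it is only usable for very small token sets).

```r
# Hypothetical helper: every ordering of the vector v (brute force).
permutations <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in permutations(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# Hypothetical sketch of the token-set cost; x and y are character vectors.
fuzzy_token_cost <- function(x, y, inner, w_d = 1, w_i = 1, w_s = 1,
                             agg = mean) {
  n <- length(x)
  m <- length(y)
  size <- n + m
  # Rows: tokens of x, then m empty slots. Columns: tokens of y, then n
  # empty slots. Empty-to-empty pairings cost nothing.
  cost <- matrix(0, size, size)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) cost[i, j] <- w_s * inner(x[i], y[j])  # substitution
    cost[i, m + seq_len(n)] <- w_d * inner(x[i], "")             # deletion
  }
  for (j in seq_len(m)) {
    cost[n + seq_len(m), j] <- w_i * inner("", y[j])             # insertion
  }
  best <- NULL
  for (p in permutations(seq_len(size))) {
    idx <- cbind(seq_len(size), p)
    # Keep only real operations; drop empty-to-empty pairings.
    real <- idx[, 1] <= n | idx[, 2] <= m
    ops <- cost[idx][real]
    if (is.null(best) || sum(ops) < sum(best)) best <- ops
  }
  agg(best)  # aggregate the costs of the optimal operations
}

# Stand-in inner distance (assumption): unnormalized Levenshtein distance.
inner <- function(a, b) as.numeric(utils::adist(a, b))

fullname <- c("JOSE", "ELIAS", "BASQUES")
name <- c("JOSE", "BASQUES")
forward <- fuzzy_token_cost(fullname, name, inner, w_d = 0.5)
reverse <- fuzzy_token_cost(name, fullname, inner, w_d = 0.5)
print(forward < reverse)
```

In this sketch the forward direction only needs to delete "ELIAS" at half weight, while the reverse direction must insert it at full weight, so the comparison is asymmetric. Passing agg = sum returns the total cost instead of the mean.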

Examples

# NOT RUN {
## Compare names with heterogeneous representations
x <- "The University of California - San Diego"
y <- "Univ. Calif. San Diego"
# Tokenize strings on white space
x <- strsplit(x, '\\s+')
y <- strsplit(y, '\\s+')
FuzzyTokenSet()(x, y)
# Reduce the cost associated with missing words
FuzzyTokenSet(deletion = 0.5, insertion = 0.5)(x, y)

## Compare full name with abbreviated name, reducing the penalty 
## for dropping parts of the name
fullname <- "JOSE ELIAS TEJADA BASQUES"
name <- "JOSE BASQUES"
# Tokenize strings on white space
fullname <- strsplit(fullname, '\\s+')
name <- strsplit(name, '\\s+')
comparator <- FuzzyTokenSet(deletion = 0.5)
comparator(fullname, name) < comparator(name, fullname) # TRUE

# }
