Learn R Programming

⚠️There's a newer version (0.1.3) of this package.Take me there.

comparator: Comparison Functions for Clustering and Record Linkage

comparator implements comparison functions for clustering and record linkage applications. It includes functions for comparing strings, sequences and numeric vectors. Where possible, comparators are implemented in C/C++ to ensure fast performance.

Supported comparators

String comparators:

Edit-based:

  • Levenshtein(): Levenshtein distance/similarity
  • DamerauLevenshtein() Damerau-Levenshtein distance/similarity
  • Hamming(): Hamming distance/similarity
  • OSA(): Optimal String Alignment distance/similarity
  • LCS(): Longest Common Subsequence distance/similarity
  • Jaro(): Jaro distance/similarity
  • JaroWinkler(): Jaro-Winkler distance/similarity

Token-based:

Not yet implemented.

Hybrid token-character:

  • MongeElkan(): Monge-Elkan similarity
  • FuzzyTokenSet(): Fuzzy Token Set distance

Other:

  • InVocabulary(): Compares strings using a reference vocabulary. Useful for comparing names.
  • Lookup(): Retrieves distances/similarities from a lookup table
  • BinaryComp(): Compares strings based on whether they agree/disagree exactly.

Numeric comparators:

  • Euclidean(): Euclidean (L-2) distance
  • Manhattan(): Manhattan (L-1) distance
  • Chebyshev(): Chebyshev (L-∞) distance
  • Minkowski(): Minkowski (L-p) distance

Installation

You can install the latest release from CRAN by entering:

install.packages("comparator")

The development version can be installed from GitHub using devtools:

# install.packages("devtools")
devtools::install_github("ngmarchant/comparator")

Example

A comparator is instantiated by calling its constructor function. For example, we can instantiate a Levenshtein similarity comparator that ignores differences in upper/lowercase characters as follows:

comparator <- Levenshtein(similarity = TRUE, normalize = TRUE, ignore_case = TRUE)

We can apply the comparator to character vectors element-wise as follows:

x <- c("John Doe", "Jane Doe")
y <- c("jonathon doe", "jane doe")
elementwise(comparator, x, y)
#> [1] 0.6666667 1.0000000

# shorthand for above
comparator(x, y)
#> [1] 0.6666667 1.0000000

This comparator is also defined on sequences:

x_seq <- list(c(1, 2, 1, 1), c(1, 2, 3, 4))
y_seq <- list(c(4, 3, 2, 1), c(1, 2, 3, 1))
elementwise(comparator, x_seq, y_seq)
#> [1] 0.4545455 0.7777778

# shorthand for above
comparator(x_seq, y_seq)
#> [1] 0.4545455 0.7777778

Pairwise comparisons are also supported using the following syntax:

# compare each string in x with each string in y and return a similarity matrix
pairwise(comparator, x, y, return_matrix = TRUE)
#>           [,1]      [,2]
#> [1,] 0.6666667 0.6842105
#> [2,] 0.5384615 1.0000000

# compare the strings in x pairwise and return a similarity matrix
pairwise(comparator, x, return_matrix = TRUE)
#>           [,1]      [,2]
#> [1,] 1.0000000 0.6842105
#> [2,] 0.6842105 1.0000000

Copy Link

Version

Install

install.packages('comparator')

Monthly Downloads

805

Version

0.1.2

License

GPL (>= 2)

Issues

Pull Requests

Stars

Forks

Maintainer

Neil Marchant

Last Published

March 16th, 2022

Functions in comparator (0.1.2)

FuzzyTokenSet

Fuzzy Token Set Comparator
Chebyshev

Chebyshev Numeric Comparator
BinaryComp

Binary String/Sequence Comparator
Hamming

Hamming String/Sequence Comparator
InVocabulary

In-Vocabulary Comparator
DamerauLevenshtein

Damerau-Levenshtein String/Sequence Comparator
CppSeqComparator-class

Virtual Class for a Sequence Comparator with a C++ Implementation
MongeElkan

Monge-Elkan Token Comparator
Minkowski

Minkowski Numeric Comparator
Lookup

Lookup String Comparator
Comparator-class

Virtual Comparator Class
Manhattan

Manhattan Numeric Comparator
Levenshtein

Levenshtein String/Sequence Comparator
LCS

Longest Common Subsequence (LCS) Comparator
pairwise

Pairwise Similarity/Distance Matrix
comparator-package

comparator: Comparison Functions for Clustering and Record Linkage
NumericComparator-class

Virtual Numeric Comparator Class
elementwise

Elementwise Similarity/Distance Vector
Constant

Constant String/Sequence Comparator
OSA

Optimal String Alignment (OSA) String/Sequence Comparator
hmean

Harmonic Mean
gmean

Geometric Mean
PairwiseMatrix-class

Pairwise Similarity/Distance Matrix
SequenceComparator-class

Virtual Sequence Comparator Class
Jaro

Jaro String/Sequence Comparator
JaroWinkler

Jaro-Winkler String/Sequence Comparator
TokenComparator-class

Virtual Token Comparator Class
StringComparator-class

Virtual String Comparator Class
Euclidean

Euclidean Numeric Comparator