Overview
Description
This R package is designed to block records for data deduplication and
record linkage (also known as entity resolution) using approximate
nearest neighbours algorithms
(ANN) and graphs
(via the igraph package).
It supports the following R packages that bind to specific ANN algorithms:
- rnndescent (default, very powerful, supports sparse matrices),
- RcppHNSW (powerful but does not support sparse matrices),
- RcppAnnoy,
- mlpack (see
mlpack::lshandmlpack::knn).
The package can be used with the
reclin2 package via the
blocking::pair_ann function.
Installation
Install the GitHub blocking package with:
# install.packages("remotes") # uncomment if needed
remotes::install_github("ncn-foreigners/blocking")Basic usage
Load packages for the examples
library(blocking)
library(reclin2)
#> Ładowanie wymaganego pakietu: data.table
#> data.table 1.17.0 using 6 threads (see ?getDTthreads). Latest news: r-datatable.comGenerate simple data with three groups (df_example) and reference data
(df_base).
df_example <- data.frame(txt = c(
"jankowalski",
"kowalskijan",
"kowalskimjan",
"kowaljan",
"montypython",
"pythonmonty",
"cyrkmontypython",
"monty"
))
df_base <- data.frame(txt = c("montypython", "kowalskijan", "other"))
df_example
#> txt
#> 1 jankowalski
#> 2 kowalskijan
#> 3 kowalskimjan
#> 4 kowaljan
#> 5 montypython
#> 6 pythonmonty
#> 7 cyrkmontypython
#> 8 monty
df_base
#> txt
#> 1 montypython
#> 2 kowalskijan
#> 3 otherDeduplication using the blocking function. Output contains
information:
- the method used (where
nndwhich refers to the NN descent algorithm), - number of blocks created (here 2 blocks),
- number of columns used for blocking, i.e. how many shingles were
created by
text2vecpackage (here 28), - reduction ratio, i.e. how large is the reduction of comparison pairs (here 0.5714 which means blocking reduces comparison by over 57%).
blocking_result <- blocking(x = df_example$txt)
blocking_result
#> ========================================================
#> Blocking based on the nnd method.
#> Number of blocks: 2.
#> Number of columns used for blocking: 28.
#> Reduction ratio: 0.5714.
#> ========================================================
#> Distribution of the size of the blocks:
#> 4
#> 2Table with blocking results contains:
- row numbers from the original data,
- block number (integers),
- distance (from the ANN algorithm).
blocking_result$result
#> x y block dist
#> <int> <int> <num> <num>
#> 1: 1 2 1 0.10000002
#> 2: 2 3 1 0.14188367
#> 3: 2 4 1 0.28286284
#> 4: 5 6 2 0.08333331
#> 5: 5 7 2 0.13397455
#> 6: 5 8 2 0.27831215Deduplication using the pair_ann function for integration with the
reclin2 package. Use the pipeline with the reclin2 package.
pair_ann(x = df_example, on = "txt") |>
compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
score_simple("score", on = "txt") |>
select_threshold("threshold", score = "score", threshold = 0.55) |>
link(selection = "threshold")
#> Total number of pairs: 8 pairs
#>
#> Key: <.y>
#> .y .x txt.x txt.y
#> <int> <int> <char> <char>
#> 1: 2 1 jankowalski kowalskijan
#> 2: 3 1 jankowalski kowalskimjan
#> 3: 3 2 kowalskijan kowalskimjan
#> 4: 4 1 jankowalski kowaljan
#> 5: 4 2 kowalskijan kowaljan
#> 6: 6 5 montypython pythonmonty
#> 7: 7 5 montypython cyrkmontypython
#> 8: 8 5 montypython montyLinking records using the same function where df_base is the
“register” and df_example is the reference (data).
pair_ann(x = df_base, y = df_example, on = "txt", deduplication = FALSE) |>
compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
score_simple("score", on = "txt") |>
select_threshold("threshold", score = "score", threshold = 0.55) |>
link(selection = "threshold")
#> Total number of pairs: 8 pairs
#>
#> Key: <.y>
#> .y .x txt.x txt.y
#> <int> <int> <char> <char>
#> 1: 1 2 kowalskijan jankowalski
#> 2: 2 2 kowalskijan kowalskijan
#> 3: 3 2 kowalskijan kowalskimjan
#> 4: 4 2 kowalskijan kowaljan
#> 5: 5 1 montypython montypython
#> 6: 6 1 montypython pythonmonty
#> 7: 7 1 montypython cyrkmontypython
#> 8: 8 1 montypython montySee also
See section Data Integration (Statistical Matching and Record Linkage)
in the Official Statistics Task
View.
Packages that allow blocking:
- klsh – k-means locality sensitive hashing,
- reclin2 –
pair_blocking,pari_minsimfunctions, - fastLink –
blockDatafunction.
Other:
- clevr – evaluation of clustering, helper functions.
- exchanger – bayesian Entity Resolution with Exchangeable Random Partition Priors
Funding
Work on this package is supported by the National Science Centre, OPUS 20 grant no. 2020/39/B/HS4/00941.