Function for integration with the reclin2 package. It is based on reclin2's pair_minsim function and reuses some of its source code.
pair_ann(
  x,
  y = NULL,
  on,
  deduplication = TRUE,
  keep_block = TRUE,
  add_xy = TRUE,
  ...
)
Returns a data.table with two columns, .x and .y, which hold row numbers from the data.frames x and y respectively. The returned data.table also has the class pairs, which allows for integration with the compare_pairs function.
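To make the return value concrete, the sketch below builds pairs from the two reclin2 example datasets and inspects the result. It assumes that pair_ann is exported by a package named blocking (the package is not named on this page) and that both blocking and reclin2 are installed. The helper make_key and the variable pairs are illustrative names, not part of either package.

```r
# Sketch only: assumes pair_ann() is exported by the 'blocking' package.
pairs <- NULL
if (requireNamespace("blocking", quietly = TRUE) &&
    requireNamespace("reclin2", quietly = TRUE)) {
  data("linkexample1", "linkexample2", package = "reclin2")

  # Hypothetical helper: one whitespace-free, lower-case text key per record
  make_key <- function(d) {
    key <- tolower(paste0(d$firstname, d$lastname, d$address, d$sex, d$postcode))
    gsub("\\s+", "", key)
  }
  linkexample1$txt <- make_key(linkexample1)
  linkexample2$txt <- make_key(linkexample2)

  pairs <- blocking::pair_ann(x = linkexample1, y = linkexample2,
                              on = "txt", deduplication = FALSE)

  print(class(pairs))  # expected to include "pairs" and "data.table"
  print(names(pairs))  # expected to include ".x" and ".y"
  print(head(pairs))   # each row is one candidate pair of row numbers
}
```

Because .x and .y are row numbers rather than record identifiers, they can be used to index the original data, for example linkexample1[pairs$.x, ].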
x: reference data (a data.frame or a data.table),
y: query data (a data.frame or a data.table, default NULL),
on: a character with column name or a character vector with column names for the ANN search,
deduplication: whether deduplication should be performed (default TRUE),
keep_block: whether to keep the block variable in the set (default TRUE),
add_xy: whether to add x and y (default TRUE),
...: arguments passed to the blocking function.
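As a rough illustration of how these arguments combine, the sketch below pairs records within a single dataset by leaving y at its default of NULL and keeping deduplication = TRUE. It assumes that pair_ann is exported by a package named blocking, that both blocking and reclin2 are installed, and that .x and .y then both refer to rows of x. The name dedup_pairs is illustrative.

```r
# Sketch only: assumes pair_ann() is exported by the 'blocking' package.
dedup_pairs <- NULL
if (requireNamespace("blocking", quietly = TRUE) &&
    requireNamespace("reclin2", quietly = TRUE)) {
  data("linkexample1", package = "reclin2")

  # Single text column used for the ANN search
  linkexample1$txt <- with(
    linkexample1,
    gsub("\\s+", "", tolower(paste0(firstname, lastname, address, sex, postcode)))
  )

  # y is left as NULL, so candidate duplicates are searched for within x
  dedup_pairs <- blocking::pair_ann(
    x = linkexample1,
    on = "txt",
    deduplication = TRUE,
    keep_block = TRUE,
    add_xy = TRUE
  )
  print(names(dedup_pairs))
  print(dedup_pairs)
}
```

Any further arguments supplied through ... are forwarded to the blocking function, so tuning of the search itself would go there rather than into pair_ann's own parameters.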
Maciej Beręsewicz
# example using two datasets from reclin2
if (requireNamespace("reclin2", quietly = TRUE)) {
  library(reclin2)
  data("linkexample1", "linkexample2", package = "reclin2")

  # build a single lower-case, whitespace-free text key per record
  linkexample1$txt <- with(linkexample1, tolower(paste0(firstname, lastname, address, sex, postcode)))
  linkexample1$txt <- gsub("\\s+", "", linkexample1$txt)
  linkexample2$txt <- with(linkexample2, tolower(paste0(firstname, lastname, address, sex, postcode)))
  linkexample2$txt <- gsub("\\s+", "", linkexample2$txt)

  # pairing records from linkexample2 to linkexample1 based on txt column
  pair_ann(x = linkexample1, y = linkexample2, on = "txt", deduplication = FALSE) |>
    compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
    score_simple("score", on = "txt") |>
    select_threshold("threshold", score = "score", threshold = 0.75) |>
    link(selection = "threshold")
}