A dataset containing 867 simulated records from 3 files, with
duplicate records in each file.
Usage
dup_data
Format
A list with three elements:
records
A data.frame with the records, containing 7
fields, from all three files, in the format used for input to
create_comparison_data.
file_sizes
The size of each file.
IDs
The true partition of the records, represented as an
integer vector of arbitrary labels of length
sum(file_sizes).
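The relationship between the three elements can be checked with a short sketch. It does not use the real dataset: toy is a hypothetical stand-in with the same three-element layout, and check_layout is an illustrative helper, not a function from any package. The sketch also assumes that records are stacked in file order (all rows of file 1 first, then file 2, and so on), which the description above does not state explicitly.

```r
# Hypothetical toy object mirroring the documented layout:
# records (data.frame), file_sizes, and IDs (true partition labels).
toy <- list(
  records = data.frame(
    first_name = c("ana", "anna", "bob", "bob", "cara"),
    age = c(31L, 31L, 45L, 45L, 28L),
    stringsAsFactors = FALSE
  ),
  file_sizes = c(2L, 2L, 1L),
  IDs = c(1L, 1L, 2L, 2L, 3L)
)

# Illustrative helper (not part of any package): verifies that the
# three elements agree with each other and summarises the partition.
check_layout <- function(x) {
  n <- sum(x$file_sizes)

  # One row and one label per record across all files.
  stopifnot(nrow(x$records) == n, length(x$IDs) == n)

  # ASSUMPTION: rows are stacked in file order, so the file of each
  # record can be recovered from file_sizes alone.
  file_of_record <- rep(seq_along(x$file_sizes), times = x$file_sizes)

  # A file contains duplicates if any label repeats within it.
  dup_within_file <- tapply(
    x$IDs, file_of_record,
    function(ids) anyDuplicated(ids) > 0
  )

  list(
    n_records = n,
    n_entities = length(unique(x$IDs)),
    file_of_record = file_of_record,
    dup_within_file = dup_within_file
  )
}

res <- check_layout(toy)
```

If the package that ships this dataset is attached (assumed here to be the one that also provides create_comparison_data), the same helper could be applied as check_layout(dup_data) to confirm that length(IDs) matches sum(file_sizes).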
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the
American Statistical Association. https://doi.org/10.1080/01621459.2021.2013242