Create Representative Records After Entity Resolution
An implementation of Kaplan, Betancourt, Steorts (2020) <arXiv:1810.01538> that creates representative records for use in downstream tasks after entity resolution is performed. Multiple methods for creating the representative records (data sets) are provided.
Create representative records for use in downstream tasks after record linkage is performed. Multiple methods for creating the records are provided, including two methods based on the posterior distribution of the linkage resulting from a Bayesian analysis.
devtools::install_github("cleanzr/representr", build_vignettes = TRUE)
This package implements the methods introduced in the following paper:
Kaplan, Andee, Brenda Betancourt, and Rebecca C. Steorts. "Posterior Prototyping: Bridging the Gap between Bayesian Record Linkage and Regression." arXiv preprint arXiv:1810.01538 (2020).
Record linkage (entity resolution or de-duplication) is used to join multiple databases to remove duplicate entities. While record linkage removes the duplicate entities from the data, many researchers are interested in performing inference, prediction, or post-linkage analysis on the linked data (e.g., regression or capture-recapture), which we call the downstream task. Depending on the downstream task, one may wish to find the most representative record before performing the post-linkage analysis. For example, when the values of features used in a downstream task differ for linked data, which values should be used? This is where representr comes in.
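To make the question of "which values should be used" concrete, here is a minimal base R sketch of two simple ways to collapse linked records into one record per entity: picking a record at random, and building a composite from column summaries. It is not the package's implementation. The toy data, the `cluster` column, and the helper names `mode_value` and `composite_record` are all assumptions made up for illustration.

```r
# Toy data: five records; `cluster` marks which records were linked to the
# same entity. All names and values here are hypothetical.
records <- data.frame(
  cluster = c(1, 1, 2, 3, 3),
  income  = c(50, 54, 61, 40, 44),
  sex     = c("F", "F", "M", "M", "F"),
  stringsAsFactors = FALSE
)

# Most frequent value; ties go to the value that sorts first,
# because table() orders its categories.
mode_value <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Composite record: mean of the numeric column, mode of the categorical one.
composite_record <- function(cl) {
  data.frame(
    cluster = cl$cluster[1],
    income  = mean(cl$income),
    sex     = mode_value(cl$sex),
    stringsAsFactors = FALSE
  )
}

by_cluster <- split(records, records$cluster)

# One composite record per linked entity.
composite <- do.call(rbind, lapply(by_cluster, composite_record))

# Alternative: keep one existing record per entity, chosen at random.
set.seed(1)
random_prototype <- do.call(
  rbind,
  lapply(by_cluster, function(cl) cl[sample(nrow(cl), 1), ])
)
```

The composite can contain values that no original record held (for example, an averaged income), while the random prototype always returns an observed record. Which behavior is preferable depends on the downstream task.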
The two main functions in representr are represent and pp_weights, which perform pointwise and fully Bayesian prototyping, respectively. Additionally, the function emp_kl_div aids in the evaluation of prototyping methods by estimating an empirical KL divergence. To read more about the specific prototyping functions available, see the help pages.
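The idea behind weights derived from a posterior distribution of the linkage can be sketched in base R. This is a simplified illustration that assumes the weight of a record is the share of posterior draws in which it is selected as its cluster's representative. The matrix of draws, the "first record in each cluster" rule, and the names `pick_prototypes` and `weights_sketch` are hypothetical stand-ins, not the package's API or algorithm.

```r
# Hypothetical posterior draws of the linkage: each row is one draw, each
# column is a record, and equal labels within a row mean "same entity".
posterior_linkage <- rbind(
  c(1, 1, 2, 3),
  c(1, 1, 2, 2),
  c(1, 2, 3, 3),
  c(1, 1, 2, 3)
)

# Stand-in prototyping rule (an assumption for this sketch): keep the first
# record of every cluster. Returns TRUE for the selected records.
pick_prototypes <- function(labels) !duplicated(labels)

# apply() returns one column per draw, so transpose to get draws in rows.
selected <- t(apply(posterior_linkage, 1, pick_prototypes))

# Share of draws in which each record was selected as a representative.
weights_sketch <- colMeans(selected)
```

Records that are rarely selected across draws get small weights, so uncertainty about the linkage carries into the downstream analysis instead of being fixed by a single point estimate.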
For more extensive documentation of the use of this package, please see the vignette.
This work was partially supported by the National Science Foundation through NSF-1652431 and NSF-1534412 and the Laboratory for Analytic Sciences at NC State University.
Functions in representr
| Name | Description |
|---|---|
| representr | representr: A package for creating representative records post-record linkage. |
| pp_weights | Get posterior weights for each record post record-linkage using posterior prototyping. |
| clust_composite | Composite record from a cluster using a weighted average of each column's values. |
| represent | Create a representative dataset post record-linkage. |
| rl_reg1 | 500 records suitable for record linkage with additional regression variables. |
| clust_proto_random | Prototype record from a cluster. |
| dist_binary | The distance between two records. |
| emp_kl_div | Calculate the empirical KL divergence for a representative dataset as compared to the true dataset. |
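As a rough illustration of comparing a representative dataset to the true dataset with an empirical KL divergence, the base R sketch below handles a single categorical variable. It is a simplification and not the estimator implemented in emp_kl_div. The direction KL(true || representative), the toy vectors, and the name `kl_sketch` are assumptions.

```r
# Hypothetical values of one categorical variable in the true data and in a
# representative dataset.
true_sex <- c("F", "F", "M", "M", "M", "F", "M", "F")
rep_sex  <- c("F", "M", "M", "M")

# Empirical KL divergence KL(p || q) between two categorical samples, where
# p comes from `truth` and q from `representative`.
kl_sketch <- function(truth, representative) {
  lv <- union(truth, representative)
  p <- as.numeric(table(factor(truth, levels = lv))) / length(truth)
  q <- as.numeric(table(factor(representative, levels = lv))) /
    length(representative)
  # Categories with p == 0 contribute zero; the result is Inf if q == 0
  # for a category that has p > 0.
  sum(ifelse(p > 0, p * log(p / q), 0))
}

kl_value <- kl_sketch(true_sex, rep_sex)
```

A value near zero means the representative dataset reproduces the distribution of that variable closely; larger values indicate more distortion.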
Packaged: 2020-10-20 18:52:58 UTC; andeek
Date/Publication: 2020-10-20 20:30:03 UTC