representr: representr: A package for creating representative records post-record linkage.

Description

The representr package provides two types of representative record generation: point prototyping and posterior prototyping.

Arguments

Point Prototyping

To bridge the gap between record linkage and a downstream task, there are three methods to choose or create the representative records from linked data: random prototyping, minimax prototyping, and composite. These are all based on a point estimate of the linkage structure post-record linkage ( rather than a posterior distribution).

Random prototyping chooses a record from each cluster at random, either uniformly or according to a supplied distribution. Minimax prototyping selects the record whose farthest neighbors within the cluster is closest, based on some notion of closeness that is measured by a record distance function. There are two distance functions included in this package (binary and column-based), or the user can specify their own. Composite record creation constructs the representative record by aggregating the records (in each cluster) to form a composite record that includes information from each linked record.

Each of these three types of prototyping can be used from the function represent.

Posterior prototyping

The posterior distribution of the linkage can be used in two ways in this package. The first, is as weights or in a distance function for the above point prototyping methods. The second, is through the posterior prototyping (PP) weights presented in Kaplan, Betancourt, and Steorts (2018+). The PP weights are accessible through the pp_weights function.

References

Kaplan, Andee, Brenda Betancourt, and Rebecca C. Steorts. "Posterior Prototyping: Bridging the Gap between Bayesian Record Linkage and Regression." arXiv preprint arXiv:1810.01538 (2018).