relabel_bayes_estimate: Relabel the Bayes Estimate of a Partition

Description

Relabel the Bayes estimate of a partition, for use after using indexing to reduce the number of record pairs that are potential matches.

Usage

relabel_bayes_estimate(reduced_comparison_list, bayes_estimate)

Value

A data.frame, with as many rows as sum(reduced_comparison_list$file_sizes + reduced_comparison_list$file_sizes_not_included), i.e. the number of records originally input to create_comparison_data, before indexing occurred. This data.frame has two columns, "original_labels" and "link_id". Given row i of records originally input to create_comparison_data, the linkage id according to bayes_estimate is given by the ith row of the link_id column. See the documentation for find_bayes_estimate for information on how to interpret this linkage id.

Arguments

reduced_comparison_list: The output from a call to reduce_comparison_data.
bayes_estimate: The output from a call to find_bayes_estimate.

Details

When the function reduce_comparison_data is used to reduce the number of record pairs that are potential matches, it may be the case that some records are declared to not be potential matches to any other records. In this case, the indexing method has made the decision that these records have no matches, and thus we can remove them from the data set and relabel the remaining records; see the documentation for labels in reduce_comparison_data for information on how to go between the original labeling and the new labeling. The purpose of this function is to relabel the output of find_bayes_estimate when the function reduce_comparison_data is used, so that the user doesn't have to do this relabeling themselves.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [tools:::Rd_expr_doi("https://doi.org/10.1080/01621459.2021.2013242")][arXiv]

Examples

Run this code

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = dup_data_small$file_sizes,
 duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
 (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
 pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
 flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
 dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
 dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform",
 n_prior_pars = NA)

# Run the Gibbs sampler
{
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
 seed = 42)

# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
 L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
 L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)

# Relabel the full and partial Bayes estimates
full_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
 full_estimate)

partial_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
 partial_estimate)

# Add columns to the records corresponding to their full and partial
# Bayes estimates
dup_data_small$records <- cbind(dup_data_small$records,
 full_estimate_id = full_estimate_relabel$link_id,
 partial_estimate_id = partial_estimate_relabel$link_id)
}

Run the code above in your browser using DataLab