gibbs_sampler: Gibbs Sampler for Posterior Inference

Description

Run a Gibbs sampler to explore the posterior distribution of partitions of records.

Usage

gibbs_sampler(
  comparison_list,
  prior_list,
  n_iter = 2000,
  Z_init = 1:sum(comparison_list$file_sizes),
  seed = 70,
  single_likelihood = FALSE,
  chaperones_info = NA,
  verbose = TRUE
)

Value

A list containing:

m: Posterior samples of the m parameters. Each column is one sample.
u: Posterior samples of the u parameters. Each column is one sample.
partitions: Posterior samples of the partition. Each column is one sample. Note that the partition is represented as an integer vector of arbitrary labels of length sum(comparison_list$file_sizes).
contingency_tables: Posterior samples of the overlap table. Each column is one sample. This incorporates counts of records determined not to be candidate matches to any other records using reduce_comparison_data.
cluster_sizes: Posterior samples of the size of each cluster (associated with an arbitrary label from 1 to sum(comparison_list$file_sizes)). Each column is one sample.
sampling_time: The time in seconds it took to run the sampler.
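Because partition labels are arbitrary, summaries of the partitions element should depend only on which records share a label. The following base R sketch uses a small made-up matrix (not output from gibbs_sampler) laid out as described above, with one column per posterior sample. The tabulate-based reconstruction of per-label cluster sizes is one reading of the cluster_sizes description, and same_partition is a hypothetical helper written for this illustration.

```r
# Toy stand-in for results$partitions: 5 records (rows), 3 samples (columns).
# These values are invented for illustration only.
partitions <- cbind(c(1L, 1L, 2L, 3L, 3L),
                    c(4L, 4L, 2L, 5L, 5L),
                    c(1L, 2L, 3L, 4L, 5L))
n_records <- nrow(partitions)

# Number of non-empty clusters in each posterior sample.
n_clusters <- apply(partitions, 2, function(z) length(unique(z)))

# Size of the cluster attached to each label 1..n_records, per sample
# (assumed to mirror the layout described for cluster_sizes).
cluster_sizes <- apply(partitions, 2, tabulate, nbins = n_records)

# Hypothetical helper: two label vectors encode the same partition when
# they agree on which pairs of records are clustered together.
same_partition <- function(z1, z2) {
  all(outer(z1, z1, "==") == outer(z2, z2, "=="))
}

# Samples 1 and 2 use different labels but group the records identically.
same_partition(partitions[, 1], partitions[, 2])
```

Comparing co-clustering matrices rather than raw labels avoids treating a relabeling of the same partition as a different sample.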

Arguments

comparison_list: The output from a call to create_comparison_data or reduce_comparison_data.
prior_list: The output from a call to specify_prior.
n_iter: The number of iterations of the Gibbs sampler to run.
Z_init: Initialization of the partition of records, represented as an integer vector of arbitrary labels of length sum(comparison_list$file_sizes). The default initialization places each record in its own cluster. See initialize_partition for an alternative initialization when there are no duplicates in each file.
seed: The seed to use while running the Gibbs sampler.
single_likelihood: A logical indicator of whether to use a single likelihood for comparisons for all file pairs, or whether to use a separate likelihood for comparisons for each file pair. When single_likelihood=TRUE, a single likelihood is used, and the prior hyperparameters for m and u from the first file pair are used. We do not recommend using a single likelihood in general.
chaperones_info: If chaperones_info is set to NA, then Gibbs updates to the partition are used during the Gibbs sampler, as described in Aleshin-Guendel & Sadinle (2022). Otherwise, Chaperones updates, as described in Miller et al. (2015) and Betancourt et al. (2016), are used, and chaperones_info should be a list with five elements controlling Chaperones updates to the partition during the Gibbs sampler:
  chap_type: 0 if using a uniform Chaperones distribution, and 1 if using a nonuniform Chaperones distribution.
  num_chap_iter: The number of Chaperones updates to the partition that are made during each iteration of the Gibbs sampler.
  nonuniform_chap_type: When using a nonuniform Chaperones distribution, 0 if using the exact version, or 1 if using the partial version.
  extra_gibbs: A logical indicator of whether a Gibbs update to the partition should be done after the Chaperones updates, at each iteration of the Gibbs sampler.
  num_restrict: The number of restricted Gibbs steps to take during each Chaperones update to the partition.
verbose: A logical indicator of whether progress messages should be printed (default TRUE).
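As a minimal sketch of two of the arguments above, the following base R code builds the default Z_init for a made-up set of file sizes and assembles a chaperones_info list with the five documented element names. The file sizes and the numeric values for num_chap_iter and num_restrict are illustrative assumptions, not recommended settings.

```r
# Hypothetical file sizes standing in for comparison_list$file_sizes.
file_sizes <- c(3, 4, 5)

# Default initialization: one distinct label per record, so every record
# starts in its own cluster.
Z_init <- 1:sum(file_sizes)
stopifnot(length(unique(Z_init)) == sum(file_sizes))

# A chaperones_info list with the five documented elements.
# The numeric settings below are placeholders chosen for illustration.
chaperones_info <- list(
  chap_type = 1,             # 1 = nonuniform Chaperones distribution
  num_chap_iter = 100,       # Chaperones updates per Gibbs iteration (assumed value)
  nonuniform_chap_type = 0,  # 0 = exact version, 1 = partial version
  extra_gibbs = TRUE,        # also do a Gibbs update after the Chaperones updates
  num_restrict = 10          # restricted Gibbs steps per Chaperones update (assumed value)
)
```

A list like this would be passed as gibbs_sampler(..., chaperones_info = chaperones_info); leaving the argument at its default of NA keeps the plain Gibbs updates.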

Details

Given the prior specified using specify_prior, this function runs a Gibbs sampler to explore the posterior distribution of partitions of records, conditional on the comparison data created using create_comparison_data or reduce_comparison_data.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. doi:10.1080/01621459.2021.2013242

Jeffrey Miller, Brenda Betancourt, Abbas Zaidi, Hanna Wallach, & Rebecca C. Steorts (2015). Microclustering: When the cluster sizes grow sublinearly with the size of the data set. NeurIPS Bayesian Nonparametrics: The Next Generation Workshop Series.

Brenda Betancourt, Giacomo Zanella, Jeffrey Miller, Hanna Wallach, Abbas Zaidi, & Rebecca C. Steorts (2016). Flexible Models for Microclustering with Application to Entity Resolution. Advances in Neural Information Processing Systems.

Examples

# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = no_dup_data_small$file_sizes,
 duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
 alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
 dup_count_prior_family = NA, dup_count_prior_pars = NA,
 n_prior_family = "uniform", n_prior_pars = NA)

# Find initialization for the matching (this step is optional)
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
 (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)

# Run the Gibbs sampler
results <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000,
 Z_init = Z_init, seed = 42)

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = dup_data_small$file_sizes,
 duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
 (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
 pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
 flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
 dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
 dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform",
 n_prior_pars = NA)

# Run the Gibbs sampler
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
 seed = 42)