initialize_partition: Initialize the Partition

Description

Generate an initialization for the partition when it is assumed that no file contains duplicates (so that the partition is a matching).

Usage

initialize_partition(comparison_list, pairs_to_keep, seed = NA)

Value

An integer vector of arbitrary labels, of length sum(comparison_list$file_sizes), giving an initialization for the partition.

Arguments

comparison_list: The output from a call to create_comparison_data or reduce_comparison_data. To specify the initialization correctly, if reduce_comparison_data is used to reduce the number of record pairs that are candidate matches, then the output of reduce_comparison_data (not create_comparison_data) should be used for this argument.
pairs_to_keep: A logical vector, the same length as comparison_list$record_pairs, indicating which record pairs are potential matches in the initialization.
seed: The seed to use to generate the initialization.

Details

When it is assumed that no file contains duplicates, and reduce_comparison_data is not used to reduce the number of potential matches, the Gibbs sampler used for posterior inference may mix slowly if the partition is initialized with each record in its own cluster (the default option for the Gibbs sampler). The purpose of this function is to provide an alternative initialization scheme.

To use this initialization scheme, the user passes in a logical vector that indicates which record pairs are potential matches according to an indexing method (as in reduce_comparison_data). Note that this indexing is used only to generate the initialization; it is not used for inference. The initialization scheme first finds the transitive closure of the potential matches, which partitions the records into blocks. Within each block of records, the scheme randomly selects one record from each file represented in the block, and these selected records are placed in the same cluster for the partition initialization. All other records are placed in their own clusters.
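
The scheme described above can be sketched in base R as follows. This is an illustrative sketch only, not the package's implementation: the function name sketch_initialize, its arguments, and the toy inputs are hypothetical. It assumes that records are indexed 1 to sum(file_sizes), file by file, and that record_pairs is a two-column matrix of record indices. A simple union-find pass stands in for the transitive closure step.

```r
# Hypothetical sketch of the initialization scheme (not the package's code).
sketch_initialize <- function(file_sizes, record_pairs, pairs_to_keep, seed = NA) {
  if (!is.na(seed)) set.seed(seed)
  n <- sum(file_sizes)
  # File membership of each record, assuming records are ordered file by file
  file_of <- rep(seq_along(file_sizes), times = file_sizes)

  # Transitive closure of the kept pairs via a simple union-find
  parent <- seq_len(n)
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }
  kept <- record_pairs[pairs_to_keep, , drop = FALSE]
  for (k in seq_len(nrow(kept))) {
    a <- find_root(as.integer(kept[k, 1]))
    b <- find_root(as.integer(kept[k, 2]))
    if (a != b) parent[max(a, b)] <- min(a, b)
  }
  block <- vapply(seq_len(n), find_root, integer(1))

  # Within each block, pick one record per file and cluster the picks together;
  # every other record in the block becomes its own cluster.
  labels <- integer(n)
  next_label <- 0L
  for (b in unique(block)) {
    members <- which(block == b)
    chosen <- unname(vapply(
      split(members, file_of[members]),
      function(m) m[sample.int(length(m), 1)],
      integer(1)
    ))
    next_label <- next_label + 1L
    labels[chosen] <- next_label
    for (r in setdiff(members, chosen)) {
      next_label <- next_label + 1L
      labels[r] <- next_label
    }
  }
  labels
}

# Toy input: three files of sizes 2, 2 and 1 (records 1-2, 3-4 and 5)
toy_file_sizes <- c(2, 2, 1)
toy_pairs <- rbind(c(1, 3), c(1, 4), c(2, 3), c(2, 4),
                   c(1, 5), c(2, 5), c(3, 5), c(4, 5))
# Keep only the pairs (1, 3) and (3, 5), which chain records 1, 3 and 5 together
toy_keep <- c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)
z <- sketch_initialize(toy_file_sizes, toy_pairs, toy_keep, seed = 42)
```

In this toy input, records 1, 3 and 5 form one block with exactly one record per file, so they share a label, while records 2 and 4 each receive their own label. Because at most one record per file is selected in each block, no cluster can contain two records from the same file, which is what makes the result a valid matching.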

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. doi:10.1080/01621459.2021.2013242 (also available on arXiv).

Examples

# Example with a small dataset that contains no duplicates
library(multilink)
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

# Find an initialization for the matching.
# The following line keeps a pair of records as a potential match in the
# initialization only if neither gname nor fname disagrees at the highest
# level.
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
 (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)