specify_prior: Specify the Prior Distributions

Description

Specify the prior distributions for the $m$ and $u$ parameters of the models for comparison data among matches and non-matches, and the partition.

Usage

specify_prior(
  comparison_list,
  mus = NA,
  nus = NA,
  flat = 0,
  alphas = NA,
  dup_upper_bound = NA,
  dup_count_prior_family = NA,
  dup_count_prior_pars = NA,
  n_prior_family = NA,
  n_prior_pars = NA
)

Value

a list containing:

mus: The hyperparameters of the Dirichlet priors for the m parameters for the comparisons among matches.
nus: The hyperparameters of the Dirichlet priors for the u parameters for the comparisons among non-matches. Includes data from comparisons of record pairs that were declared to not be potential matches using reduce_comparison_data.
flat: A numeric indicator of whether a flat prior for partitions should be used. flat is 1 if a flat prior is used, and flat is 0 if a structured prior is used.
no_dups: A numeric indicator of whether no duplicates are allowed in all of the files.
alphas: The hyperparameters for the Dirichlet-multinomial overlap table prior, a positive numeric vector of length 2 ^ comparison_list$K, where the first element is 0.
alpha_0: The sum of alphas.
dup_upper_bound: A numeric vector indicating the maximum number of duplicates, from each file, allowed in each cluster. For a given file k, dup_upper_bound[k] should be between 1 and comparison_list$file_sizes[k], i.e. even if you don't want to impose an upper bound, you have to implicitly place an upper bound: the number of records in a file.
log_dup_count_prior: A list containing the log density of the prior distribution for the number of duplicates in each cluster, for each file.
log_n_prior: A numeric vector containing the log density of the prior distribution for the number of clusters represented in the records.
nus_specified: The nus before data from comparisons of record pairs that were declared to not be potential matches using reduce_comparison_data are added. Used for input checking.

Arguments

comparison_list: The output from a call to create_comparison_data or reduce_comparison_data. Note that in order to correctly specify the prior, if reduce_comparison_data is used to reduce the number of record pairs that are potential matches, then the output of reduce_comparison_data (not create_comparison_data) should be used for this argument.
mus, nus: The hyperparameters of the Dirichlet priors for the $m$ and $u$ parameters for the comparisons among matches and non-matches, respectively. These are positive numeric vectors which have length equal to the number of columns of comparison_list$comparisons times the number of file pairs (comparison_list$K * (comparison_list$K + 1) / 2). If set to NA, flat priors are used. We recommend using flat priors for $m$ and $u$.
flat: A numeric indicator of whether a flat prior for partitions should be used. flat should be 1 if a flat prior is used, and flat should be 0 if a structured prior is used. If a flat prior is used, the remaining arguments should be set to NA. Otherwise, the remaining arguments should be specified. We do not recommend using a flat prior for partitions in general.
alphas: The hyperparameters for the Dirichlet-multinomial overlap table prior, a positive numeric vector of length 2 ^ comparison_list$K - 1. The indexing of these hyperparameters is based on the comparison_list$K-bit binary representation of the inclusion patterns of the overlap table. To give a few examples, suppose comparison_list$K is 3. 1 in 3-bit binary is 001, so alphas[1] is the hyperparameter for the 001 cell of the overlap table, representing clusters containing only records from the third file. 2 in 3-bit binary is 010, so alphas[2] is the hyperparameter for the 010 cell of the overlap table, representing clusters containing only records from the second file. 3 in 3-bit binary is 011, so alphas[3] is the hyperparameter for the 011 cell of the overlap table, representing clusters containing only records from the second and third files. If set to NA, the hyperparameters will all be set to 1.
dup_upper_bound: A numeric vector indicating the maximum number of duplicates, from each file, allowed in each cluster. For a given file k, dup_upper_bound[k] should be between 1 and comparison_list$file_sizes[k], i.e. even if you don't want to impose an upper bound, you have to implicitly place an upper bound: the number of records in a file. If set to NA, the upper bound for file k will be set to 1 if no duplicates are allowed for that file, or comparison_list$file_sizes[k] if duplicates are allowed for that file.
dup_count_prior_family: A character vector indicating the prior distribution family used for the number of duplicates in each cluster, for each file. Currently the only option is "Poisson" for a Poisson prior, truncated to lie between 1 and dup_upper_bound[k]. The mean parameter of the Poisson distribution is specified using the dup_count_prior_pars argument. If set to NA, a Poisson prior with mean 1 will be used.
dup_count_prior_pars: A list containing the parameters for the prior distribution for the number of duplicates in each cluster, for each file. For file k, when dup_count_prior_family[k]="Poisson", dup_count_prior_pars[[k]] is a positive constant representing the mean of the Poisson prior.
n_prior_family: A character string indicating the prior distribution family used for n, the number of clusters represented in the records. Note that this includes records determined not to be potential matches to any other records using reduce_comparison_data. Currently there are two options: "uniform" for a uniform prior for n, i.e. $p(n) \propto 1$, and "scale" for a scale prior for n, i.e. $p(n) \propto 1/n$. If set to NA, a uniform prior will be used.
n_prior_pars: Currently set to NA. When more prior distribution families for n are implemented, this will be a vector of parameters for those priors.
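
The indexing rule described for alphas can be checked with a few lines of base R. The following is only an illustrative sketch: alpha_patterns is a hypothetical helper that is not part of the package, and it simply writes each index from 1 to 2^K - 1 in K-bit binary, with the first file as the leftmost bit.

```r
# Hypothetical helper (not part of the package): for K files, return the
# K-bit inclusion pattern that each index of alphas refers to.
alpha_patterns <- function(K) {
  idx <- seq_len(2^K - 1)
  vapply(idx, function(i) {
    # Integer division by decreasing powers of 2 extracts the bits,
    # leftmost bit (file 1) first.
    bits <- (i %/% 2^((K - 1):0)) %% 2
    paste(bits, collapse = "")
  }, character(1))
}

# Patterns for three files; position j corresponds to alphas[j]
patterns <- alpha_patterns(3)
print(patterns)

# Files included in the cell indexed by alphas[3] (pattern 011)
files_in_cell_3 <- which(strsplit(patterns[3], "")[[1]] == "1")
print(files_in_cell_3)
```

Reading a pattern left to right, a 1 in position k means the cell covers clusters that contain at least one record from file k.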

Details

The purpose of this function is to specify prior distributions for all parameters of the model. Please note that if reduce_comparison_data is used to reduce the number of record pairs that are potential matches, then the output of reduce_comparison_data (not create_comparison_data) should be used as input.

For the hyperparameters of the Dirichlet priors for the $m$ and $u$ parameters for the comparisons among matches and non-matches, respectively, we recommend using a flat prior. This is accomplished by setting mus=NA and nus=NA. Informative prior specifications are possible, but in practice they will be overwhelmed by the large number of comparisons.
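
If informative values for mus or nus are supplied, the vector must have the length given in the Arguments section. A minimal sketch of that length calculation follows. Two things here are assumptions: expected_hyper_length is a hypothetical helper that is not part of the package, and the column count of 10 is an invented number. The all-ones vector is the usual way to write a flat Dirichlet prior, but whether it reproduces exactly what the function does for NA is not confirmed here.

```r
# Hypothetical helper (not part of the package): required length of mus or nus
# given the number of comparison columns and the number of files K.
expected_hyper_length <- function(n_comparison_cols, K) {
  n_file_pairs <- K * (K + 1) / 2  # file pairs, including within-file pairs
  n_comparison_cols * n_file_pairs
}

# Invented sizes for illustration: 10 comparison columns, 3 files
hyper_length <- expected_hyper_length(n_comparison_cols = 10, K = 3)

# Assumption: an all-ones vector is the conventional flat Dirichlet choice
mus_flat <- rep(1, hyper_length)
print(hyper_length)
```

In real use, n_comparison_cols would be ncol(comparison_list$comparisons) and K would be comparison_list$K.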

For the prior for partitions, we do not recommend using a flat prior. Instead we recommend using our structured prior for partitions. By setting flat=0 and the remaining arguments to NA, one obtains the default specification for the structured prior that we have found to perform well in simulation studies. The structured prior for partitions is specified as follows:

Specify a prior for n, the number of clusters represented in the records. Note that this includes records determined not to be potential matches to any other records using reduce_comparison_data. Currently, a uniform prior and a scale prior for n are supported. Our default specification uses a uniform prior.
Specify a prior for the overlap table (see the documentation for alphas for more information). Currently a Dirichlet-multinomial prior is supported. Our default specification sets all hyperparameters of the Dirichlet-multinomial prior to 1.
For each file, specify a prior for the number of duplicates in each cluster. As a part of this prior, we specify the maximum number of records in a cluster for each file, through dup_upper_bound. When there are assumed to be no duplicates in a file, the maximum number of records in a cluster for that file is set to 1. When there are assumed to be duplicates in a file, we recommend setting the maximum number of records in a cluster for that file to be less than the file size, if prior knowledge allows. Currently, a Poisson prior for the number of duplicates in each cluster is supported. Our default specification uses a Poisson prior with mean 1.
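
To make the first and last components above more concrete, the sketch below computes log densities in base R. It is an illustration under stated assumptions, not the package's implementation: log_trunc_pois and log_n_prior are hypothetical helpers, the truncated Poisson is assumed to be an ordinary Poisson restricted to 1, ..., dup_upper_bound[k] and renormalised, and the priors for n are left unnormalised.

```r
# Hypothetical helper: log density of a Poisson(mean) distribution restricted
# to the values 1, ..., upper_bound and renormalised (assumed form).
log_trunc_pois <- function(mean, upper_bound) {
  support <- seq_len(upper_bound)
  log_unnorm <- dpois(support, lambda = mean, log = TRUE)
  log_unnorm - log(sum(exp(log_unnorm)))
}

# Hypothetical helper: unnormalised log prior for n = 1, ..., n_max.
# "uniform" gives p(n) proportional to 1, "scale" gives p(n) proportional to 1 / n.
log_n_prior <- function(n_max, family = c("uniform", "scale")) {
  family <- match.arg(family)
  n <- seq_len(n_max)
  if (family == "uniform") rep(0, n_max) else -log(n)
}

# Mean 1 with at most 10 records per cluster from a file
dup_log_prior <- log_trunc_pois(mean = 1, upper_bound = 10)

# With an upper bound of 1 (no duplicates), all mass sits on a single value
no_dup_log_prior <- log_trunc_pois(mean = 1, upper_bound = 1)

scale_log_prior <- log_n_prior(5, family = "scale")
```

Under these assumptions, an upper bound of 1 makes the duplicate-count prior degenerate, which is consistent with the description of files without duplicates.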

Please contact the package maintainer if you need new prior families for n or the number of duplicates in each cluster to be supported.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2021.2013242

Examples

# Example with a small dataset that contains no duplicates
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = no_dup_data_small$file_sizes,
 duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
 alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
 dup_count_prior_family = NA, dup_count_prior_pars = NA,
 n_prior_family = "uniform", n_prior_pars = NA)

# Example with a small dataset that contains duplicates
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = dup_data_small$file_sizes,
 duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
 (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
 pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
 flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
 dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
 dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform",
 n_prior_pars = NA)