Specify the prior distributions for the m and u parameters of the comparison data models and for the partition.
specify_prior(
  comparison_list,
  mus = NA,
  nus = NA,
  flat = 0,
  alphas = NA,
  dup_upper_bound = NA,
  dup_count_prior_family = NA,
  dup_count_prior_pars = NA,
  n_prior_family = NA,
  n_prior_pars = NA
)
A list containing:
mus
The hyperparameters of the Dirichlet priors for the m parameters for the comparisons among matches.
nus
The hyperparameters of the Dirichlet priors for the u parameters for the comparisons among non-matches. Includes data from comparisons of record pairs that were declared to not be potential matches using reduce_comparison_data.
flat
A numeric indicator of whether a flat prior for partitions should be used. flat is 1 if a flat prior is used, and flat is 0 if a structured prior is used.
no_dups
A numeric indicator of whether no duplicates are allowed in all of the files.
alphas
The hyperparameters for the Dirichlet-multinomial overlap table prior, a positive numeric vector of length 2 ^ comparison_list$K, where the first element is 0.
alpha_0
The sum of alphas.
dup_upper_bound
A numeric vector indicating the maximum number of duplicates, from each file, allowed in each cluster. For a given file k, dup_upper_bound[k] should be between 1 and comparison_list$file_sizes[k], i.e. even if you don't want to impose an upper bound, you have to implicitly place an upper bound: the number of records in a file.
log_dup_count_prior
A list containing the log density of the prior distribution for the number of duplicates in each cluster, for each file.
log_n_prior
A numeric vector containing the log density of the prior distribution for the number of clusters represented in the records.
nus_specified
The nus before data from comparisons of record pairs that were declared to not be potential matches using reduce_comparison_data are added. Used for input checking.
comparison_list
The output from a call to create_comparison_data or reduce_comparison_data. Note that in order to correctly specify the prior, if reduce_comparison_data is used to reduce the number of record pairs that are potential matches, then the output of reduce_comparison_data (not create_comparison_data) should be used for this argument.
mus, nus
The hyperparameters of the Dirichlet priors for the m and u parameters for the comparisons among matches and non-matches, respectively. These are numeric vectors which have length equal to the number of columns of comparison_list$comparisons times the number of file pairs (comparison_list$K * (comparison_list$K + 1) / 2). If set to NA, flat priors are used. We recommend using flat priors for the m and u parameters.
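The required length of mus and nus follows directly from the formula above. The sketch below only illustrates the arithmetic; the values of K and n_columns are hypothetical, and expected_prior_length is an illustrative helper, not a function of the package.

```r
# Hypothetical inputs: 3 files and a comparison matrix with 22 columns.
K <- 3
n_columns <- 22

# Number of file pairs, counting within-file pairs: K * (K + 1) / 2.
n_file_pairs <- K * (K + 1) / 2

# Illustrative helper (not part of the package): the length that a
# user-supplied mus or nus vector would need to have.
expected_prior_length <- function(n_columns, K) {
  n_columns * K * (K + 1) / 2
}

required_length <- expected_prior_length(n_columns, K)
```

With a real comparison_list, n_columns would presumably be ncol(comparison_list$comparisons) and K would be comparison_list$K.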
flat
A numeric indicator of whether a flat prior for partitions should be used. flat should be 1 if a flat prior is used, and flat should be 0 if a structured prior is used. If a flat prior is used, the remaining arguments should be set to NA. Otherwise, the remaining arguments should be specified. We do not recommend using a flat prior for partitions in general.
alphas
The hyperparameters for the Dirichlet-multinomial overlap table prior, a positive numeric vector of length 2 ^ comparison_list$K - 1. The indexing of these hyperparameters is based on the comparison_list$K-bit binary representation of the inclusion patterns of the overlap table. To give a few examples, suppose comparison_list$K is 3. 1 in 3-bit binary is 001, so alphas[1] is the hyperparameter for the 001 cell of the overlap table, representing clusters containing only records from the third file. 2 in 3-bit binary is 010, so alphas[2] is the hyperparameter for the 010 cell of the overlap table, representing clusters containing only records from the second file. 3 in 3-bit binary is 011, so alphas[3] is the hyperparameter for the 011 cell of the overlap table, representing clusters containing only records from the second and third files. If set to NA, the hyperparameters will all be set to 1.
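The indexing rule described above can be sketched in base R. This is only an illustration, assuming comparison_list$K is 3 as in the examples; inclusion_bits, inclusion_pattern and files_in_cell are hypothetical helpers and are not part of the package.

```r
K <- 3  # assumed number of files, as in the examples in the text

# Hypothetical helper: the K binary digits of index i, most significant first.
# Digit j is 1 when the overlap table cell includes records from file j.
inclusion_bits <- function(i, K) {
  (i %/% 2^((K - 1):0)) %% 2
}

# Hypothetical helper: the inclusion pattern of alphas[i] as a string.
inclusion_pattern <- function(i, K) {
  paste(inclusion_bits(i, K), collapse = "")
}

# Hypothetical helper: the files represented in the cell for alphas[i].
files_in_cell <- function(i, K) {
  which(inclusion_bits(i, K) == 1)
}

# Inclusion patterns for every element of alphas (length 2^K - 1).
patterns <- vapply(1:(2^K - 1), inclusion_pattern, character(1), K = K)
```

Here patterns[3] is "011" and files_in_cell(3, K) gives files 2 and 3, matching the third example in the text.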
dup_upper_bound
A numeric vector indicating the maximum number of duplicates, from each file, allowed in each cluster. For a given file k, dup_upper_bound[k] should be between 1 and comparison_list$file_sizes[k], i.e. even if you don't want to impose an upper bound, you have to implicitly place an upper bound: the number of records in a file. If set to NA, the upper bound for file k will be set to 1 if no duplicates are allowed for that file, or comparison_list$file_sizes[k] if duplicates are allowed for that file.
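The default rule and the range constraint described above can be mimicked in base R as follows. This is a sketch under assumptions, not the package's internal code: file_sizes and duplicates are hypothetical values, with duplicates coded as 0 (no duplicates allowed) or 1 (duplicates allowed), and is_valid_upper_bound is an illustrative helper.

```r
# Hypothetical inputs for three files.
file_sizes <- c(50, 60, 40)
duplicates <- c(0, 1, 1)  # assumed coding: 0 = no duplicates, 1 = duplicates

# Default described in the text: 1 when no duplicates are allowed,
# otherwise the number of records in the file.
default_dup_upper_bound <- ifelse(duplicates == 0, 1, file_sizes)

# Illustrative check (not part of the package) that a user-supplied bound
# lies between 1 and the file size for every file.
is_valid_upper_bound <- function(dup_upper_bound, file_sizes) {
  all(dup_upper_bound >= 1 & dup_upper_bound <= file_sizes)
}

bound_ok <- is_valid_upper_bound(c(1, 10, 10), file_sizes)
bound_too_large <- is_valid_upper_bound(c(1, 10, 41), file_sizes)
```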
dup_count_prior_family
A character vector indicating the prior distribution family used for the number of duplicates in each cluster, for each file. Currently the only option is "Poisson" for a Poisson prior, truncated to lie between 1 and dup_upper_bound[k]. The mean parameter of the Poisson distribution is specified using the dup_count_prior_pars argument. If set to NA, a Poisson prior with mean 1 will be used.
dup_count_prior_pars
A list containing the parameters for the prior distribution for the number of duplicates in each cluster, for each file. For file k, when dup_count_prior_family[k]="Poisson", dup_count_prior_pars[[k]] is a positive constant representing the mean of the Poisson prior.
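To make the truncated Poisson prior concrete, the following base R sketch computes the log density of a Poisson distribution restricted to the values 1 through an upper bound and renormalized. It assumes the standard definition of a truncated distribution; the package's own computation of log_dup_count_prior may differ in its details, and truncated_poisson_log_density is a hypothetical helper.

```r
# Hypothetical helper (not part of the package): log density of a Poisson
# distribution with the given mean, truncated to the values 1, ..., upper_bound.
truncated_poisson_log_density <- function(upper_bound, mean) {
  support <- seq_len(upper_bound)
  log_unnormalized <- dpois(support, lambda = mean, log = TRUE)
  # Renormalize so that the probabilities over the support sum to 1.
  log_unnormalized - log(sum(exp(log_unnormalized)))
}

# Example values taken from the examples in this page: bound 10, mean 1.
log_density <- truncated_poisson_log_density(upper_bound = 10, mean = 1)
```

Element x of log_density is the log prior probability of x records from the file in a cluster, under the stated assumption.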
n_prior_family
A character indicating the prior distribution family used for n, the number of clusters represented in the records. Note that this includes records determined not to be potential matches to any other records using reduce_comparison_data. Currently there are two options: "uniform" for a uniform prior for n, and "scale" for a scale prior for n. If set to NA, a uniform prior will be used.
n_prior_pars
Currently set to NA. When more prior distribution families for n are implemented, this will be a vector of parameters for those priors.
The purpose of this function is to specify prior distributions for all parameters of the model. Please note that if reduce_comparison_data is used to reduce the number of record pairs that are potential matches, then the output of reduce_comparison_data (not create_comparison_data) should be used as input.
For the hyperparameters of the Dirichlet priors for the m and u parameters, we recommend using flat priors, which is accomplished by setting mus=NA and nus=NA. Informative prior specifications are possible, but in practice they will be overwhelmed by the large number of comparisons.
For the prior for partitions, we do not recommend using a flat prior. Instead we recommend using our structured prior for partitions. By setting flat=0 and the remaining arguments to NA, one obtains the default specification for the structured prior that we have found to perform well in simulation studies. The structured prior for partitions is specified as follows:
Specify a prior for n, the number of clusters represented in the records. Note that this includes records determined not to be potential matches to any other records using reduce_comparison_data. Currently, a uniform prior and a scale prior for n are supported. Our default specification uses a uniform prior.
Specify a prior for the overlap table (see the documentation for alphas for more information). Currently a Dirichlet-multinomial prior is supported. Our default specification sets all hyperparameters of the Dirichlet-multinomial prior to 1.
For each file, specify a prior for the number of duplicates in each cluster. As a part of this prior, we specify the maximum number of records in a cluster for each file, through dup_upper_bound. When there are assumed to be no duplicates in a file, the maximum number of records in a cluster for that file is set to 1. When there are assumed to be duplicates in a file, we recommend setting the maximum number of records in a cluster for that file to be less than the file size, if prior knowledge allows. Currently, a Poisson prior for the number of duplicates in each cluster is supported. Our default specification uses a Poisson prior with mean 1.
Please contact the package maintainer if you need new prior families for n or the number of duplicates in each cluster to be supported.
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2021.2013242
# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
  alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
  dup_count_prior_family = NA, dup_count_prior_pars = NA,
  n_prior_family = "uniform", n_prior_pars = NA)

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
  flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
  dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
  dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform",
  n_prior_pars = NA)