Use indexing to reduce the number of record pairs that are potential matches.
reduce_comparison_data(comparison_list, pairs_to_keep, cc = 1)
A list containing:
record_pairs
A data.frame, where each row contains the pair of records being compared in
the corresponding row of comparisons. The rows are sorted in ascending order
according to the first column, with ties broken according to the second
column in ascending order. For any given row, the first column is less than
the second column, i.e. record_pairs[i, 1] < record_pairs[i, 2] for each
row i. If according to pairs_to_keep there are records which are not
potential matches to any other records, the remaining records are relabeled
(see labels).
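The ordering described for record_pairs can be illustrated with a minimal base R sketch. It assumes, purely for illustration, that every pair of records is compared; n_records is a hypothetical value and this is not the package's own construction.

```r
# Hypothetical example: 4 records in total, labeled 1 to 4.
n_records <- 4

# combn() enumerates the pairs in lexicographic order, one pair per column;
# transposing gives one pair per row.
record_pairs <- as.data.frame(t(combn(n_records, 2)))

# The first column is always smaller than the second.
all(record_pairs[, 1] < record_pairs[, 2])

# The rows are already sorted by the first column, then by the second.
identical(order(record_pairs[, 1], record_pairs[, 2]),
          seq_len(nrow(record_pairs)))
```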
comparisons
A logical matrix, where each row contains the comparisons between the record
pair in the corresponding row of record_pairs. Comparisons are in the same
order as the columns of records, and are represented by L + 1 columns of
TRUE/FALSE indicators, where L + 1 is the number of disagreement levels for
the field based on breaks.
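As a rough illustration of the indicator encoding, the sketch below turns a vector of disagreement levels for a single field into TRUE/FALSE columns. The level values and the column names are hypothetical and are not the names the package generates.

```r
# Hypothetical disagreement levels (1 = full agreement, 4 = strongest
# disagreement) for one field across five record pairs.
levels_observed <- c(1, 4, 2, 2, 3)
n_levels <- 4

# One TRUE per row, in the column matching the observed level.
indicators <- outer(levels_observed, seq_len(n_levels), "==")
colnames(indicators) <- paste0("level_", seq_len(n_levels))
indicators
```

Binding such blocks side by side, one per field, would give a matrix with one row per record pair, as described for comparisons.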
K
The number of files, assumed to be of class numeric.
file_sizes
A numeric vector of length K, indicating the size of each file. If according
to pairs_to_keep there are records which are not potential matches to any
other records, the remaining records are relabeled (see labels), and
file_sizes now represents the size of each file after removing such records.
duplicates
A numeric vector of length K, indicating which files are assumed to have
duplicates. duplicates[k] should be 1 if file k has duplicates, and
duplicates[k] should be 0 if file k has no duplicates.
field_levels
A numeric vector indicating the number of disagreement levels for each field.
file_labels
An integer vector of length sum(file_sizes), where file_labels[i] indicates
which file record i is in.
fp_matrix
An integer matrix, where fp_matrix[k1, k2] is a label for the file pair
(k1, k2). Note that fp_matrix[k1, k2] = fp_matrix[k2, k1].
rp_to_fp
A logical matrix that indicates which record pairs belong to which file
pairs. rp_to_fp[fp, rp] is TRUE if the records record_pairs[rp, ] belong to
the file pair fp, and is FALSE otherwise. Note that fp is given by the
labeling in fp_matrix.
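The relationship between file_labels, fp_matrix and rp_to_fp might be sketched as follows. The consecutive numbering of file pairs used here is an assumption made for illustration; the labeling actually produced by the package may differ, and all inputs are hypothetical.

```r
# Hypothetical setup: K = 2 files holding 2 and 1 records respectively.
K <- 2
file_labels <- c(1, 1, 2)
record_pairs <- t(combn(length(file_labels), 2))

# Assumed labeling: number the unordered file pairs (k1 <= k2) consecutively
# and mirror the labels so that the matrix is symmetric.
fp_matrix <- matrix(0L, K, K)
fp_matrix[upper.tri(fp_matrix, diag = TRUE)] <- seq_len(K * (K + 1) / 2)
fp_matrix[lower.tri(fp_matrix)] <- t(fp_matrix)[lower.tri(fp_matrix)]

# File pair label of each record pair, looked up from the files of its records.
fp_of_pair <- fp_matrix[cbind(file_labels[record_pairs[, 1]],
                              file_labels[record_pairs[, 2]])]

# One row per file pair, one column per record pair.
rp_to_fp <- outer(seq_len(K * (K + 1) / 2), fp_of_pair, "==")
rp_to_fp
```

Each column of rp_to_fp contains exactly one TRUE, since every record pair belongs to exactly one file pair.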
ab
An integer vector, of length ncol(comparisons) * K * (K + 1) / 2, that
indicates how many record pairs there are with a given disagreement level
for a given field, for each file pair.
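The length formula can be checked with a short worked example. The values of field_levels and K below are hypothetical, and the sketch assumes, following the descriptions of comparisons and field_levels, that ncol(comparisons) equals sum(field_levels).

```r
# Hypothetical sizes: three fields with 2, 4 and 4 disagreement levels,
# compared across K = 3 files.
field_levels <- c(2, 4, 4)
K <- 3

n_comparison_cols <- sum(field_levels)   # one column per field and level
n_file_pairs <- K * (K + 1) / 2          # unordered file pairs, k1 <= k2

ab_length <- n_comparison_cols * n_file_pairs
ab_length
```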
file_sizes_not_included
If according to pairs_to_keep there are records which are not potential
matches to any other records, the remaining records are relabeled (see
labels), and file_sizes_not_included indicates, for each file, the number of
such records that were removed.
ab_not_included
For record pairs not included according to pairs_to_keep, this is an integer
vector, of length ncol(comparisons) * K * (K + 1) / 2, that indicates how
many record pairs there are with a given disagreement level for a given
field, for each file pair.
labels
If according to pairs_to_keep there are records which are not potential
matches to any other records, the remaining records are relabeled. labels
provides a dictionary that indicates, for each of the new labels, which
record in the original labeling the new label corresponds to. In particular,
the first column indicates the record in the original labeling, and the
second column indicates the new labeling.
pairs_to_keep
A logical vector, the same length as comparison_list$record_pairs,
indicating which record pairs were kept as potential matches. This may not
be the same as the input pairs_to_keep if cc was set to 1.
cc
A numeric indicator of whether the connected components of the potential
matches are closed under transitivity.
Arguments:
comparison_list
The output of a call to create_comparison_data.
pairs_to_keep
A logical vector, the same length as comparison_list$record_pairs,
indicating which record pairs should be kept as potential matches. These
potential matches do not have to be transitive (see the argument cc).
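One simple way to obtain such a logical vector is blocking on a key. The sketch below is self-contained and uses hypothetical records and a hypothetical zip column; in practice the vector must line up with the rows of comparison_list$record_pairs.

```r
# Hypothetical records; zip is the blocking key.
records <- data.frame(
  name = c("ann", "anne", "bob", "rob"),
  zip = c("10001", "10001", "20002", "20002"),
  stringsAsFactors = FALSE
)
record_pairs <- t(combn(nrow(records), 2))

# Keep a pair only if both records share the blocking key.
pairs_to_keep <- records$zip[record_pairs[, 1]] ==
  records$zip[record_pairs[, 2]]
pairs_to_keep
```

Blocking on exact equality of a single key yields transitive potential matches. Rules that combine several keys with "or" generally do not.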
cc
A numeric indicator of whether to find the transitive closure of
pairs_to_keep, and use these potential matches instead of just those from
pairs_to_keep. cc should be 1 if the transitive closure is being used, and
cc should be 0 if the transitive closure is not being used. We recommend
setting cc to 1.
When using comparison-based record linkage methods, scalability is a concern,
as the number of record pairs is quadratic in the number of records. To
address this, it is common to declare a priori, using indexing methods, that
certain record pairs are not potential matches. The user is free to index
using any method they like, as long as they can produce a logical vector
that indicates which record pairs are potential matches according to their
indexing method. If the user-chosen indexing method does not output
potential matches that are transitive, we recommend setting the cc argument
to 1. By transitive we mean that, for any three records \(i\), \(j\), and
\(k\), if \(i\) and \(j\) are potential matches, and \(j\) and \(k\) are
potential matches, then \(i\) and \(k\) are potential matches.
Non-transitive indexing schemes can lead to poor mixing of the Gibbs sampler
used for posterior inference, and suggest that the indexing method used may
have been too stringent.
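To make the notion of transitive closure concrete, the following base R sketch closes a small non-transitive set of potential matches by labeling connected components. It is an illustration under hypothetical inputs, not the implementation used by reduce_comparison_data.

```r
# Hypothetical input: 4 records, all 6 pairs, and a non-transitive indexing
# that keeps (1, 2) and (2, 3) but drops (1, 3).
n_records <- 4
record_pairs <- t(combn(n_records, 2))
pairs_to_keep <- c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)

# Label connected components: start with one component per record and
# repeatedly give both ends of every kept pair the smaller label.
component <- seq_len(n_records)
kept <- record_pairs[pairs_to_keep, , drop = FALSE]
repeat {
  previous <- component
  for (r in seq_len(nrow(kept))) {
    smallest <- min(component[kept[r, ]])
    component[kept[r, ]] <- smallest
  }
  if (identical(component, previous)) break
}

# Transitive closure: keep every pair whose records share a component.
pairs_closed <- component[record_pairs[, 1]] == component[record_pairs[, 2]]
pairs_closed
```

Here the pair (1, 3) is added because records 1 and 3 are both linked to record 2, while record 4 stays on its own.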
If indexing is used, it may be the case that some records are declared to not
be potential matches to any other records. In this case, the indexing method
has made the decision that these records have no matches, and thus we can
remove them from the data set and relabel the remaining records; see the
documentation for labels for information on how to go between the original
labeling and the new labeling.
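The removal and relabeling step can be sketched in base R as follows. The inputs are hypothetical, and the dictionary only mimics the two-column layout described for labels (original label first, new label second); it is not the package's own code.

```r
# Hypothetical input: 5 records, of which only 1, 2 and 4 appear in a kept pair.
n_records <- 5
record_pairs <- t(combn(n_records, 2))
pairs_to_keep <- (record_pairs[, 1] == 1 & record_pairs[, 2] == 2) |
  (record_pairs[, 1] == 2 & record_pairs[, 2] == 4)

# Records that are a potential match to at least one other record.
kept_pairs <- record_pairs[pairs_to_keep, , drop = FALSE]
retained <- sort(unique(as.vector(kept_pairs)))

# Dictionary: original label in the first column, new label in the second.
labels <- cbind(original = retained, new = seq_along(retained))

# Rewrite the kept pairs in terms of the new labels.
relabeled_pairs <- matrix(match(kept_pairs, retained), ncol = 2)
relabeled_pairs
```

Looking up a new label in the second column of the dictionary and reading the first column recovers the original record.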
If indexing is used, comparisons for record pairs that aren't potential matches are still used during inference, where they're used to inform the distribution of comparisons for non-matches.
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. doi:10.1080/01621459.2021.2013242
# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)