Use indexing to reduce the number of record pairs that are potential matches.
reduce_comparison_data(comparison_list, pairs_to_keep, cc = 1)
A list containing:
record_pairs: A data.frame, where each row
contains the pair of records being compared in the corresponding row of
comparisons. The rows are sorted in ascending order according to the
first column, with ties broken according to the second column in ascending
order. For any given row, the first column is less than the second column,
i.e. record_pairs[i, 1] < record_pairs[i, 2] for each row i.
If according to pairs_to_keep there are records which are not
potential matches to any other records, the remaining records are
relabeled (see labels).
comparisons: A logical matrix, where each row contains
the comparisons between the record pair in the corresponding row of
record_pairs. Comparisons are in the same order as the columns of
records, and each field is represented by L + 1 columns of
TRUE/FALSE indicators, where L + 1 is the number of
disagreement levels for that field based on breaks.
K: The number of files, assumed to be of class
numeric.
file_sizes: A numeric vector of length K,
indicating the size of each file. If according to pairs_to_keep
there are records which are not potential matches to any other records, the
remaining records are relabeled (see labels), and file_sizes
now represents the sizes of each file after removing such records.
duplicates: A numeric vector of length K,
indicating which files are assumed to have duplicates. duplicates[k]
should be 1 if file k has duplicates, and
duplicates[k] should be 0 if file k has no
duplicates.
field_levels: A numeric vector indicating the number of
disagreement levels for each field.
file_labels: An integer vector of length
sum(file_sizes), where file_labels[i] indicates which file
record i is in.
fp_matrix: An integer matrix, where
fp_matrix[k1, k2] is a label for the file pair (k1, k2). Note
that fp_matrix[k1, k2] = fp_matrix[k2, k1].
rp_to_fp: A logical matrix that indicates which record
pairs belong to which file pairs. rp_to_fp[fp, rp] is TRUE if
the records record_pairs[rp, ] belong to the file pair fp,
and is FALSE otherwise. Note that fp is given by the labeling in
fp_matrix.
ab: An integer vector, of length
ncol(comparisons) * K * (K + 1) / 2, that indicates how many record
pairs there are with a given disagreement level for a given field, for each
file pair.
file_sizes_not_included: If according to pairs_to_keep
there are records which are not potential matches to any other records, the
remaining records are relabeled (see labels), and
file_sizes_not_included indicates, for each file, the number of such
records that were removed.
ab_not_included: For record pairs not included according to
pairs_to_keep, this is an integer vector, of length
ncol(comparisons) * K * (K + 1) / 2, that indicates how many record
pairs there are with a given disagreement level for a given field, for each
file pair.
labels: If according to pairs_to_keep
there are records which are not potential matches to any other records, the
remaining records are relabeled. labels provides a dictionary that
indicates, for each of the new labels, which record in the original
labeling the new label corresponds to. In particular, the first column
indicates the record in the original labeling, and the second column
indicates the new labeling.
pairs_to_keep: A logical vector, with one entry per row of
comparison_list$record_pairs, indicating which record pairs were
kept as potential matches. This may differ from the input
pairs_to_keep if cc was set to 1.
cc: A numeric indicator of whether the connected
components of the potential matches are closed under transitivity.
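The relationship between file_labels, fp_matrix, and rp_to_fp described above can be sketched in base R. This is an illustration only: the toy file_labels and the particular numbering of file pairs are assumptions, and the labels that the package actually assigns in fp_matrix may differ.

```r
# Assumed toy setup: K = 3 files and 4 records with these file labels.
K <- 3
file_labels <- c(1, 1, 2, 3)
record_pairs <- t(combn(length(file_labels), 2))  # all pairs with i < j

# One possible symmetric labeling of the K * (K + 1) / 2 file pairs
# (within-file pairs included). The numbering is an assumption.
n_fp <- K * (K + 1) / 2
fp_matrix <- matrix(0L, K, K)
fp_matrix[upper.tri(fp_matrix, diag = TRUE)] <- seq_len(n_fp)
fp_matrix[lower.tri(fp_matrix)] <- t(fp_matrix)[lower.tri(fp_matrix)]

# File pair label of each record pair, looked up through file_labels.
fp_of_pair <- fp_matrix[cbind(file_labels[record_pairs[, 1]],
                              file_labels[record_pairs[, 2]])]

# rp_to_fp[fp, rp] is TRUE when record pair rp belongs to file pair fp.
rp_to_fp <- outer(seq_len(n_fp), fp_of_pair, "==")
```

Each column of rp_to_fp in this sketch has exactly one TRUE, since every record pair belongs to exactly one file pair.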
comparison_list: The output of a call to
create_comparison_data.
pairs_to_keep: A logical vector, with one entry per row of
comparison_list$record_pairs, indicating which record pairs should be
kept as potential matches. These potential matches do not have to be
transitive (see the argument cc).
cc: A numeric indicator of whether to find the transitive
closure of pairs_to_keep, and use these potential matches instead
of just those from pairs_to_keep. cc should be 1 if the
transitive closure is being used, and 0 if it is not. We recommend
setting cc to 1.
When using comparison-based record linkage methods, scalability is a concern,
as the number of record pairs is quadratic in the number of records. In
order to address these concerns, it's common to declare certain record pairs
to not be potential matches a priori, using indexing methods. The user is
free to index using any method they like, as long as they can produce a
logical vector that indicates which record pairs are potential matches
according to their indexing method. If the user-chosen indexing method does
not output potential matches that are transitive, we recommend setting the
cc argument to 1. By transitive we mean, for any three records
\(i\), \(j\), and \(k\), if \(i\) and \(j\) are potential matches,
and \(j\) and \(k\) are potential matches, then \(i\) and \(k\) are
potential matches. Non-transitive indexing schemes can lead to poor mixing of
the Gibbs sampler used for posterior inference, and suggest that the
indexing method used may have been too stringent.
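To make the transitivity condition concrete, here is a minimal base R sketch that closes a non-transitive set of potential matches by grouping records into connected components. It illustrates the idea only and is not the package's implementation; the toy data and the names component and closed_pairs_to_keep are assumptions.

```r
# Toy data: all pairs (i < j) among 5 records, sorted like record_pairs.
n_records <- 5
record_pairs <- t(combn(n_records, 2))

# A non-transitive indexing result: keep (1, 2) and (2, 3), but not (1, 3).
pairs_to_keep <- rep(FALSE, nrow(record_pairs))
pairs_to_keep[record_pairs[, 1] == 1 & record_pairs[, 2] == 2] <- TRUE
pairs_to_keep[record_pairs[, 1] == 2 & record_pairs[, 2] == 3] <- TRUE

# Label connected components by propagating the smallest label along
# kept pairs until nothing changes.
component <- seq_len(n_records)
repeat {
  changed <- FALSE
  for (r in which(pairs_to_keep)) {
    i <- record_pairs[r, 1]
    j <- record_pairs[r, 2]
    m <- min(component[i], component[j])
    if (component[i] != m || component[j] != m) {
      component[c(i, j)] <- m
      changed <- TRUE
    }
  }
  if (!changed) break
}

# In the closure, a pair is kept when both records share a component.
closed_pairs_to_keep <-
  component[record_pairs[, 1]] == component[record_pairs[, 2]]
```

In this toy case the closure adds the pair (1, 3), while records 4 and 5 remain without any potential matches.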
If indexing is used, it may be the case that some records are declared to not
be potential matches to any other records. In this case, the indexing method
has made the decision that these records have no matches, and thus we can
remove them from the data set and relabel the remaining records; see the
documentation for labels for information on how to go between the
original labeling and the new labeling.
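As a sketch of how such a dictionary can be used, the following base R code converts between the two labelings. The labels matrix here is a made-up example (records 2 and 5 out of 6 removed), the column names are added only for readability, and the helper functions new_to_original and original_to_new are hypothetical, not part of the package.

```r
# Hypothetical dictionary: column 1 is the original label, column 2 the new one.
labels <- cbind(original = c(1, 3, 4, 6), new = 1:4)

# New label -> original label.
new_to_original <- function(new_ids) {
  labels[match(new_ids, labels[, 2]), 1]
}

# Original label -> new label (NA for records removed by indexing).
original_to_new <- function(orig_ids) {
  labels[match(orig_ids, labels[, 1]), 2]
}

orig_ids <- new_to_original(c(2, 4))   # original labels of new records 2 and 4
new_ids <- original_to_new(c(3, 5))    # record 5 was removed, so it maps to NA
```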
If indexing is used, comparisons for record pairs that aren't potential matches are still used during inference, where they're used to inform the distribution of comparisons for non-matches.
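The kind of summary this relies on can be sketched as follows, in the spirit of ab and ab_not_included. This is a simplified illustration with a single field and a single file pair; the toy comparisons matrix and the names counts_kept and counts_not_included are assumptions, and the package's own counts are additionally broken down by file pair.

```r
# Toy comparisons: one field with 3 disagreement levels, 4 record pairs.
comparisons <- matrix(c(TRUE,  FALSE, FALSE,
                        FALSE, TRUE,  FALSE,
                        FALSE, FALSE, TRUE,
                        FALSE, FALSE, TRUE),
                      ncol = 3, byrow = TRUE,
                      dimnames = list(NULL, c("fname_DL_1", "fname_DL_2",
                                              "fname_DL_3")))

# Index by dropping pairs that disagree at the highest level.
pairs_to_keep <- !comparisons[, "fname_DL_3"]

# Counts per disagreement level for kept and excluded record pairs.
counts_kept <- colSums(comparisons[pairs_to_keep, , drop = FALSE])
counts_not_included <- colSums(comparisons[!pairs_to_keep, , drop = FALSE])
```

The two count vectors add up to the column sums of the full comparisons matrix, so no comparison information is lost by excluding pairs.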
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2021.2013242
# Example with small duplicate dataset
library(multilink)
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following lines keep only the pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
  (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
  pairs_to_keep, cc = 1)