Create comparison data for all pairs of records, except for those records in files which are assumed to have no duplicates.
create_comparison_data(
  records,
  types,
  breaks,
  file_sizes,
  duplicates,
  verbose = TRUE
)
A list containing:
record_pairs
A data.frame, where each row contains the pair of records being compared in the corresponding row of comparisons. The rows are sorted in ascending order according to the first column, with ties broken according to the second column in ascending order. For any given row, the first column is less than the second column, i.e. record_pairs[i, 1] < record_pairs[i, 2] for each row i.
comparisons
A logical matrix, where each row contains the comparisons for the record pair in the corresponding row of record_pairs. Comparisons are in the same order as the columns of records, and each field is represented by L + 1 columns of TRUE/FALSE indicators, where L + 1 is the number of disagreement levels for that field based on breaks.
K
The number of files, assumed to be of class numeric.
file_sizes
A numeric vector of length K, indicating the size of each file.
duplicates
A numeric vector of length K, indicating which files are assumed to have duplicates. duplicates[k] should be 1 if file k has duplicates, and duplicates[k] should be 0 if file k has no duplicates. If any files do not have duplicates, we strongly recommend ordering the files so that the largest such file comes first.
field_levels
A numeric vector indicating the number of disagreement levels for each field.
file_labels
An integer vector of length sum(file_sizes), where file_labels[i] indicates which file record i is in.
fp_matrix
An integer matrix, where fp_matrix[k1, k2] is a label for the file pair (k1, k2). Note that fp_matrix[k1, k2] = fp_matrix[k2, k1].
rp_to_fp
A logical matrix that indicates which record pairs belong to which file pairs. rp_to_fp[fp, rp] is TRUE if the records record_pairs[rp, ] belong to the file pair fp, and is FALSE otherwise. Note that fp is given by the labeling in fp_matrix.
ab
An integer vector, of length ncol(comparisons) * K * (K + 1) / 2, that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.
file_sizes_not_included
A numeric vector of 0s. This element is non-zero when reduce_comparison_data is used.
ab_not_included
A numeric vector of 0s. This element is non-zero when reduce_comparison_data is used.
labels
NA. This element is not NA when reduce_comparison_data is used.
pairs_to_keep
NA. This element is not NA when reduce_comparison_data is used.
cc
0. This element is non-zero when reduce_comparison_data is used.
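The layout of file_labels and record_pairs described above can be illustrated in base R. This is a minimal sketch on made-up inputs, not the package's implementation: the names all_pairs, same_file, keep, and n_file_pairs are hypothetical, and the sketch reads the description as dropping within-file pairs for files assumed to have no duplicates.

```r
# Toy inputs (assumed for illustration): three files of sizes 2, 3 and 2,
# where only the second file is assumed to contain duplicates.
file_sizes <- c(2, 3, 2)
duplicates <- c(0, 1, 0)
K <- length(file_sizes)

# file_labels[i] gives the file that record i belongs to.
file_labels <- rep(seq_len(K), times = file_sizes)

# All pairs (i, j) with i < j; combn() returns them sorted by the first
# index, with ties broken by the second index.
all_pairs <- t(combn(sum(file_sizes), 2))

# Drop pairs of records that come from the same file when that file is
# assumed to have no duplicates (hypothetical filtering step).
same_file <- file_labels[all_pairs[, 1]] == file_labels[all_pairs[, 2]]
keep <- !same_file | duplicates[file_labels[all_pairs[, 1]]] == 1
record_pairs <- as.data.frame(all_pairs[keep, , drop = FALSE])

# Number of file pairs, counting each file paired with itself; this is the
# K * (K + 1) / 2 factor in the length of ab.
n_file_pairs <- K * (K + 1) / 2
```

With these toy inputs, 2 of the 21 possible pairs are dropped (one within file 1 and one within file 3), and every remaining row satisfies record_pairs[i, 1] < record_pairs[i, 2].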
records
A data.frame containing the records to be linked, where each column of records is a field to be compared. If there are multiple files, records should be obtained by stacking the files on top of each other so that records[1:file_sizes[1], ] contains the records for file 1, records[(file_sizes[1] + 1):(file_sizes[1] + file_sizes[2]), ] contains the records for file 2, and so on. Missing values should be coded as NA.
types
A character vector, indicating the comparison to be used for each field (i.e. each column of records). The options are: "bi" for binary comparisons, "nu" for numeric comparisons (absolute difference), "lv" for string comparisons (normalized Levenshtein distance), and "lv_sep" for string comparisons (normalized Levenshtein distance) where each string may contain multiple spellings separated by the "|" character. We assume that fields using options "bi", "lv", and "lv_sep" are of class character, and fields using the "nu" option are of class numeric. For fields using the "lv_sep" option, for each record pair the normalized Levenshtein distance is computed between each possible pair of spellings, and the minimum of these distances is then used as the comparison for that record pair.
breaks
A list, the same length as types, indicating the break points used to compute disagreement levels for each field's comparisons. If types[f] = "bi", breaks[[f]] is ignored (and thus can be set to NA). See Details for more information on specifying this argument.
file_sizes
A numeric vector indicating the size of each file.
duplicates
A numeric vector indicating which files are assumed to have duplicates. duplicates[k] should be 1 if file k has duplicates, and duplicates[k] should be 0 if file k has no duplicates. If any files do not have duplicates, we strongly recommend ordering the files so that the largest such file comes first.
verbose
A logical indicator of whether progress messages should be printed (default TRUE).
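As an illustration of how the arguments fit together, the following sketch stacks two files and builds matching file_sizes, types, breaks, and duplicates vectors. The data frames file_1 and file_2, their field names, and the chosen break points are all made up for this example.

```r
# Two hypothetical files with the same fields (names and values are made up).
file_1 <- data.frame(name = c("ann", "bob"), age = c(30, 41),
                     stringsAsFactors = FALSE)
file_2 <- data.frame(name = c("anne", NA, "carl"), age = c(30, 25, NA),
                     stringsAsFactors = FALSE)

# Stack the files in order and record how many rows each contributed.
records <- rbind(file_1, file_2)
file_sizes <- c(nrow(file_1), nrow(file_2))

# Rows belonging to file 2, following the indexing convention above.
file_2_rows <- (file_sizes[1] + 1):(file_sizes[1] + file_sizes[2])
records[file_2_rows, ]

# One comparison type and one element of breaks per field (illustrative
# choices): a string comparison for name, a numeric comparison for age.
types <- c("lv", "nu")
breaks <- list(c(0, 0.25, 0.5), c(0, 1, 5))

# Both files are assumed to be free of duplicates in this sketch.
duplicates <- c(0, 0)
```

Missing values are left as NA, and the character field is kept as character rather than factor, in line with the class assumptions stated for types.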
The purpose of this function is to construct comparison vectors for each pair of records. In order to construct these vectors, one needs to specify the types and breaks arguments. The types argument specifies how each field should be compared, and the breaks argument specifies how to discretize these comparisons.
Currently, the types argument supports three types of field comparisons: binary, absolute difference, and the normalized Levenshtein distance (the "lv_sep" option is a variant of the latter that allows multiple spellings per string). Please contact the package maintainer if you need a new type of comparison to be supported.
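The comparison types can be sketched in base R as follows. These helper functions (compare_bi, compare_nu, compare_lv, compare_lv_sep) are hypothetical and are not the package's implementation. In particular, the 0/1 coding of the binary comparison and the normalization of the Levenshtein distance by the length of the longer string are assumptions of this sketch.

```r
# Binary comparison: 0 if the values agree, 1 otherwise (coding assumed here).
compare_bi <- function(a, b) as.numeric(a != b)

# Numeric comparison: absolute difference.
compare_nu <- function(a, b) abs(a - b)

# String comparison: Levenshtein distance divided by the length of the
# longer string (this normalization is an assumption of the sketch).
compare_lv <- function(a, b) {
  drop(utils::adist(a, b)) / max(nchar(a), nchar(b))
}

# "lv_sep"-style comparison: split each string on "|" and take the minimum
# normalized distance over every combination of spellings.
compare_lv_sep <- function(a, b) {
  spellings_a <- strsplit(a, "|", fixed = TRUE)[[1]]
  spellings_b <- strsplit(b, "|", fixed = TRUE)[[1]]
  min(outer(spellings_a, spellings_b, Vectorize(compare_lv)))
}

compare_bi("F", "M")                        # 1
compare_nu(1980, 1983)                      # 3
compare_lv("smith", "smyth")                # 0.2
compare_lv_sep("jon|john", "john|johnny")   # 0
```

In the last call the spelling "john" appears on both sides, so the minimum distance over all spelling combinations is 0.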
The breaks argument should be a list, with one element for each field. If a field is being compared with a binary comparison, i.e. types[f] = "bi", then the corresponding element of breaks should be NA, i.e. breaks[[f]] = NA. If a field is being compared with a numeric or string comparison, then the corresponding element of breaks should be a vector of cut points used to discretize the comparisons. To give more detail, suppose you pass in cut points breaks[[f]] = c(cut_1, ..., cut_L). These cut points discretize the range of the comparisons into L+1 intervals: \(I_0=(-\infty, cut_1], I_1=(cut_1, cut_2], \ldots, I_L=(cut_L, \infty)\). The raw comparisons, which lie in \([0,\infty)\) for numeric comparisons and \([0,1]\) for string comparisons, are then replaced with indicators of which interval the comparisons lie in. The interval \(I_0\) corresponds to the lowest level of disagreement for a comparison, while the interval \(I_L\) corresponds to the highest level of disagreement for a comparison.
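The discretization described above can be sketched with base R's findInterval(). The helper names disagreement_level and level_indicators are hypothetical, and this is an illustration of the interval definition rather than the package's own code.

```r
# Map raw comparisons to disagreement levels 0, ..., L, where level l means
# the value falls in I_l. left.open = TRUE makes the intervals right-closed,
# matching (cut_l, cut_{l+1}].
disagreement_level <- function(x, cuts) {
  findInterval(x, cuts, left.open = TRUE)
}

# One TRUE/FALSE indicator column per level, i.e. L + 1 columns in total.
level_indicators <- function(x, cuts) {
  lvl <- disagreement_level(x, cuts)
  outer(lvl, 0:length(cuts), "==")
}

cuts <- c(0, 0.25, 0.5)
raw <- c(0, 0.1, 0.25, 0.4, 0.5, 0.9)
disagreement_level(raw, cuts)  # 0 1 1 2 2 3
level_indicators(raw, cuts)    # 6 x 4 logical matrix, one TRUE per row
```

With cut points c(0, 0.25, 0.5), a raw comparison of exactly 0 lands in the lowest level, and anything above 0.5 lands in the highest of the four levels.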
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2021.2013242 [arXiv]
## Example with small no duplicate dataset
data(no_dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = no_dup_data_small$file_sizes,
  duplicates = c(0, 0, 0))
## Example with small duplicate dataset
data(dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
  types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
  breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
  file_sizes = dup_data_small$file_sizes,
  duplicates = c(1, 1, 1))