create_comparison_data: Create Comparison Data

Description

Create comparison data for all pairs of records, except for pairs of records that lie within the same file when that file is assumed to have no duplicates.

Usage

create_comparison_data(
  records,
  types,
  breaks,
  file_sizes,
  duplicates,
  verbose = TRUE
)

Value

a list containing:

record_pairs: A data.frame, where each row contains the pair of records being compared in the corresponding row of comparisons. The rows are sorted in ascending order according to the first column, with ties broken according to the second column in ascending order. For any given row, the first column is less than the second column, i.e. record_pairs[i, 1] < record_pairs[i, 2] for each row i.
comparisons: A logical matrix, where each row contains the comparisons for the record pair in the corresponding row of record_pairs. Comparisons are in the same order as the columns of records, and are represented by L + 1 columns of TRUE/FALSE indicators, where L + 1 is the number of disagreement levels for the field based on breaks.
K: The number of files, assumed to be of class numeric.
file_sizes: A numeric vector of length K, indicating the size of each file.
duplicates: A numeric vector of length K, indicating which files are assumed to have duplicates. duplicates[k] should be 1 if file k has duplicates, and duplicates[k] should be 0 if file k has no duplicates. If any files do not have duplicates, we strongly recommend that the largest such file is organized to be the first file.
field_levels: A numeric vector indicating the number of disagreement levels for each field.
file_labels: An integer vector of length sum(file_sizes), where file_labels[i] indicates which file record i is in.
fp_matrix: An integer matrix, where fp_matrix[k1, k2] is a label for the file pair (k1, k2). Note that fp_matrix[k1, k2] = fp_matrix[k2, k1].
rp_to_fp: A logical matrix that indicates which record pairs belong to which file pairs. rp_to_fp[fp, rp] is TRUE if the records record_pairs[rp, ] belong to the file pair fp, and is FALSE otherwise. Note that fp is given by the labeling in fp_matrix.
ab: An integer vector, of length ncol(comparisons) * K * (K + 1) / 2 that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.
file_sizes_not_included: A numeric vector of 0s. This element is non-zero when reduce_comparison_data is used.
ab_not_included: A numeric vector of 0s. This element is non-zero when reduce_comparison_data is used.
labels: NA. This element is not NA when reduce_comparison_data is used.
pairs_to_keep: NA. This element is not NA when reduce_comparison_data is used.
cc: 0. This element is non-zero when reduce_comparison_data is used.
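
To make the layout of file_labels and record_pairs more concrete, here is a minimal base R sketch that rebuilds them for a toy configuration. The inputs are hypothetical, and the filtering rule (drop pairs within a single file that is assumed to have no duplicates) is a reading of the Description, not the package's actual implementation.

```r
# Hypothetical toy inputs: 3 files of sizes 2, 2 and 1; only file 2 may contain duplicates.
file_sizes <- c(2, 2, 1)
duplicates <- c(0, 1, 0)

# file_labels[i] is the file that record i belongs to.
file_labels <- rep(seq_along(file_sizes), times = file_sizes)

# Enumerate all pairs (i, j) with i < j; combn() lists them sorted by i, then by j.
all_pairs <- t(combn(sum(file_sizes), 2))

# Assumed rule: drop pairs that fall within one file assumed to have no duplicates.
same_file <- file_labels[all_pairs[, 1]] == file_labels[all_pairs[, 2]]
no_dups <- duplicates[file_labels[all_pairs[, 1]]] == 0
record_pairs <- as.data.frame(all_pairs[!(same_file & no_dups), , drop = FALSE])
```

In this toy case the pair (1, 2) is dropped because both records are in file 1, which is assumed duplicate-free, while the pair (3, 4) within file 2 is kept.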

Arguments

records: A data.frame containing the records to be linked, where each column of records is a field to be compared. If there are multiple files, records should be obtained by stacking the files on top of each other so that records[1:file_sizes[1], ] contains the records for file 1, records[(file_sizes[1] + 1):(file_sizes[1] + file_sizes[2]), ] contains the records for file 2, and so on. Missing values should be coded as NA.
types: A character vector, indicating the comparison to be used for each field (i.e. each column of records). The options are: "bi" for binary comparisons, "nu" for numeric comparisons (absolute difference), "lv" for string comparisons (normalized Levenshtein distance), "lv_sep" for string comparisons (normalized Levenshtein distance) where each string may contain multiple spellings separated by the "|" character. We assume that fields using options "bi", "lv", and "lv_sep" are of class character, and fields using the "nu" option are of class numeric. For fields using the "lv_sep" option, for each record pair the normalized Levenshtein distance is computed between each possible spelling, and the minimum normalized Levenshtein distance between spellings is then used as the comparison for that record pair.
breaks: A list, the same length as types, indicating the break points used to compute disagreement levels for each field's comparisons. If types[f]="bi", breaks[[f]] is ignored (and thus can be set to NA). See Details for more information on specifying this argument.
file_sizes: A numeric vector indicating the size of each file.
duplicates: A numeric vector indicating which files are assumed to have duplicates. duplicates[k] should be 1 if file k has duplicates, and duplicates[k] should be 0 if file k has no duplicates. If any files do not have duplicates, we strongly recommend that the largest such file is organized to be the first file.
verbose: A logical indicator of whether progress messages should be printed (default TRUE).
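
The stacking convention for records and file_sizes can be sketched as follows. The two toy files and their column names are hypothetical; the only assumption is that every file has the same fields in the same order.

```r
# Hypothetical toy files with identical columns; missing values are coded as NA.
file_1 <- data.frame(first = c("ana", "bob"), age = c(31, 45),
                     stringsAsFactors = FALSE)
file_2 <- data.frame(first = c("anna", NA, "carl"), age = c(31, 52, NA),
                     stringsAsFactors = FALSE)

# Stack the files on top of each other, file 1 first.
files <- list(file_1, file_2)
records <- do.call(rbind, files)

# file_sizes[k] is the number of rows contributed by file k.
file_sizes <- as.numeric(sapply(files, nrow))
```

With this layout, records[1:file_sizes[1], ] holds file 1 and the remaining rows hold file 2.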

Details

The purpose of this function is to construct comparison vectors for each pair of records. In order to construct these vectors, one needs to specify the types and breaks arguments. The types argument specifies how each field should be compared, and the breaks argument specifies how to discretize these comparisons.

Currently, the types argument supports three types of field comparisons: binary, absolute difference, and the normalized Levenshtein distance. Please contact the package maintainer if you need a new type of comparison to be supported.
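
As a rough illustration of the string comparisons, the sketch below computes a normalized Levenshtein distance with base R's adist(). The helper names norm_lv and norm_lv_sep are hypothetical, and the normalization used here (edit distance divided by the length of the longer string) is an assumption; the package may normalize differently.

```r
# Assumed normalization: edit distance divided by the length of the longer string.
norm_lv <- function(a, b) {
  as.numeric(utils::adist(a, b)) / max(nchar(a), nchar(b))
}

# "lv_sep"-style comparison: split each value on "|", compare every pair of
# spellings, and keep the smallest normalized distance.
norm_lv_sep <- function(a, b) {
  spellings_a <- strsplit(a, "|", fixed = TRUE)[[1]]
  spellings_b <- strsplit(b, "|", fixed = TRUE)[[1]]
  d <- utils::adist(spellings_a, spellings_b)
  longest <- outer(nchar(spellings_a), nchar(spellings_b), pmax)
  min(d / longest)
}

norm_lv("smith", "smyth")        # one substitution over five characters
norm_lv_sep("jon|john", "john")  # the spelling "john" matches exactly
```

This sketch does not handle NA values or empty strings, which a real implementation would need to treat explicitly.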

The breaks argument should be a list with one element for each field. If a field is compared with a binary comparison, i.e. types[f]="bi", then the corresponding element of breaks should be NA, i.e. breaks[[f]]=NA. If a field is compared with a numeric or string comparison, then the corresponding element of breaks should be a vector of cut points used to discretize the comparisons. In more detail, suppose you pass in cut points breaks[[f]]=c(cut_1, ..., cut_L). These cut points discretize the range of the comparisons into L+1 intervals: \(I_0=(-\infty, cut_1], I_1=(cut_1, cut_2], ..., I_L=(cut_L, \infty)\). The raw comparisons, which lie in \([0,\infty)\) for numeric comparisons and \([0,1]\) for string comparisons, are then replaced with indicators of which interval the comparisons lie in. The interval \(I_0\) corresponds to the lowest level of disagreement for a comparison, while the interval \(I_L\) corresponds to the highest level of disagreement for a comparison.
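
The discretization described above can be reproduced in base R for a single field. This is only a sketch under the interval definition given here, using hypothetical raw comparison values; it is not taken from the package source.

```r
# Cut points as in the Examples, and some hypothetical raw string comparisons.
breaks_f <- c(0, 0.25, 0.5)
raw <- c(0, 0.2, 0.25, 0.4, 0.9)

# With left.open = TRUE, findInterval() counts the cut points strictly below each
# value, so level 0 corresponds to I_0 and level L to I_L.
level <- findInterval(raw, breaks_f, left.open = TRUE)

# One TRUE/FALSE indicator column per disagreement level (L + 1 columns in total).
indicators <- outer(level, 0:length(breaks_f), "==")
```

With these cut points, a raw comparison of exactly 0 falls in the lowest level, values in (0, 0.25] fall in the next level, and anything above 0.5 falls in the highest level.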

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. doi:10.1080/01621459.2021.2013242

Examples


# Load the package that provides create_comparison_data() and the example data
library(multilink)

## Example with small no-duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = no_dup_data_small$file_sizes,
 duplicates = c(0, 0, 0))

## Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
 types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
 breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),
               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA),
 file_sizes = dup_data_small$file_sizes,
 duplicates = c(1, 1, 1))
