compareRecords: Creation of Comparison Data

Description

Create comparison vectors for all pairs of records coming from two datafiles to be linked.

Usage

compareRecords(
  df1,
  df2,
  flds = NULL,
  flds1 = NULL,
  flds2 = NULL,
  types = NULL,
  breaks = c(0, 0.25, 0.5)
)

Arguments

df1, df2

two datasets to be linked, of class data.frame, with rows representing records and columns representing fields. Without loss of generality, df1 is assumed to have no fewer records than df2.

flds

a vector indicating the fields to be used in the linkage. Either a character vector, in which case all entries need to be names of columns of df1 and df2, or a numeric vector indicating the columns in df1 and df2 to be used in the linkage. If provided as a numeric vector it is assumed that the columns of df1 and df2 are organized such that it makes sense to compare the columns df1[,flds] and df2[,flds] in that order.

flds1, flds2

vectors indicating the fields of df1 and df2 to be used in the linkage. Either character vectors, in which case all entries need to be names of columns of df1 and df2, respectively, or numeric vectors indicating the columns in df1 and df2 to be used in the linkage. It is assumed that it makes sense to compare the columns df1[,flds1] and df2[,flds2] in that order. These arguments are ignored if flds is specified. If none of flds, flds1, flds2 is specified, the columns with the same names in df1 and df2, if any, are compared.
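
The default field selection described above can be pictured with base R alone. This is only a sketch: the toy data frames a and b and their column names are invented for illustration, and intersect() is an assumed stand-in for however the package actually matches column names.

```r
# Toy data frames; the column names are hypothetical, not from the package.
a <- data.frame(gname = "ana",  fname = "diaz", zip = "27708")
b <- data.frame(fname = "dias", gname = "anna", phone = "555-0100")

# Columns present under the same name in both files.
# intersect() keeps the order in which the names appear in 'a'.
common <- intersect(names(a), names(b))
common
```

Here zip and phone are left out because each appears in only one of the two files.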

types

a character vector indicating the comparison type per comparison field. The options are: "lv" for comparisons based on the Levenshtein edit distance normalized to \([0,1]\), with \(0\) indicating no disagreement and \(1\) indicating maximum disagreement; "bi" for binary comparisons (agreement/disagreement); "nu" for numeric comparisons computed as the absolute difference. The default is "lv". Fields compared with the "lv" option are first transformed to character class. Factors with different levels compared using the "bi" option are transformed to factors with the union of the levels. Fields compared with the "nu" option need to be of class numeric.
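
To make the three comparison types concrete, here is a minimal sketch for a single pair of values. The helper normLev is hypothetical, and dividing the edit distance by the length of the longer string is an assumption (one common normalization); the package may normalize differently.

```r
# Hypothetical helper: Levenshtein distance scaled to [0, 1].
# Assumes normalization by the longer string's length; works on one pair.
# Two empty strings would give 0/0 = NaN and are not handled here.
normLev <- function(s, t) {
  s <- as.character(s)
  t <- as.character(t)
  as.numeric(utils::adist(s, t)) / max(nchar(s), nchar(t))
}

lvSame <- normLev("smith", "smith")  # identical strings: no disagreement
lvNear <- normLev("smith", "smyth")  # one substitution out of five characters
lvFar  <- normLev("abc", "xyz")      # every character differs

# "bi": plain agreement/disagreement.
biDisagree <- "clerk" != "nurse"

# "nu": absolute difference of numeric values.
nuDiff <- abs(34 - 31)
```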

breaks

break points for the comparisons to obtain levels of disagreement. It can be a list of length equal to the number of comparison fields, containing one numeric vector with the break points for each comparison field, where entries corresponding to comparison type "bi" are ignored. It can also be a named list of length two with elements 'lv' and 'nu' containing numeric vectors with the break points for all Levenshtein-based and numeric comparisons, respectively. Finally, it can be a numeric vector with the break points for all comparison fields of type "lv" and "nu", which might be meaningful only if all the non-binary comparisons are of a single type, either "lv" or "nu". For comparisons based on the normalized Levenshtein distance, a vector of length \(L\) of break points for the interval \([0,1]\) leads to \(L+1\) levels of disagreement. Similarly, for comparisons based on the absolute difference, the break points are for the interval \([0,\infty)\). The default is breaks=c(0,.25,.5), which might be meaningful only for comparisons of type "lv".
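
As an illustration of how \(L\) break points yield \(L+1\) levels, the sketch below bins raw comparison values with base R's cut(). The helper toLevels is hypothetical, and the use of right-closed intervals (so that a value equal to a break point falls in the lower level) is an assumption rather than a documented property of the package.

```r
# Hypothetical helper: map raw comparison values to disagreement levels.
# Assumes right-closed intervals (-Inf, b1], (b1, b2], ..., (bL, Inf).
toLevels <- function(d, breaks) {
  cut(d, breaks = c(-Inf, breaks, Inf), labels = FALSE)
}

# Normalized Levenshtein values with the default break points:
# 3 break points give 4 possible levels.
lvLevels <- toLevels(c(0, 0.2, 0.25, 0.6), breaks = c(0, 0.25, 0.5))

# Absolute differences with break points 0:3:
# 4 break points give 5 possible levels.
nuLevels <- toLevels(c(0, 1, 2.5, 10), breaks = 0:3)
```

Under these assumptions level 1 corresponds to exact agreement, and the last level collects everything beyond the largest break point.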

Value

a list containing:

comparisons: matrix with n1*n2 rows, where the comparison pattern for record pair \((i,j)\) appears in row (j-1)*n1+i, for \(i\) in \(\{1,\dots,n1\}\) and \(j\) in \(\{1,\dots,n2\}\). A comparison field with \(L+1\) levels of disagreement is represented by \(L+1\) columns of TRUE/FALSE indicators. Missing comparisons are coded as FALSE, which is justified under an assumption of ignorability of the missing comparisons; see Sadinle (2017).
n1,n2: the datafile sizes, n1 = nrow(df1) and n2 = nrow(df2).
nDisagLevs: a vector containing the number of levels of disagreement per comparison field.
compFields: a data frame containing the names of the fields in the datafiles used in the comparisons and the types of comparison.
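
The row layout and the indicator coding of comparisons can be sketched in a few lines of base R. The helpers pairRow and rowPair are hypothetical and are not part of the package; the indicator matrix is built by hand only to show the shape described above.

```r
# Hypothetical helpers for the row layout: pair (i, j) sits in row (j-1)*n1 + i.
pairRow <- function(i, j, n1) (j - 1) * n1 + i
rowPair <- function(r, n1) c(i = (r - 1) %% n1 + 1, j = (r - 1) %/% n1 + 1)

n1 <- 3
n2 <- 2
r <- pairRow(2, 2, n1)   # record 2 of df1 vs record 2 of df2
ij <- rowPair(r, n1)     # recover the pair from the row number

# The same ordering as expand.grid, where i varies fastest.
grid <- expand.grid(i = seq_len(n1), j = seq_len(n2))

# Indicator coding for one field with 4 levels of disagreement.
# A missing comparison (NA) becomes a row of all FALSE.
lev <- c(1, 3, NA, 4)
ind <- outer(lev, 1:4, "==")
ind[is.na(ind)] <- FALSE
```

Each non-missing row of ind has exactly one TRUE, in the column of its disagreement level.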

References

Mauricio Sadinle (2017). Bayesian Estimation of Bipartite Matchings for Record Linkage. Journal of the American Statistical Association 112(518), 600-612.

Examples

# NOT RUN {
data(twoFiles)

myCompData <- compareRecords(df1, df2, 
                             flds=c("gname", "fname", "age", "occup"),
                             types=c("lv","lv","bi","bi"), 
                             breaks=c(0,.25,.5))

## same as 
myCompData <- compareRecords(df1, df2, types=c("lv","lv","bi","bi"))


## let's transform 'occup' to numeric to illustrate how to obtain numeric comparisons 
df1$occup <- as.numeric(df1$occup)
df2$occup <- as.numeric(df2$occup)

## using different break points for 'lv' and 'nu' comparisons 
myCompData1 <- compareRecords(df1, df2, 
                              flds=c("gname", "fname", "age", "occup"),
                              types=c("lv","lv","bi","nu"), 
                              breaks=list(lv=c(0,.25,.5), nu=0:3))

## using different break points for each comparison field
myCompData2 <- compareRecords(df1, df2, 
                              flds=c("gname", "fname", "age", "occup"),
                              types=c("lv","lv","bi","nu"), 
                              breaks=list(c(0,.25,.5), c(0,.2,.4,.6), NULL, 0:3))
# }