# compareRecords

##### Creation of Comparison Data

Create comparison vectors for all pairs of records coming from two datafiles to be linked.

##### Usage

```
compareRecords(
  df1,
  df2,
  flds = NULL,
  flds1 = NULL,
  flds2 = NULL,
  types = NULL,
  breaks = c(0, 0.25, 0.5)
)
```

##### Arguments

- df1, df2
  two datasets to be linked, of class `data.frame`, with rows representing records and columns representing fields. Without loss of generality, `df1` is assumed to have no fewer records than `df2`.
- flds
  a vector indicating the fields to be used in the linkage. Either a `character` vector, in which case all entries need to be names of columns of `df1` and `df2`, or a `numeric` vector indicating the columns in `df1` and `df2` to be used in the linkage. If provided as a `numeric` vector, it is assumed that the columns of `df1` and `df2` are organized such that it makes sense to compare the columns `df1[,flds]` and `df2[,flds]` in that order.
- flds1, flds2
  vectors indicating the fields of `df1` and `df2` to be used in the linkage. Either `character` vectors, in which case all entries need to be names of columns of `df1` and `df2`, respectively, or `numeric` vectors indicating the columns in `df1` and `df2` to be used in the linkage. It is assumed that it makes sense to compare the columns `df1[,flds1]` and `df2[,flds2]` in that order. These arguments are ignored if `flds` is specified. If none of `flds`, `flds1`, `flds2` is specified, the columns with the same names in `df1` and `df2` are compared, if any.
- types
  a `character` vector indicating the comparison type per comparison field. The options are:
  - `"lv"` for comparisons based on the Levenshtein edit distance normalized to \([0,1]\), with \(0\) indicating no disagreement and \(1\) indicating maximum disagreement;
  - `"bi"` for binary comparisons (agreement/disagreement);
  - `"nu"` for numeric comparisons computed as the absolute difference.

  The default is `"lv"`. Fields compared with the `"lv"` option are first transformed to `character` class. Factors with different levels compared using the `"bi"` option are transformed to factors with the union of the levels. Fields compared with the `"nu"` option need to be of class `numeric`.
- breaks
  break points for the comparisons to obtain levels of disagreement. It can take one of three forms:
  - a list of length equal to the number of comparison fields, containing one numeric vector with the break points for each comparison field, where entries corresponding to comparison type `"bi"` are ignored;
  - a named list of length two with elements `lv` and `nu` containing numeric vectors with the break points for all Levenshtein-based and numeric comparisons, respectively;
  - a numeric vector with the break points for all comparison fields of type `"lv"` and `"nu"`, which might be meaningful only if all the non-binary comparisons are of a single type, either `"lv"` or `"nu"`.

  For comparisons based on the normalized Levenshtein distance, a vector of length \(L\) of break points for the interval \([0,1]\) leads to \(L+1\) levels of disagreement. Similarly, for comparisons based on the absolute difference, the break points are for the interval \([0,\infty)\). The default is `breaks=c(0,.25,.5)`, which might be meaningful only for comparisons of type `"lv"`.
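To make the role of `breaks` concrete, here is a minimal base-R sketch of how \(L\) break points turn a raw comparison into one of \(L+1\) levels of disagreement. It does not call the package. The helpers `norm_lv` and `to_levels` are hypothetical names, and two details are assumptions rather than documented behavior: the edit distance is normalized by the length of the longer string, and the intervals are closed on the right.

```r
# Normalized Levenshtein distance, element-wise over two character vectors.
# Assumption: divide the edit distance by the length of the longer string.
norm_lv <- function(a, b) {
  d <- mapply(function(x, y) utils::adist(x, y)[1, 1], a, b)
  unname(d / pmax(nchar(a), nchar(b)))
}

# Map raw comparison values to levels 1, ..., L+1 given L break points.
# Assumption: intervals are (-Inf, b1], (b1, b2], ..., (bL, Inf).
to_levels <- function(x, breaks) {
  as.integer(cut(x, breaks = c(-Inf, breaks, Inf)))
}

# "lv"-style comparison with the default break points
d_lv <- norm_lv(c("maria", "john", "ana"), c("maria", "jon", "luis"))
lv_levels <- to_levels(d_lv, c(0, 0.25, 0.5))

# "nu"-style comparison: absolute difference with break points 0:3
d_nu <- abs(c(30, 41, 20, 50) - c(30, 40, 22.5, 57))
nu_levels <- to_levels(d_nu, 0:3)
```

With three break points the `"lv"` field has four levels, and with four break points the `"nu"` field has five, matching the \(L+1\) rule stated above.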

##### Value

a list containing:

- `comparisons`
  matrix with `n1*n2` rows, where the comparison pattern for record pair \((i,j)\) appears in row `(j-1)*n1+i`, for \(i\) in \(\{1,\dots,n_1\}\) and \(j\) in \(\{1,\dots,n_2\}\). A comparison field with \(L+1\) levels of disagreement is represented by \(L+1\) columns of TRUE/FALSE indicators. Missing comparisons are coded as FALSE, which is justified under an assumption of ignorability of the missing comparisons; see Sadinle (2017).
- `n1`, `n2`
  the datafile sizes, `n1 = nrow(df1)` and `n2 = nrow(df2)`.
- `nDisagLevs`
  a vector containing the number of levels of disagreement per comparison field.
- `compFields`
  a data frame containing the names of the fields in the datafiles used in the comparisons and the types of comparison.
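The row layout and indicator coding described above can be sketched in base R without the package. The helper names `pair_row`, `row_to_pair` and `one_hot` are hypothetical, and the column order of the indicators (one column per level, in increasing order of disagreement) is an assumption made for illustration.

```r
# Row of the comparisons matrix holding record pair (i, j),
# following the documented formula (j - 1) * n1 + i.
pair_row <- function(i, j, n1) (j - 1) * n1 + i

# Inverse mapping: recover (i, j) from a row index.
row_to_pair <- function(row, n1) {
  c(i = (row - 1) %% n1 + 1, j = (row - 1) %/% n1 + 1)
}

n1 <- 3
n2 <- 2
# expand.grid varies i fastest, which is the same ordering as the formula
pairs <- expand.grid(i = seq_len(n1), j = seq_len(n2))
pairs$row <- pair_row(pairs$i, pairs$j, n1)

# Indicator coding of one field with n_levels levels of disagreement:
# one TRUE per row, except missing comparisons, which become all FALSE.
one_hot <- function(levels, n_levels) {
  m <- outer(levels, seq_len(n_levels), "==")
  m[is.na(m)] <- FALSE
  m
}
ind <- one_hot(c(1, 2, NA, 4), n_levels = 4)
```

Here `pairs$row` runs from 1 to `n1*n2`, and the third row of `ind` is entirely FALSE because that comparison is missing.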

##### References

Mauricio Sadinle (2017). Bayesian Estimation of Bipartite Matchings for Record Linkage. *Journal of the American Statistical Association* 112(518), 600-612.

##### Examples

```
library(BRL)

## load the example datafiles df1 and df2
data(twoFiles)

myCompData <- compareRecords(df1, df2,
                             flds = c("gname", "fname", "age", "occup"),
                             types = c("lv", "lv", "bi", "bi"),
                             breaks = c(0, .25, .5))

## same as
myCompData <- compareRecords(df1, df2, types = c("lv", "lv", "bi", "bi"))

## let's transform 'occup' to numeric to illustrate how to obtain numeric comparisons
df1$occup <- as.numeric(df1$occup)
df2$occup <- as.numeric(df2$occup)

## using different break points for 'lv' and 'nu' comparisons
myCompData1 <- compareRecords(df1, df2,
                              flds = c("gname", "fname", "age", "occup"),
                              types = c("lv", "lv", "bi", "nu"),
                              breaks = list(lv = c(0, .25, .5), nu = 0:3))

## using different break points for each comparison field
## (the entry for the 'bi' field is ignored, so NULL is fine)
myCompData2 <- compareRecords(df1, df2,
                              flds = c("gname", "fname", "age", "occup"),
                              types = c("lv", "lv", "bi", "nu"),
                              breaks = list(c(0, .25, .5), c(0, .2, .4, .6), NULL, 0:3))
```
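As a quick sanity check on the returned list, the following sketch compares the dimensions of `comparisons` with the other elements described under Value. It runs only if the BRL package is installed. The column-count check is inferred from the statement that each field contributes one indicator column per level of disagreement, not from a documented guarantee. The object name `cd` is arbitrary.

```r
if (requireNamespace("BRL", quietly = TRUE)) {
  library(BRL)
  data(twoFiles)  # loads the example datafiles df1 and df2

  cd <- compareRecords(df1, df2,
                       flds = c("gname", "fname", "age", "occup"),
                       types = c("lv", "lv", "bi", "bi"))

  # one row per record pair
  stopifnot(nrow(cd$comparisons) == cd$n1 * cd$n2)
  # one indicator column per level of disagreement, summed over fields
  stopifnot(ncol(cd$comparisons) == sum(cd$nDisagLevs))

  print(cd$nDisagLevs)   # levels of disagreement per comparison field
  print(cd$compFields)   # field names and comparison types used
}
```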

*Documentation reproduced from package BRL, version 0.1.0, License: GPL-3*