# linkRecords

##### Bayes Estimates of Bipartite Matchings

Computes Bayes point estimates of bipartite matchings that can be obtained in closed form according to Theorems 1, 2 and 3 of Sadinle (2017).

##### Usage

`linkRecords(Zchain, n1, lFNM = 1, lFM1 = 1, lFM2 = 2, lR = Inf)`

##### Arguments

- Zchain
matrix as the output `$Z` of the function `bipartiteGibbs`, with `n2` rows and `nIter` columns containing a chain of draws from a posterior distribution on bipartite matchings. Each column indicates the records in datafile 1 to which the records in datafile 2 are matched according to that draw.

- n1
number of records in datafile 1.

- lFNM
individual loss of a false non-match in the loss functions of Sadinle (2017), default `lFNM=1`.

- lFM1
individual loss of a false match of type 1 in the loss functions of Sadinle (2017), default `lFM1=1`.

- lFM2
individual loss of a false match of type 2 in the loss functions of Sadinle (2017), default `lFM2=2`.

- lR
individual loss of 'rejecting', that is, of not making a linkage decision, in the loss functions of Sadinle (2017), default `lR=Inf`.
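To make the layout of `Zchain` concrete, here is a minimal sketch that uses a hand-made toy chain rather than real `bipartiteGibbs` output. It assumes that a value greater than `n1` in row `j` marks record `j` of datafile 2 as unmatched in that draw, mirroring the `n1+j` convention described for the return value. The names `toyZ` and `matchFreq` are hypothetical.

```r
# Toy chain (made-up values): n1 = 3 records in datafile 1,
# n2 = 2 records in datafile 2, nIter = 4 draws, one draw per column.
n1 <- 3
toyZ <- matrix(c(1, 5,    # draw 1: record 1 -> 1, record 2 unmatched (n1 + 2 = 5)
                 1, 2,    # draw 2: record 1 -> 1, record 2 -> 2
                 4, 2,    # draw 3: record 1 unmatched (n1 + 1 = 4), record 2 -> 2
                 1, 2),   # draw 4: record 1 -> 1, record 2 -> 2
               nrow = 2, ncol = 4)

# Relative frequency of each link option, per record in datafile 2
matchFreq <- lapply(seq_len(nrow(toyZ)),
                    function(j) prop.table(table(toyZ[j, ])))
matchFreq[[1]]
```

Each element of `matchFreq` summarises one row of the chain, which is the same kind of summary the Examples section computes for undecided records.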

##### Details

Not all combinations of losses `lFNM, lFM1, lFM2, lR` are supported. The losses have to be positive numbers and satisfy one of three conditions:

Conditions of Theorem 1 of Sadinle (2017):

`(lR == Inf) & (lFNM <= lFM1) & (lFNM + lFM1 <= lFM2)`

Conditions of Theorem 2 of Sadinle (2017):

`((lFM2 >= lFM1) & (lFM1 >= 2*lR)) | ((lFM1 >= lFNM) & (lFM2 >= lFM1 + lFNM))`

Conditions of Theorem 3 of Sadinle (2017):

`(lFM2 >= lFM1) & (lFM1 >= 2*lR) & (lFNM >= 2*lR)`

If one of the last two conditions is satisfied, the point estimate might be partial, meaning that there might be some records in datafile 2 for which the point estimate does not include a linkage decision. For combinations of losses not supported here, the linear sum assignment problem outlined by Sadinle (2017) needs to be solved.
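The three conditions above can be checked before calling `linkRecords`. The following is a minimal sketch, assuming a hypothetical helper `checkLosses` that is not part of BRL. It only transcribes the documented expressions and makes no claim about which theorem `linkRecords` applies internally when several conditions hold at once.

```r
# Hypothetical helper (not part of BRL): report which of the documented
# conditions a combination of losses satisfies.
checkLosses <- function(lFNM = 1, lFM1 = 1, lFM2 = 2, lR = Inf) {
  losses <- c(lFNM = lFNM, lFM1 = lFM1, lFM2 = lFM2, lR = lR)
  if (any(losses <= 0)) stop("all losses must be positive")
  c(
    theorem1 = (lR == Inf) & (lFNM <= lFM1) & (lFNM + lFM1 <= lFM2),
    theorem2 = ((lFM2 >= lFM1) & (lFM1 >= 2*lR)) |
               ((lFM1 >= lFNM) & (lFM2 >= lFM1 + lFNM)),
    theorem3 = (lFM2 >= lFM1) & (lFM1 >= 2*lR) & (lFNM >= 2*lR)
  )
}

checkLosses()                                     # default losses
checkLosses(lR = 0.5)                             # losses of the partial-linkage example
checkLosses(lFNM = 2, lFM1 = 1, lFM2 = 1.5)       # no condition holds
```

If every entry of the result is `FALSE`, the combination falls outside the closed-form cases and the linear sum assignment problem would have to be solved instead.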

##### Value

A vector containing the point estimate of the bipartite matching. If `lR != Inf` the output might be a partial estimate. A number less than or equal to `n1` in entry `j` indicates the record in datafile 1 to which record `j` in datafile 2 gets linked, a number `n1+j` indicates that record `j` does not get linked to any record in datafile 1, and the value `-1` indicates a 'rejection' to link, meaning that the correct linkage decision is not clear.
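The encoding of the returned vector can be unpacked with a few lines of base R. This is a minimal sketch on a made-up vector, not output of `linkRecords`; the names `Zhat`, `status` and `decisions` are hypothetical.

```r
# Toy point estimate (made-up values) for n2 = 5 records in datafile 2,
# with n1 = 4 records in datafile 1.
n1 <- 4
Zhat <- c(2, 6, -1, 4, 9)

# Check for -1 first: it is also <= n1, so the order of the tests matters.
status <- ifelse(Zhat == -1, "rejected",
          ifelse(Zhat <= n1, "linked", "non-link"))

# One row per record in datafile 2; file1 is NA unless a link was declared.
decisions <- data.frame(file2  = seq_along(Zhat),
                        file1  = ifelse(status == "linked", Zhat, NA),
                        status = status,
                        stringsAsFactors = FALSE)
decisions
```

Here entries 2 and 5 equal `n1+j` (6 and 9), so those records are declared non-links, while entry 3 is a rejection.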

##### References

Mauricio Sadinle (2017). Bayesian Estimation of Bipartite Matchings for Record Linkage. *Journal of the American Statistical Association* 112(518), 600-612.

##### Examples

```
library(BRL)

data(twoFiles)
myCompData <- compareRecords(df1, df2, flds=c("gname", "fname", "age", "occup"),
                             types=c("lv","lv","bi","bi"))
chain <- bipartiteGibbs(myCompData)

## discard first 100 iterations of Gibbs sampler

## full estimate of bipartite matching (full linkage)
fullZhat <- linkRecords(chain$Z[,-c(1:100)], n1=nrow(df1), lFNM=1, lFM1=1, lFM2=2, lR=Inf)

## partial estimate of bipartite matching (partial linkage), where
## lR=0.5, lFNM=1, lFM1=1 mean that we consider not making a decision for
## a record as being half as bad as a false non-match or a false match of type 1
partialZhat <- linkRecords(chain$Z[,-c(1:100)], n1=nrow(df1), lFNM=1, lFM1=1, lFM2=2, lR=.5)

## for which records the decision is not clear according to this set-up of the losses?
undecided <- which(partialZhat == -1)
df2[undecided,]

## compute frequencies of link options observed in the chain
## (lapply over rows always returns a list, even with a single undecided record)
linkOptions <- lapply(undecided, function(j) table(chain$Z[j, -c(1:100)]))
linkOptions <- lapply(linkOptions, sort, decreasing=TRUE)
linkOptionsInds <- lapply(linkOptions, names)
linkOptionsInds <- lapply(linkOptionsInds, as.numeric)
linkOptionsFreqs <- lapply(linkOptions, function(x) as.numeric(x)/sum(x))

## first record without decision
df2[undecided[1],]

## options for this record; row of NAs indicates possibility that record has no match in df1
cbind(df1[linkOptionsInds[[1]],], prob = round(linkOptionsFreqs[[1]],3) )
```

*Documentation reproduced from package BRL, version 0.1.0, License: GPL-3*