linkRecords: Bayes Estimates of Bipartite Matchings

Description

Bayes point estimates of bipartite matchings that can be obtained in closed form according to Theorems 1, 2 and 3 of Sadinle (2017).

Usage

linkRecords(Zchain, n1, lFNM = 1, lFM1 = 1, lFM2 = 2, lR = Inf)

Arguments

Zchain

matrix, such as the $Z element of the output of the function bipartiteGibbs, with n2 rows and nIter columns containing a chain of draws from a posterior distribution on bipartite matchings. Each column indicates the records in datafile 1 to which the records in datafile 2 are matched according to that draw.

n1

number of records in datafile 1.

lFNM

individual loss of a false non-match in the loss functions of Sadinle (2017), default lFNM=1.

lFM1

individual loss of a false match of type 1 in the loss functions of Sadinle (2017), default lFM1=1.

lFM2

individual loss of a false match of type 2 in the loss functions of Sadinle (2017), default lFM2=2.

lR

individual loss of 'rejecting', that is, of declining to make a linkage decision, in the loss functions of Sadinle (2017), default lR=Inf.

Value

A vector containing the point estimate of the bipartite matching. If lR != Inf the output might be a partial estimate. A number smaller than or equal to n1 in entry j indicates the record in datafile 1 to which record j in datafile 2 gets linked, the number n1+j indicates that record j does not get linked to any record in datafile 1, and the value -1 indicates a 'rejection' to link, meaning that the correct linkage decision is not clear.
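The encoding above can be decoded with a few lines of base R. The following is a minimal sketch on made-up values: the vector Zhat and the value of n1 are assumptions chosen for illustration, not output of linkRecords.

```r
# Hypothetical values, for illustration only (not produced by linkRecords)
n1   <- 4                      # assumed number of records in datafile 1
Zhat <- c(2, 6, -1, 4, 9)      # assumed point estimate for n2 = 5 records

# Check for -1 first, because -1 is also smaller than n1
status <- ifelse(Zhat == -1, "rejected",
          ifelse(Zhat <= n1, "linked", "not linked"))

decoded <- data.frame(
  record2 = seq_along(Zhat),                       # record j in datafile 2
  record1 = ifelse(status == "linked", Zhat, NA),  # linked record in datafile 1
  status  = status,
  stringsAsFactors = FALSE
)
print(decoded)
```

Entries equal to n1+j (here 6 for j = 2 and 9 for j = 5) are reported as "not linked", and the entry -1 is reported as "rejected".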

Details

Not all combinations of losses lFNM, lFM1, lFM2, lR are supported. The losses have to be positive numbers and satisfy one of three conditions:

Conditions of Theorem 1 of Sadinle (2017): (lR == Inf) & (lFNM <= lFM1) & (lFNM + lFM1 <= lFM2)
Conditions of Theorem 2 of Sadinle (2017): ((lFM2 >= lFM1) & (lFM1 >= 2*lR)) | ((lFM1 >= lFNM) & (lFM2 >= lFM1 + lFNM))
Conditions of Theorem 3 of Sadinle (2017): (lFM2 >= lFM1) & (lFM1 >= 2*lR) & (lFNM >= 2*lR)

If one of the last two conditions is satisfied, the point estimate might be partial, meaning that there might be some records in datafile 2 for which the point estimate does not include a linkage decision. For combinations of losses not supported here, the linear sum assignment problem outlined by Sadinle (2017) needs to be solved.
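To see which of the conditions a given set of losses satisfies, the three expressions can be evaluated directly. The helper below, checkLossConditions, is a hypothetical function written for this illustration and is not part of the package; it only transcribes the conditions as they are written above, and whether linkRecords checks them internally in exactly this way is an assumption.

```r
# Hypothetical helper (not part of the package): evaluates the three
# documented conditions for a given set of losses.
checkLossConditions <- function(lFNM = 1, lFM1 = 1, lFM2 = 2, lR = Inf) {
  # The losses have to be positive numbers
  stopifnot(all(c(lFNM, lFM1, lFM2, lR) > 0))
  c(
    theorem1 = (lR == Inf) & (lFNM <= lFM1) & (lFNM + lFM1 <= lFM2),
    theorem2 = ((lFM2 >= lFM1) & (lFM1 >= 2*lR)) |
               ((lFM1 >= lFNM) & (lFM2 >= lFM1 + lFNM)),
    theorem3 = (lFM2 >= lFM1) & (lFM1 >= 2*lR) & (lFNM >= 2*lR)
  )
}

# Default losses (full linkage)
checkLossConditions()

# Losses used for partial linkage in the Examples section
checkLossConditions(lFNM = 1, lFM1 = 1, lFM2 = 2, lR = 0.5)

# A combination that satisfies none of the three conditions
checkLossConditions(lFNM = 2, lFM1 = 1, lFM2 = 1.5, lR = Inf)
```

If every element of the returned vector is FALSE, the combination of losses falls outside the closed-form cases described above.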

References

Mauricio Sadinle (2017). Bayesian Estimation of Bipartite Matchings for Record Linkage. Journal of the American Statistical Association 112(518), 600-612.

Examples

# NOT RUN {
data(twoFiles)

myCompData <- compareRecords(df1, df2, flds=c("gname", "fname", "age", "occup"), 
                             types=c("lv","lv","bi","bi"))

chain <- bipartiteGibbs(myCompData)

## discard first 100 iterations of Gibbs sampler

## full estimate of bipartite matching (full linkage)
fullZhat <- linkRecords(chain$Z[,-c(1:100)], n1=nrow(df1), lFNM=1, lFM1=1, lFM2=2, lR=Inf)

## partial estimate of bipartite matching (partial linkage), where 
## lR=0.5, lFNM=1, lFM1=1 mean that we consider not making a decision for 
## a record as being half as bad as a false non-match or a false match of type 1
partialZhat <- linkRecords(chain$Z[,-c(1:100)], n1=nrow(df1), lFNM=1, lFM1=1, lFM2=2, lR=.5)

## for which records is the decision not clear under this set-up of the losses?
undecided <- which(partialZhat == -1)
df2[undecided,]

## compute frequencies of link options observed in the chain 
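## (lapply over the undecided rows always returns a list, even when only one
## record is undecided or when all rows have the same number of link options)
linkOptions <- lapply(undecided, function(j) table(chain$Z[j, -c(1:100)]))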
linkOptions <- lapply(linkOptions, sort, decreasing=TRUE)
linkOptionsInds <- lapply(linkOptions, names)
linkOptionsInds <- lapply(linkOptionsInds, as.numeric)
linkOptionsFreqs <- lapply(linkOptions, function(x) as.numeric(x)/sum(x))

## first record without decision
df2[undecided[1],]

## options for this record; row of NAs indicates possibility that record has no match in df1
cbind(df1[linkOptionsInds[[1]],], prob = round(linkOptionsFreqs[[1]],3) )
# }