optimalThreshold: Optimal Threshold for Record Linkage

Description

Calculates the optimal threshold for weight-based Record Linkage.

Usage

optimalThreshold(rpairs, my = NaN, ny = NaN)
# S4 method for RecLinkData
optimalThreshold(rpairs, my = NaN, ny = NaN)
# S4 method for RLBigData
optimalThreshold(rpairs, my = NaN, ny = NaN)

Arguments

rpairs

Record pairs for which to calculate a threshold.

my

A real value in the range [0,1]. Error bound for false positives.

ny

A real value in the range [0,1]. Error bound for false negatives.

Value

A numeric value, the calculated threshold.

Details

Weights must have been calculated for rpairs, for example by emWeights or epiWeights. The true match status must also be known for rpairs; this is usually provided through the identity argument of compare.*

The following assumes that all record pairs with weights greater than or equal to the threshold are classified as links and the remaining pairs as non-links. If neither my nor ny is given, a threshold that minimizes the absolute number of misclassified record pairs is returned. If my is supplied (ny is ignored in this case), a threshold is picked that maximizes the number of correctly classified links while keeping the ratio of false links to the total number of links less than or equal to my. If ny is supplied, the number of correct non-links is maximized under the condition that the ratio of falsely classified non-links to the total number of non-links does not exceed ny.
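The default criterion (minimizing the number of misclassified pairs) can be illustrated with a brute-force search in base R. This is only a sketch and is not taken from the package source: sketch_threshold, the toy weights and the toy match labels are all hypothetical, and the search only considers the observed weights as candidate thresholds, which the real function may not do.

```r
# Hypothetical helper, not part of RecordLinkage: try every observed weight as
# a threshold and keep the one with the fewest misclassified pairs.
sketch_threshold <- function(weights, is_match) {
  stopifnot(length(weights) == length(is_match))
  candidates <- sort(unique(weights))
  errors <- vapply(candidates, function(t) {
    is_link <- weights >= t                    # pairs at or above t are links
    false_links <- sum(is_link & !is_match)    # non-matches classified as links
    false_nonlinks <- sum(!is_link & is_match) # matches classified as non-links
    false_links + false_nonlinks
  }, integer(1))
  best <- which.min(errors)                    # first minimum if there are ties
  list(threshold = candidates[best], errors = errors[best])
}

# Toy data (made up for illustration): one weight and one true status per pair.
weights  <- c(0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9)
is_match <- c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)

res <- sketch_threshold(weights, is_match)
print(res)
```

With these toy values the search settles on the threshold that leaves a single low-weight match below it, since any lower threshold would admit more false links than it recovers.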

Two separate runs of optimalThreshold, one with a value for my and one with a value for ny, yield a lower and an upper threshold for a three-way classification approach (yielding links, non-links and possible links).

Examples

library(RecordLinkage)

# create record pairs, comparing strings with Levenshtein similarity
data(RLdata500)
p <- compare.dedup(RLdata500, identity = identity.RLdata500, strcmp = TRUE,
  strcmpfun = levenshteinSim)

# calculate weights
p <- epiWeights(p)

# split record pairs into a training and a validation set
l <- splitData(dataset = p, prop = 0.5, keep.mprop = TRUE)

# get threshold from training set
threshold <- optimalThreshold(l$train)

# classify remaining data
summary(epiClassify(l$valid, threshold))
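
The three-way classification mentioned in Details could look roughly like the following sketch. The error bounds of 0.05 are arbitrary illustration values, and the code assumes that the run with my yields the higher of the two thresholds; since that is not guaranteed for every data set, the classification step is guarded.

```r
library(RecordLinkage)

# Same preparation as in the example above.
data(RLdata500)
p <- compare.dedup(RLdata500, identity = identity.RLdata500, strcmp = TRUE,
  strcmpfun = levenshteinSim)
p <- epiWeights(p)
l <- splitData(dataset = p, prop = 0.5, keep.mprop = TRUE)

# Two separate runs; 0.05 is an arbitrary bound chosen for illustration.
upper <- optimalThreshold(l$train, my = 0.05)  # bounds the share of false links
lower <- optimalThreshold(l$train, ny = 0.05)  # bounds the share of false non-links

# Assumption: upper >= lower. Pairs between the two thresholds become
# possible links; skip classification if the assumption does not hold.
if (upper >= lower) {
  result <- epiClassify(l$valid, threshold.upper = upper,
    threshold.lower = lower)
  summary(result)
}
```

Tighter bounds widen the gap between the two thresholds and therefore send more pairs to the possible-links group for manual review.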
