optimalThreshold: Optimal Threshold for Record Linkage

Description

Calculates the optimal threshold for weight-based Record Linkage.

Usage

optimalThreshold(rpairs, my = NaN, ny = NaN)

Arguments

rpairs

RecLinkData object. Record pairs for which to calculate a threshold.

my

A real value in the range [0,1]. Error bound for false positives.

ny

A real value in the range [0,1]. Error bound for false negatives.

Value

A numeric value, the calculated threshold.

Details

rpairs must contain weights in rpairs$Wdata, calculated by a suitable function such as emWeights or epiWeights. The true match status must be known for rpairs.

In the following, all record pairs with weights greater than or equal to the threshold are assumed to be classified as links and the remaining pairs as non-links.

If no further arguments are given, the threshold that minimizes the absolute number of misclassified record pairs is returned. If my is supplied (ny is ignored in this case), the threshold is chosen to maximize the number of correctly classified links while keeping the ratio of false links to the total number of links less than or equal to my. If ny is supplied, the number of correct non-links is maximized under the condition that the ratio of falsely classified non-links to the total number of non-links does not exceed ny.

Two separate runs of optimalThreshold, one with a value for my and one with a value for ny, yield a lower and an upper threshold for a three-way classification approach (links, non-links and possible links).
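The default behavior described above (no my or ny given) can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper min_error_threshold and the toy weights and match labels are hypothetical and exist only for illustration. The sketch assumes that only observed weights are tried as candidate thresholds and that ties are resolved in favor of the lowest candidate, so the real function may return a different value on the same data.

```r
# Hypothetical helper, not part of RecordLinkage: return the candidate
# threshold that minimizes the number of misclassified pairs, where pairs
# with weight >= threshold are classified as links.
min_error_threshold <- function(weights, is_match) {
  # Assumption: only observed weights are tried as thresholds.
  candidates <- sort(unique(weights))
  errors <- vapply(candidates, function(t) {
    predicted_link <- weights >= t
    # false links plus false non-links
    sum(predicted_link != is_match)
  }, integer(1))
  # which.min() returns the first minimum, i.e. the lowest such threshold
  candidates[which.min(errors)]
}

# Toy data, invented for illustration
weights  <- c(0.10, 0.20, 0.35, 0.40, 0.60, 0.75, 0.90)
is_match <- c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)

threshold <- min_error_threshold(weights, is_match)
n_errors  <- sum((weights >= threshold) != is_match)
```

With real data, the weights would come from rpairs$Wdata and the match status from the known match result of the record pairs.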

Examples


library(RecordLinkage)

# create record pairs
data(RLdata500)
p <- compare.dedup(RLdata500, identity = identity.RLdata500, strcmp = TRUE,
  strcmpfun = levenshteinSim)

# calculate weights
p <- epiWeights(p)

# split record pairs into two sets
l <- splitData(dataset = p, prop = 0.5, keep.mprop = TRUE)

# get threshold from training set
threshold <- optimalThreshold(l$train)

# classify remaining data
summary(epiClassify(l$valid, threshold))
