RecordLinkage (version 0.2-0)

getParetoThreshold: Estimate Threshold from Pareto Distribution

Description

Calculates a classification threshold based on a generalized Pareto distribution (GPD) fitted to the weights of the given data pairs.

Usage

getParetoThreshold(rpairs, quantil = 0.95, interval = NA)

Arguments

rpairs
A RecLinkData object with weights. The data for which to compute a threshold.
quantil
A real number between 0 and 1. The quantile to compute.
interval
A numeric vector denoting the interval on which to fit a GPD.

Value

  • The resulting threshold.

Details

This threshold calculation assumes that weights in the 'middle' range (the region of 'possible links' in classical record linkage) form a 'fat tail' that can be fitted by a generalized Pareto distribution (GPD). The limits of the fitting interval are usually determined by reviewing a mean residual life (MRL) plot of the data. If interval is not supplied, an MRL plot is displayed from which the endpoints can be selected by mouse input. If only one endpoint is selected or supplied, the upper endpoint is set to the maximum weight. A suitable interval is characterized by a relatively long, approximately linear segment of the plot. For weights computed by emWeights, it is usually located around 0; for weights computed by epiWeights, between 0.5 and 1.
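To make the mean residual life plot more concrete, the following is a minimal sketch in base R of the quantity such a plot shows: for each candidate threshold u, the mean of the excesses w - u over all weights w greater than u. The helper mean_excess and the simulated weights are assumptions for illustration only; they are not part of RecordLinkage and do not reproduce how getParetoThreshold draws its plot.

```r
# Hypothetical helper (not part of RecordLinkage): empirical mean excess
# e(u) = mean(w - u) over all w > u, evaluated for each threshold in u.
mean_excess <- function(w, u) {
  vapply(u, function(ui) {
    exceed <- w[w > ui]
    if (length(exceed) == 0) NA_real_ else mean(exceed - ui)
  }, numeric(1))
}

set.seed(1)
# Simulated stand-in for record pair weights (assumption); real weights
# would come from emWeights or epiWeights.
w <- rexp(2000, rate = 5)

# Grid of candidate thresholds, stopping below the largest weights so that
# every threshold still has observations above it.
u <- seq(min(w), unname(quantile(w, 0.99)), length.out = 50)
mrl <- mean_excess(w, u)

# An approximately linear stretch of this curve suggests a range of
# thresholds above which a GPD is a plausible model for the excesses.
plot(u, mrl, type = "l", xlab = "Threshold u", ylab = "Mean excess")
```

For data whose excesses follow a GPD, the mean excess is a linear function of the threshold, which is why a long linear segment is used as the selection criterion.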

References

Sariyar et al.: Bestimmung der False Match-Rate im Fellegi-Sunter-Modell mittels verallgemeinerte Paretoverteilung, Presentation for 54. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS).

See Also

emWeights and epiWeights for calculating weights, emClassify and epiClassify for classifying with the returned threshold.

Examples

data(RLdata500)
# build comparison patterns for deduplication, using string comparison
rpairs <- compare.dedup(RLdata500, identity = identity.RLdata500,
                        strcmp = TRUE, blockfld = list(1, 3, 5, 6, 7))
# compute weights for the record pairs
rpairs <- epiWeights(rpairs)
# leave out argument interval to choose from plot
threshold <- getParetoThreshold(rpairs, interval = c(0.68, 0.79))
summary(epiClassify(rpairs, threshold))