RecordLinkage (version 0.4-12.4)

getParetoThreshold: Estimate Threshold from Pareto Distribution

Description

Calculates a classification threshold based on a generalized Pareto distribution (GPD) fitted to the weights distribution of the given data pairs.
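
This page does not state how the fitted GPD is turned into a threshold. For orientation only, the standard peaks-over-threshold quantile estimator from extreme value theory is sketched below; whether the package uses exactly this form is an assumption. The notation is introduced here and is not from the package: u is the lower endpoint of the fitting interval, n the number of weights, k the number of weights above u, sigma-hat and xi-hat the fitted GPD scale and shape, and p the requested quantile (the role played by quantil).

```latex
\hat{x}_p \;=\; u + \frac{\hat{\sigma}}{\hat{\xi}}
  \left[ \left( \frac{n}{k}\,(1 - p) \right)^{-\hat{\xi}} - 1 \right],
\qquad \hat{\xi} \neq 0, \quad p > 1 - \frac{k}{n}.
```

Under this reading, a larger quantil moves the threshold further into the tail of the weight distribution.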

Usage

getParetoThreshold(rpairs, quantil = 0.95, interval = NA)
# S4 method for RecLinkData
getParetoThreshold(rpairs, quantil = 0.95, interval = NA)
# S4 method for RLBigData
getParetoThreshold(rpairs, quantil = 0.95, interval = NA)

Value

A classification threshold.

Arguments

rpairs

A "RecLinkData" or "RLBigData" object with weights: the data for which to compute a threshold.

quantil

A real number between 0 and 1. The quantile to compute.

interval

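A numeric vector denoting the interval on which to fit a GPD. If left at the default NA, the interval is chosen interactively from a mean residual life plot (see Details).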

Author

Andreas Borg, Murat Sariyar

Details

This threshold calculation is based on the assumption that the distribution of weights exhibits a 'fat tail' which can be fitted by a generalized Pareto distribution (GPD). The limits of the interval to which the GPD is fitted are usually determined by reviewing a mean residual life (MRL) plot of the data. If the limits are not supplied by the caller, an MRL plot is displayed from which the endpoints can be selected by mouse input. If only one endpoint is selected or supplied, the greater endpoint is set to the maximum weight. A suitable interval is characterized by a relatively long, approximately linear segment of the plot.
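
The linearity criterion reflects a general property of the GPD: its mean excess over a threshold is a linear function of that threshold. The sketch below illustrates what a mean residual life curve measures, using base R only. It is not the package's implementation: the weights are simulated from an exponential distribution (a GPD with shape 0, so the curve should be roughly flat), and the helper mean_residual_life is a hypothetical name introduced for this illustration.

```r
# Simulated stand-in for record-pair weights; NOT produced by RecordLinkage.
set.seed(42)
w <- rexp(2000, rate = 5)

# Empirical mean residual life: the mean excess over u among values above u.
# (Hypothetical helper, introduced for this illustration only.)
mean_residual_life <- function(w, u) {
  vapply(u, function(t) mean(w[w > t] - t), numeric(1))
}

# Candidate lower endpoints: the 50% to 95% sample quantiles.
u <- unname(quantile(w, probs = seq(0.50, 0.95, by = 0.05)))
mrl <- mean_residual_life(w, u)

# Slope of a least-squares line through the curve, as a rough linearity check.
slope <- unname(coef(lm(mrl ~ u))[2])

# Draw the curve only in an interactive session.
if (interactive()) {
  plot(u, mrl, type = "b", xlab = "threshold u", ylab = "mean excess")
}
```

On real weights, one would look for the range of u over which such a curve is close to a straight line and pass its endpoints as interval.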

References

Sariyar M., Borg A. and Pommerening M.: Controlling false match rates in record linkage using extreme value theory. Journal of Biomedical Informatics, doi:10.1016/j.jbi.2011.02.008.

See Also

emWeights and epiWeights for calculating weights, emClassify and epiClassify for classifying with the returned threshold.

Examples

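  library(RecordLinkage)
  data(RLdata500)
  # build comparison pairs for deduplication, using string comparison
  # and several blocking fields
  rpairs <- compare.dedup(RLdata500, identity = identity.RLdata500, strcmp = TRUE,
    blockfld = list(1, 3, 5:7))
  # compute weights for the record pairs
  rpairs <- epiWeights(rpairs)
  # leave out argument interval to choose from plot
  # (the two calls below are wrapped in if (FALSE) and are not run by default)
  if (FALSE) threshold <- getParetoThreshold(rpairs, interval = c(0.68, 0.79))
  if (FALSE) summary(epiClassify(rpairs, threshold))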
