RecordLinkage (version 0.2-0)

emClassify: Weight-based Classification of Data Pairs

Description

Classifies data pairs that have been assigned weights by emWeights, using either user-defined thresholds or bounds on the estimated error rates.

Usage

emClassify(rpairs, threshold.upper = Inf, 
  threshold.lower = threshold.upper, my = Inf, ny = Inf)

Arguments

rpairs
RecLinkData object with weight information.
my
A probability. Error bound for false positives.
ny
A probability. Error bound for false negatives.
threshold.upper
A numeric value. Threshold for links.
threshold.lower
A numeric value. Threshold for possible links.
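As a minimal sketch of how the two thresholds partition the weights (following the rule stated under Details), the base R snippet below classifies a plain numeric vector. The helper classify_by_threshold and the example weights are hypothetical and not part of RecordLinkage; emClassify itself operates on a RecLinkData object, and the order of the factor levels here is simply a choice made for the sketch.

```r
# Hypothetical helper, not part of RecordLinkage: applies the threshold rule
# to a plain numeric vector of weights. The defaults mimic the documented
# fallback of a single threshold of 0.
classify_by_threshold <- function(w, threshold.upper = 0,
                                  threshold.lower = threshold.upper) {
  stopifnot(threshold.lower <= threshold.upper)
  pred <- rep("N", length(w))          # default: non-link
  pred[w >= threshold.lower] <- "P"    # at least a possible link
  pred[w >= threshold.upper] <- "L"    # link
  factor(pred, levels = c("N", "P", "L"))
}

weights <- c(-3.2, 0.5, 4.1, 7.8, 10)  # illustrative weights
pred <- classify_by_threshold(weights,
                              threshold.upper = 7.8,
                              threshold.lower = 0.5)
print(pred)                            # N P P L L
```

When only threshold.upper is passed, the two thresholds coincide and no pair is labelled "P".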

Value

A RecLinkResult object containing all fields of rpairs and a factor prediction, whose elements correspond to the pairs in rpairs$pairs. "L" represents a link, "N" a non-link and "P" a possible link.

Details

Two general approaches are implemented for classification. The classical procedure by Fellegi and Sunter (see References) minimizes the number of possible links for given error levels for false links (my) and false non-links (ny). The second approach requires thresholds for links and possible links to be set by the user. A pair with weight $w$ is classified as a link if $w\geq \textit{threshold.upper}$, as a possible link if $\textit{threshold.upper}> w\geq \textit{threshold.lower}$ and as a non-link if $w<\textit{threshold.lower}$.

If threshold.upper or threshold.lower is given, the threshold-based approach is used; otherwise, if one of the error bounds is given, the Fellegi-Sunter model is used. If only my is supplied, links are chosen to meet the error bound and all other pairs are classified as non-links (the analogous case holds if only ny is supplied). If no arguments other than rpairs are given, a single threshold of 0 is used.
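The error-bound idea can be illustrated with a simplified, deterministic sketch in base R. It assumes that m- and u-probabilities are known per comparison pattern, ranks patterns by weight, and grows the link region from the top and the non-link region from the bottom for as long as the cumulative error stays within my and ny. The function fs_classify and all numbers are hypothetical; this is not the code used by emClassify, and it omits the randomized boundary decisions of the original Fellegi-Sunter procedure.

```r
# Hypothetical sketch, not the RecordLinkage implementation.
# m[i] = P(pattern i | true match), u[i] = P(pattern i | true non-match).
fs_classify <- function(m, u, my, ny) {
  w <- log2(m / u)                    # matching weight of each pattern
  o <- order(w, decreasing = TRUE)    # pattern indices, highest weight first
  fp <- cumsum(u[o])                  # false-link rate if the top k patterns are links
  fn <- rev(cumsum(rev(m[o])))        # false-non-link rate if ranks k..n are non-links
  n_links <- sum(fp <= my)            # largest link region within the bound my
  n_nonlinks <- sum(fn <= ny)         # largest non-link region within the bound ny
  # The sketch assumes the two regions do not overlap.
  stopifnot(n_links + n_nonlinks <= length(w))
  pred <- rep("P", length(w))         # everything in between is a possible link
  pred[o[seq_len(n_links)]] <- "L"
  pred[rev(o)[seq_len(n_nonlinks)]] <- "N"
  factor(pred, levels = c("N", "P", "L"))
}

# Illustrative probabilities for five comparison patterns (each sums to 1).
m <- c(0.70, 0.20, 0.06, 0.03, 0.01)
u <- c(0.001, 0.009, 0.04, 0.15, 0.80)

fs_pred <- fs_classify(m, u, my = 0.02, ny = 0.05)
print(fs_pred)                        # L L P N N
```

Tightening either bound shrinks the corresponding region and leaves more patterns as possible links.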

References

Ivan P. Fellegi, Alan B. Sunter: A Theory for Record Linkage, in: Journal of the American Statistical Association, Vol. 64, No. 328 (Dec. 1969), pp. 1183-1210.

See Also

getPairs to produce output from which thresholds can be determined conveniently.