emClassify: Weight-based Classification of Data Pairs
Description
Classifies data pairs to which weights were assigned by emWeights
based on user-defined thresholds or estimated error rates.
Usage
emClassify(rpairs, threshold.upper = Inf,
threshold.lower = threshold.upper, my = Inf, ny = Inf)
Arguments
my
A probability. Error bound for false positives.
ny
A probability. Error bound for false negatives.
threshold.upper
A numeric value. Threshold for links.
threshold.lower
A numeric value. Threshold for possible links.
Value
A RecLinkResult object containing all fields of rpairs
and, in addition, a factor prediction that holds the classification of
each record pair in rpairs$pairs. "L" represents a link, "N" a non-link and
"P" a possible link.
Details
Two general approaches are implemented for classification. The classical procedure
by Fellegi and Sunter (see references) minimizes the number of
possible links with given error levels for false links (my) and
false non-links (ny).
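As a rough illustration of the classical procedure described above, the following base R sketch applies the Fellegi-Sunter decision rule to a handful of comparison patterns. The helper fs_classify and the vectors m and u (assumed probabilities of each pattern among true matches and true non-matches) are hypothetical names introduced here; this is a simplified sketch and not the code used by emClassify, which works on the weights and estimates produced by emWeights.

```r
# Hypothetical helper for illustration; not the RecordLinkage implementation.
# m[i]: assumed probability of comparison pattern i among true matches
# u[i]: assumed probability of comparison pattern i among true non-matches
fs_classify <- function(m, u, my, ny) {
  w   <- log2(m / u)                    # matching weight of each pattern
  ord <- order(w, decreasing = TRUE)    # highest weights first

  # False-link rate if the top k patterns are declared links
  cum_u <- cumsum(u[ord])
  # False non-link rate if pattern k and everything below it are non-links
  cum_m <- rev(cumsum(rev(m[ord])))

  pred <- rep("P", length(w))           # everything else stays a possible link
  pred[ord[cum_m <= ny]] <- "N"
  pred[ord[cum_u <= my]] <- "L"         # link wins on overlap (arbitrary choice)
  factor(pred, levels = c("N", "P", "L"))
}

m <- c(0.60, 0.30, 0.08, 0.02)
u <- c(0.01, 0.04, 0.25, 0.70)
fs_classify(m, u, my = 0.02, ny = 0.05)
# patterns 1 to 4 are classified as L, P, P, N
```

Tightening my or ny shrinks the sets of links and non-links, so more patterns are left as possible links.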
The second approach requires thresholds for links and possible links to be set
by the user. A pair with weight $w$ is classified as a link if
$w\geq \textit{threshold.upper}$, as a possible link if
$\textit{threshold.upper}> w\geq \textit{threshold.lower}$ and as a non-link if $w<\textit{threshold.lower}$.
If threshold.upper or threshold.lower is given, the
threshold-based approach is used, otherwise, if one of the error bounds is
given, the Fellegi-Sunter model. If only my is supplied, links are
chosen to meet the error bound and all other pairs are classified as non-links
(the equivalent case holds if only ny is specified). If no further arguments
than rpairs are given, a single threshold of 0 is used.
References
Ivan P. Fellegi, Alan B. Sunter: A Theory for Record Linkage,
in: Journal of the American Statistical Association Vol. 64, No. 328
(Dec., 1969), pp. 1183–1210.
See Also
getPairs to produce output from which thresholds can
be determined conveniently.
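Examples
The threshold-based rule from the Details section can be tried out without any linkage data. The base R sketch below uses a hypothetical helper, classify_by_threshold, that mimics the documented rule on a plain numeric vector of weights; it is not part of the package. The level order of the resulting factor and the default threshold of 0 in the helper's signature are assumptions made for this illustration.

```r
# Hypothetical helper mirroring the documented rule; not part of RecordLinkage.
classify_by_threshold <- function(w, threshold.upper = 0,
                                  threshold.lower = threshold.upper) {
  stopifnot(threshold.lower <= threshold.upper)
  pred <- ifelse(w >= threshold.upper, "L",        # link
          ifelse(w >= threshold.lower, "P", "N"))  # possible link / non-link
  factor(pred, levels = c("N", "P", "L"))
}

weights <- c(-4.2, 1.5, 6.0, 12.3)

# Two thresholds: weights between 0 and 10 become possible links
classify_by_threshold(weights, threshold.upper = 10, threshold.lower = 0)
# classified as N, P, P, L

# Single threshold of 0: no possible links
classify_by_threshold(weights)
# classified as N, L, L, L
```

With real data, the weights would come from an object processed by emWeights, and emClassify would be called on that object instead of the helper.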