emClassify: Weight-based Classification of Data Pairs

Description

Classifies data pairs to which weights were assigned by emWeights. The classification is based on user-defined thresholds or predefined error rates.

Usage

emClassify(rpairs, threshold.upper = Inf,
    threshold.lower = threshold.upper, my = Inf, ny = Inf, ...)
  # S4 method for RecLinkData,ANY,ANY
emClassify(rpairs, threshold.upper = Inf,
    threshold.lower = threshold.upper, my = Inf, ny = Inf)
  # S4 method for RLBigData,ANY,ANY
emClassify(rpairs, threshold.upper = Inf,
    threshold.lower = threshold.upper, my = Inf, ny = Inf,
    withProgressBar = (sink.number()==0))

Arguments

rpairs

RecLinkData or RLBigData object with weight information.

my

A probability. Error bound for false positives.

ny

A probability. Error bound for false negatives.

threshold.upper

A numeric value. Threshold for links.

threshold.lower

A numeric value. Threshold for possible links.

withProgressBar

Whether to display a progress bar.

...

Placeholder for method-specific arguments.

Value

For the "RecLinkData" method, an S3 object of class "RecLinkResult" that represents a copy of rpairs with an additional element prediction, which stores the classification result.

For the "RLBigData" method, an S4 object of class "RLResult".

Details

Two general approaches are implemented. The classical procedure by Fellegi and Sunter (see references) minimizes the number of possible links with given error levels for false links (my) and false non-links (ny).

The second approach requires thresholds for links and possible links to be set by the user. A pair with weight $w$ is classified as a link if $w\geq \textit{threshold.upper}$, as a possible link if $\textit{threshold.upper}> w\geq \textit{threshold.lower}$ and as a non-link if $w<\textit{threshold.lower}$.
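The threshold rule can be illustrated with a minimal base R sketch. The helper classify_by_threshold is hypothetical and not part of the package; it only mirrors the rule described above on a plain numeric vector, assuming that a weight exactly equal to threshold.upper counts as a link and that the result uses the labels "N", "P" and "L".

```r
# Hypothetical helper (not part of the package): applies the threshold rule
# to a plain numeric vector of weights.
classify_by_threshold <- function(w, threshold.upper,
                                  threshold.lower = threshold.upper) {
  stopifnot(threshold.lower <= threshold.upper)
  pred <- rep("N", length(w))          # default: non-link
  pred[w >= threshold.lower] <- "P"    # at or above the lower threshold
  pred[w >= threshold.upper] <- "L"    # at or above the upper threshold
  factor(pred, levels = c("N", "P", "L"))
}

# Illustrative weights and thresholds, chosen arbitrarily
weights <- c(-3.2, 4.9, 5, 7.5, 10, 12.1)
pred <- classify_by_threshold(weights, threshold.upper = 10,
                              threshold.lower = 5)
print(data.frame(weight = weights, prediction = pred))
```

With a single threshold (threshold.lower left at its default), no pair falls between the two thresholds, so no possible links are produced.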

If threshold.upper or threshold.lower is given, the threshold-based approach is used. Otherwise, if at least one of the error bounds is given, the Fellegi-Sunter model is applied. If only my is supplied, links are chosen to meet that error bound and all other pairs are classified as non-links (the analogous case holds if only ny is specified). If no arguments other than rpairs are given, a single threshold of 0 is used.
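A typical call sequence might look like the following sketch. It assumes that the RecordLinkage package with its example data RLdata500 is installed; the blocking fields, the thresholds 10 and 5 and the error bound 0.05 are arbitrary illustrative values, not recommendations. The code is wrapped in a requireNamespace check so that it does nothing when the package is unavailable.

```r
if (requireNamespace("RecordLinkage", quietly = TRUE)) {
  library(RecordLinkage)

  # Generate record pairs from the example data shipped with the package
  data(RLdata500)
  rpairs <- compare.dedup(RLdata500, identity = identity.RLdata500,
                          blockfld = list(1, 3, 5, 6, 7))

  # Assign weights; emClassify requires them
  rpairs <- emWeights(rpairs)

  # Threshold-based approach: links, possible links and non-links
  result <- emClassify(rpairs, threshold.upper = 10, threshold.lower = 5)
  print(table(result$prediction))

  # Error-bound approach: only the bound for false positives is supplied
  result_my <- emClassify(rpairs, my = 0.05)
  print(table(result_my$prediction))
}
```

Calling summary() on the returned object gives a fuller overview of the classification.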

References