emWeights: Calculate weights

Description

Calculates weights for Record Linkage based on an EM algorithm.

Usage

emWeights(rpairs, cutoff = 0.95, ...)

  ## S3 method for class 'RecLinkData':
emWeights(rpairs, cutoff = 0.95, ...)

  ## S4 method for signature 'RLBigData':
emWeights(rpairs, cutoff = 0.95, store.weights = TRUE,
  verbose = TRUE, ...)

Arguments

rpairs

The record pairs for which to compute weights. See details.

cutoff

Cutoff value for the string comparator. Either a single numeric value in the range [0,1] or a numeric vector with one value per attribute in the data.

store.weights

Logical value. Whether to store individual weights in the database.

verbose

Logical. Whether to print progress messages.

...

Additional arguments passed to mygllm.

Value

A copy of rpairs with the weights attached. See the class documentation ("RecLinkData", "RLBigDataDedup" and "RLBigDataLinkage") on how weights are stored.
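
As a rough illustration of the return value, the following sketch runs emWeights on the RLdata500 example data shipped with the RecordLinkage package. The blocking fields and the choice of levenshteinSim with cutoff = 0.7 are illustrative only, and the component name Wdata is assumed from the "RecLinkData" class documentation; check that documentation for the package version in use.

```r
# Sketch only: needs the RecordLinkage package. The guard skips the
# example when the package is not installed.
if (requireNamespace("RecordLinkage", quietly = TRUE)) {
  library(RecordLinkage)
  data(RLdata500)

  # Build deduplication pairs with a string comparator
  # (the blocking fields are an illustrative choice).
  rpairs <- compare.dedup(RLdata500,
                          blockfld = list(1, 3, 5, 6, 7),
                          strcmp = TRUE, strcmpfun = levenshteinSim)

  # A lower cutoff is used because the comparator is levenshteinSim.
  rpairs <- emWeights(rpairs, cutoff = 0.7)

  # Assumption: the per-pair weights are stored in rpairs$Wdata.
  print(summary(rpairs$Wdata))
}
```

The result is still a record pair object, so it can be passed on to emClassify.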

Side effects

The "RLBigData" method writes weights to the database belonging to the rpairs object.

Details

Since package version 0.3, this is a generic function with methods for S3 objects of class "RecLinkData" as well as S4 objects of classes "RLBigDataDedup" and "RLBigDataLinkage".

The weight of a record pair is calculated as $\log_{2}\frac{M}{U}$, where $M$ and $U$ are the estimated m- and u-probabilities for the pair's comparison pattern. If a string comparator is used, weights are first calculated from a binary table in which all comparison values greater than or equal to cutoff are set to one and all others to zero. The resulting weight is then adjusted by adding, for every pair, $\log_{2}\left(\prod_{j:s^{i}_{j}\geq \textit{cutoff}}s^{i}_{j}\right)$, where $s^{i}_{j}$ is the value of the string metric for attribute $j$ in data pair $i$.

The appropriate value of cutoff depends on the choice of string comparator. The default is tuned for jarowinkler; a lower value (e.g. 0.7) is recommended for levenshteinSim.

Estimation of $M$ and $U$ is done by an EM algorithm, implemented by mygllm. For every comparison pattern, the estimated numbers of matches and non-matches are used to compute the corresponding probabilities. Estimates based on the average frequencies of values and given error rates are taken as initial values. In our experience, this increases the stability and performance of the EM algorithm.

The "RLBigData" method writes the individual weight for every record pair into the database if called with store.weights = TRUE. This speeds up subsequent calls of the classification function emClassify and is generally recommended if several classification calls are to be made (e.g. for testing different thresholds). However, if a very large number of record pairs is processed, saving individual weights can lead to excessive disk usage; in this case store.weights = FALSE may be a better choice. Subsequent calls to emClassify will then calculate individual weights on the fly during classification without saving them.

Some progress messages are printed to the message stream (see message) if verbose == TRUE. This includes progress bars, but these are suppressed if output is diverted by sink, to avoid cluttering the output file.
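
The weight formula and string comparator adjustment described in the details can be traced with a few lines of base R. This is a minimal sketch, not the package's implementation: the similarity values in s and the M and U values are hypothetical numbers chosen for illustration, whereas emWeights estimates M and U per comparison pattern with mygllm.

```r
# Hypothetical string metric values s[i, j] for 3 pairs and 3 attributes.
s <- rbind(c(1.00, 0.97, 0.40),
           c(0.96, 0.99, 1.00),
           c(0.20, 0.50, 0.96))
cutoff <- 0.95

# Step 1: binary comparison patterns (1 if value >= cutoff, else 0).
pattern <- (s >= cutoff) * 1

# Hypothetical m- and u-probabilities for each pair's pattern.
# In emWeights these come from the EM estimation, not from the user.
M <- c(0.30, 0.60, 0.01)
U <- c(0.02, 0.001, 0.20)

# Step 2: weight from the binary table.
w_binary <- log2(M / U)

# Step 3: adjustment, log2 of the product of all string values that
# reach the cutoff (an empty product is 1, so the adjustment is 0).
adjust <- apply(s, 1, function(row) log2(prod(row[row >= cutoff])))

w <- w_binary + adjust
print(cbind(pattern, w_binary, adjust, w))
```

Because string metric values lie in [0,1], the adjustment is never positive: a pair that agrees only approximately on an attribute ends up with a slightly lower weight than a pair that agrees exactly.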

References

William E. Winkler: Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage, in: Proceedings of the Section on Survey Research Methods, American Statistical Association 1988, pp. 667-671.