RecordLinkage (version 0.2-0)

emWeights: Calculate weights

Description

Calculates weights for Record Linkage based on the EM algorithm.

Usage

emWeights(rpairs, cutoff = 0.95, ...)

Arguments

rpairs
RecLinkData object. The record pairs for which to compute weights.
cutoff
Numeric value between 0 and 1. Cutoff value for the string comparator: comparison values greater than or equal to cutoff are treated as agreement.
...
Additional arguments for mygllm.

Value

A RecLinkData object containing all components of rpairs plus the following:

  • M: Estimated m-probabilities
  • U: Estimated u-probabilities
  • W, Wdata: Calculated weights

M, U and W correspond to a list of all binary comparison patterns, sorted ascending from all zeroes to all ones. Wdata corresponds directly to the record pairs in rpairs$pairs.
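As an illustration of how the returned components might be inspected, the following sketch assumes that the RecordLinkage package is installed and uses its bundled RLdata500 data set. The blocking fields and the use of jarowinkler are arbitrary choices made for this example, not recommendations from this page.

```r
# Assumption: the RecordLinkage package and its RLdata500 example data are installed.
library(RecordLinkage)
data(RLdata500)

# Build record pairs with string comparison
# (blocking on columns 1 and 3 is an arbitrary choice for this sketch).
rpairs <- compare.dedup(RLdata500, blockfld = list(1, 3),
                        strcmp = TRUE, strcmpfun = jarowinkler)

# Compute EM-based weights; 0.95 is the default cutoff.
rpairs <- emWeights(rpairs, cutoff = 0.95)

# M, U and W hold one entry per binary comparison pattern.
length(rpairs$M)
length(rpairs$W)

# Wdata holds one weight per row of rpairs$pairs.
length(rpairs$Wdata) == nrow(rpairs$pairs)
summary(rpairs$Wdata)
```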

Details

The weight of a record pair is calculated as $\log_{2}\frac{M}{U}$, where $M$ and $U$ are the estimated m- and u-probabilities for the comparison pattern of that pair.

If a string comparator is used, weights are first calculated from a binary table in which all comparison values greater than or equal to cutoff are set to one and all others to zero. The resulting weight is then adjusted by adding, for every pair, $\log_{2}\left(\prod_{j:s^{i}_{j}\geq \textit{cutoff}}s^{i}_{j}\right)$, where $s^{i}_{j}$ is the value of the string metric for attribute j in data pair i.

The appropriate value of cutoff depends on the choice of string comparator. The default is tuned for jarowinkler; a lower value (e.g. 0.7) is recommended for levenshteinSim.

Estimation of $M$ and $U$ is done by an EM algorithm, implemented by mygllm. For every comparison pattern, the estimated numbers of matches and non-matches are used to compute the corresponding probabilities. Initial values are estimates based on the average frequencies of values and on given error rates. In our experience, this increases the stability and performance of the EM algorithm.
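The weight formula and the string comparator adjustment can be traced with a few lines of base R. This is only a sketch of the arithmetic described above, not the package's implementation: pair_weight is a hypothetical helper, and the similarity values and the m- and u-probabilities are made-up numbers rather than EM estimates.

```r
# Hypothetical helper, not part of RecordLinkage: computes the weight of one
# record pair from its string similarities and the (assumed known) m- and
# u-probabilities of its binary comparison pattern.
pair_weight <- function(s, m_prob, u_prob, cutoff = 0.95) {
  agree <- s >= cutoff                  # attributes counted as agreement
  pattern <- as.integer(agree)          # binary comparison pattern
  base_weight <- log2(m_prob / u_prob)  # weight of the binary pattern
  adjustment <- log2(prod(s[agree]))    # string comparator adjustment
  list(pattern = pattern, weight = base_weight + adjustment)
}

# Illustrative string similarities for three attributes of one record pair.
s <- c(0.97, 1.00, 0.60)

# m_prob and u_prob are made-up values for the pattern (1, 1, 0).
res <- pair_weight(s, m_prob = 0.8, u_prob = 0.001)
res$pattern
res$weight
```

Because each similarity value is at most 1, the adjustment is never positive: a pair that only just reaches the cutoff receives a slightly lower weight than a pair with exact agreement on the same attributes.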

References

William E. Winkler: Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage, in: Proceedings of the Section on Survey Research Methods, American Statistical Association 1988, pp. 667–671.

See Also

emClassify for classification of weighted pairs.