emWeights(rpairs, cutoff = 0.95, ...)
## S3 method for class 'RecLinkData':
emWeights(rpairs, cutoff = 0.95, ...)
## S3 method for class 'RLBigData':
emWeights(rpairs, cutoff = 0.95, store.weights = TRUE,
  verbose = TRUE, ...)
Additional arguments given in ... are passed to mygllm.
rpairs with the weights attached. See the class documentation
("RecLinkData", "RLBigDataDedup" and "RLBigDataLinkage") on how weights are
stored. The "RLBigData" method writes weights to the database belonging to
the object.
emWeights has methods for S3 objects of class "RecLinkData" as well as S4
objects of classes "RLBigDataDedup" and "RLBigDataLinkage".
The weight of a record pair is calculated by $\log_{2}\frac{M}{U}$, where $M$ and $U$ are the estimated m- and u-probabilities
for the present comparison pattern. If a string comparator is used, weights
are first calculated based on a binary table in which all comparison
values greater than or equal to cutoff are set to one and all others to zero.
The resulting weight is then adjusted by adding, for every pair,
$\log_{2}\left(\prod_{j:s^{i}_{j}\geq \textit{cutoff}}s^{i}_{j}\right)$, where
$s^{i}_{j}$ is the value of the string metric for attribute $j$ in
data pair $i$.
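The two-step calculation described above can be sketched in base R as follows. This is only an illustration: pair_weight is a hypothetical helper, not a function of the package, and the m- and u-probabilities passed to it are made-up numbers standing in for the EM estimates.

```r
# Hypothetical helper (not part of the package): weight of one record pair.
#   s      : string-metric values in [0, 1], one per attribute
#   m, u   : assumed m- and u-probabilities of the binarized pattern
#   cutoff : threshold used to binarize the comparison values
pair_weight <- function(s, m, u, cutoff = 0.95) {
  agree  <- s >= cutoff              # binary comparison pattern
  base   <- log2(m / u)              # weight of the binarized pattern
  adjust <- log2(prod(s[agree]))     # never positive; 0 if nothing agrees
  list(pattern = as.integer(agree), weight = base + adjust)
}

# Three attributes: exact agreement, near agreement, clear disagreement.
res <- pair_weight(s = c(1, 0.97, 0.60), m = 0.8, u = 0.001)
res$pattern
res$weight
```

Because every factor in the product is at most one, the adjustment can only lower the weight: a pair that agrees approximately is ranked below a pair with the same binary pattern that agrees exactly.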
The appropriate value of cutoff depends on the choice of string
comparator. The default is chosen for jarowinkler; a lower value (e.g. 0.7)
is recommended for levenshteinSim.
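A possible way to apply this advice is shown below. The sketch assumes that the RecordLinkage package with its example data set RLdata500 is installed and that pairs are built with compare.dedup; the blocking fields are an arbitrary choice for illustration, and the code is skipped if the package is not available.

```r
# Assumption: RecordLinkage is installed; otherwise nothing is run.
if (requireNamespace("RecordLinkage", quietly = TRUE)) {
  library(RecordLinkage)
  data(RLdata500)

  # Jaro-Winkler comparison values: keep the default cutoff of 0.95.
  pairs_jw <- compare.dedup(RLdata500, blockfld = list(1, 3, 5, 6, 7),
                            strcmp = TRUE, strcmpfun = jarowinkler)
  pairs_jw <- emWeights(pairs_jw)

  # Levenshtein similarity: use a lower cutoff such as 0.7.
  pairs_lv <- compare.dedup(RLdata500, blockfld = list(1, 3, 5, 6, 7),
                            strcmp = TRUE, strcmpfun = levenshteinSim)
  pairs_lv <- emWeights(pairs_lv, cutoff = 0.7)
}
```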
Estimation of $M$ and $U$ is done by an EM algorithm, implemented by
mygllm. For every comparison
pattern, the estimated numbers of matches and non-matches are used to compute
the corresponding probabilities. Estimations based on the average
frequencies of values and given error rates are taken as initial values.
In our experience, this increases stability and performance of the
EM algorithm.
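To make the estimation step more concrete, here is a minimal base R sketch of an EM iteration for a two-class (match / non-match) model over binary comparison patterns. It is a simplified stand-in that assumes conditional independence between attributes; it is not mygllm, and it does not derive starting values from value frequencies and error rates. All function names (pattern_lik, posterior, em_sketch), the synthetic counts and the starting values are assumptions made for the example.

```r
# Likelihood of each binary pattern (rows of `patterns`) given per-attribute
# agreement probabilities `prob`, assuming conditional independence.
pattern_lik <- function(patterns, prob) {
  apply(patterns, 1, function(g) prod(prob^g * (1 - prob)^(1 - g)))
}

# Posterior probability that a pair showing each pattern is a match.
posterior <- function(patterns, m, u, p) {
  lm <- p * pattern_lik(patterns, m)
  lu <- (1 - p) * pattern_lik(patterns, u)
  lm / (lm + lu)
}

em_sketch <- function(patterns, counts, m, u, p, n_iter = 500) {
  for (iter in seq_len(n_iter)) {
    post <- posterior(patterns, m, u, p)     # E-step
    n_m  <- counts * post                    # expected matches per pattern
    n_u  <- counts * (1 - post)              # expected non-matches per pattern
    m <- colSums(patterns * n_m) / sum(n_m)  # M-step
    u <- colSums(patterns * n_u) / sum(n_u)
    p <- sum(n_m) / sum(counts)
  }
  # Per-pattern M and U from the estimated numbers of matches / non-matches.
  post <- posterior(patterns, m, u, p)
  n_m  <- counts * post
  n_u  <- counts * (1 - post)
  M <- n_m / sum(n_m)
  U <- n_u / sum(n_u)
  list(m = m, u = u, p = p, weight = log2(M / U))
}

# Synthetic pattern counts generated from known parameters (illustration only).
patterns <- as.matrix(expand.grid(a1 = 0:1, a2 = 0:1, a3 = 0:1))
m_true <- c(0.95, 0.90, 0.85)
u_true <- c(0.05, 0.10, 0.20)
p_true <- 0.1
counts <- 10000 * (p_true * pattern_lik(patterns, m_true) +
                   (1 - p_true) * pattern_lik(patterns, u_true))

fit <- em_sketch(patterns, counts, m = rep(0.8, 3), u = rep(0.2, 3), p = 0.05)
round(fit$m, 3)
round(fit$weight, 2)
```

In this sketch the starting values are simply passed in by the caller; reasonable starting values matter because EM only converges to a local optimum.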
The "RLBigData" method writes the individual weight
for every record pair into the database if called with
store.weights = TRUE. This speeds up subsequent calls of the
classification function emClassify and is in general recommended
if several classification calls are to be made (e.g. for testing different
thresholds). However, if a very large number of record pairs is processed,
saving individual weights can lead to excessive disk usage; in this case
store.weights = FALSE may be a better choice. Subsequent calls to
emClassify will then calculate individual weights on the fly
during classification without saving them.
Some progress messages are printed to the message stream (see
message) if verbose == TRUE.
This includes progress bars, but these are suppressed if output is diverted by
sink to avoid cluttering the output file.
emClassify for classification of weighted pairs.
epiWeights for a different approach to weight calculation.