emWeights(rpairs, cutoff = 0.95, ...)

rpairs: RecLinkData object. The record pairs for which to compute weights.
...: Arguments passed to mygllm.

Returns a RecLinkData object containing all components of rpairs
plus the following: M, U, W and Wdata.
M, U and W correspond to a list of all binary comparison
patterns, sorted ascending from all zeroes to all ones. Wdata
corresponds directly to the record pairs in rpairs$pairs.

String comparison values greater than or equal to cutoff are set to one, all others to zero.
For every pair, the resulting weight is then adjusted by adding
$\log_{2}\left(\prod_{j:s^{i}_{j}\geq \textit{cutoff}}s^{i}_{j}\right)$, where
$s^{i}_{j}$ is the value of the string metric for attribute j in
data pair i.
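
The conversion and the adjustment term can be sketched in base R. This is only an illustration of the formula above, not the package's internal code; the matrix s and its values are invented for the example.

```r
# String-metric values for two record pairs (rows) and three attributes
# (columns). The numbers are made up for illustration.
s <- matrix(c(1.00, 0.97, 0.40,
              0.96, 0.50, 1.00),
            nrow = 2, byrow = TRUE)
cutoff <- 0.95

# Binary comparison pattern: 1 where the metric reaches the cutoff, else 0.
pattern <- (s >= cutoff) * 1

# Adjustment per pair: log2 of the product of all values >= cutoff.
# If no value reaches the cutoff, the empty product is 1 and the term is 0.
adjustment <- apply(s, 1, function(row) log2(prod(row[row >= cutoff])))
```

For string metrics bounded by 1, the adjustment is never positive: an exact agreement contributes nothing, while a near agreement that still passes the cutoff lowers the weight slightly.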
The appropriate value of cutoff depends on the choice of string
comparator. The default is tuned for jarowinkler;
a lower value (e.g. 0.7) is recommended for levenshteinSim.
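
A possible call with the lower cutoff might look as follows. The sketch assumes that the RecordLinkage package is installed and uses its bundled RLdata500 example data; the blocking fields are an arbitrary choice made only to keep the number of pairs small.

```r
library(RecordLinkage)
data(RLdata500)  # example data shipped with the package

# Build comparison patterns with Levenshtein similarity as the string metric.
# Blocking on columns 1 and 3 is an illustrative choice, not a recommendation.
rpairs <- compare.dedup(RLdata500, identity = identity.RLdata500,
                        blockfld = list(1, 3),
                        strcmp = TRUE, strcmpfun = levenshteinSim)

# Lower cutoff, following the recommendation for levenshteinSim.
rpairs <- emWeights(rpairs, cutoff = 0.7)

# Wdata holds one weight per record pair in rpairs$pairs.
summary(rpairs$Wdata)
```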
Estimation of $M$ and $U$ is done by an EM algorithm, implemented by
mygllm. For every comparison
pattern, the estimated numbers of matches and non-matches are used to compute
the corresponding probabilities. Estimates based on the average
frequencies of values and given error rates are taken as initial values.
In our experience, this increases the stability and performance of the
EM algorithm.

See also emClassify for classification of weighted pairs.
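
As an illustration of how M, U and W relate to the comparison patterns, the following base R sketch enumerates all patterns for three attributes and derives probabilities from estimated counts. The counts are invented, the bit order is one reading of "sorted ascending", and the weight formula log2(M / U) is an assumption, since the definition of W is not spelled out here.

```r
# All binary comparison patterns for three attributes, from all zeroes to
# all ones. Assumption: the last attribute varies fastest.
n_attr <- 3
patterns <- as.matrix(expand.grid(rep(list(0:1), n_attr)))[, n_attr:1]

# Hypothetical estimated numbers of matches and non-matches per pattern,
# as an EM fit might produce them (values invented for illustration).
n_match    <- c(0.1, 0.2, 0.2, 1.5, 0.5, 3.0, 4.5, 40.0)
n_nonmatch <- c(900, 40, 30, 5, 20, 3, 1.5, 0.5)

# Probability of each pattern among matches (M) and non-matches (U).
M <- n_match / sum(n_match)
U <- n_nonmatch / sum(n_nonmatch)

# Assumed weight definition: base-2 log-likelihood ratio per pattern.
W <- log2(M / U)
```

Under this reading, patterns with many agreements receive high weights and patterns with few agreements receive low ones.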