
RecordLinkage (version 0.2-0)

epiWeights: Calculate EpiLink weights

Description

Calculates weights for record pairs based on the EpiLink approach (see references).

Usage

epiWeights(rpairs, e = 0.01, f = rpairs$frequencies)

Arguments

rpairs
RecLinkData object. Record pairs for which weights are to be calculated.
e
Numeric vector. Estimated error rate(s).
f
Numeric vector. Average frequency of attribute values.

Value

  • A copy of rpairs with the calculated weights stored in component rpairs$Wdata.

Details

This function calculates weights for record pairs based on the approach used by Contiero et al. in the EpiLink record linkage software (see references).

The weight for a record pair $(x^{1},x^{2})$ is computed by the formula $$\frac{\sum_{i}w_{i}s(x^{1}_{i},x^{2}_{i})}{\sum_{i}w_{i}}$$ where $s(x^{1}_{i},x^{2}_{i})$ is the value of a string comparison of records $x^{1}$ and $x^{2}$ in the $i$-th field and $w_{i}$ is a weighting factor computed by $$w_{i}=\log_{2}\frac{1-e_{i}}{f_{i}}$$ where $f_{i}$ denotes the average frequency of values and $e_{i}$ the estimated error rate for field $i$.

String comparison values are taken from the record pairs as they were generated with compare.dedup or compare.linkage. Binary comparison patterns can also be used, but in general they yield poor results.

By default, the average frequency of values is taken from the object rpairs. Both the frequency f and the error rate e can be set either to a single value, which is recycled across all fields, or to a vector with a distinct value for every field.

The error rates and frequencies must satisfy $e_{i}\leq{}1-f_{i}$ for all $i$; otherwise the function fails. Some other rare combinations can result in weights with illegal values (NaN, less than 0 or greater than 1). In this case a warning is issued.
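The weight computation can be sketched in a few lines of base R. This is only an illustration of the formula, not the package's implementation: the helper name epilink_weight and the toy comparison matrix are made up for this example. The weighting factor is read as the base-2 logarithm of the ratio (1 - e) / f, the reading under which the constraint e <= 1 - f keeps every factor non-negative. Missing comparison values are not handled here.

```r
# Illustrative sketch of the EpiLink weight formula (not the package source).
# epilink_weight() is a hypothetical helper name used only in this example.
epilink_weight <- function(s, e, f) {
  # s: numeric matrix of comparison values in [0, 1],
  #    one row per record pair, one column per field
  # e: estimated error rate(s); f: average frequency of values per field
  stopifnot(is.matrix(s))
  n <- ncol(s)
  e <- rep(e, length.out = n)  # recycle a single value across all fields
  f <- rep(f, length.out = n)
  stopifnot(all(e <= 1 - f))   # condition stated in the text above
  w <- log2((1 - e) / f)       # per-field weighting factor
  as.vector(s %*% w) / sum(w)  # weighted mean of the comparison values
}

# Toy data: three record pairs compared on three fields.
s <- rbind(c(1, 1,   1),   # agreement on every field
           c(1, 0.5, 0),   # partial agreement
           c(0, 0,   0))   # no agreement
f <- c(0.25, 0.125, 0.5)   # assumed average frequencies

# e = 0 is chosen only so that the factors come out as 2, 3 and 1.
weights <- epilink_weight(s, e = 0, f = f)
print(weights)
```

Fields with rarer values (smaller f) receive larger factors, so agreement on them moves the weight more than agreement on fields with common values.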

References

P. Contiero et al., The EpiLink record linkage software, in: Methods of Information in Medicine 2005, 44 (1), 66–71.

See Also

epiClassify for classification based on EpiLink weights.

Examples

library(RecordLinkage)

# generate record pairs, comparing fields with a string similarity measure
data(RLdata500)
p <- compare.dedup(RLdata500, strcmp = TRUE, strcmpfun = levenshteinSim,
  identity = identity.RLdata500)

# calculate weights
p <- epiWeights(p)

# classify with a threshold of 0.6 and show results
summary(epiClassify(p, 0.6))
