epiWeights(rpairs, e = 0.01, f, ...)
## S3 method for class 'RecLinkData':
epiWeights(rpairs, e = 0.01, f = rpairs$frequencies)
## S3 method for class 'RLBigData':
epiWeights(rpairs, e = 0.01, f = getFrequencies(rpairs),
  withProgressBar = (sink.number() == 0))
Returns rpairs with the weights attached. See the class documentation ("RecLinkData", "RLBigDataDedup" and "RLBigDataLinkage") on how weights are stored.
For the "RLBigData" method, the returned object is only a shallow copy in the sense that it links to the same database file as rpairs.
The "RLBigData" method writes a table with weights in the database file of rpairs, which means that changes apply to the provided object (similar to pass-by-reference style). If the existing state of rpairs is to be preserved, a copy should be made using clone before applying this function.
Methods exist for S3 objects of class "RecLinkData" as well as S4 objects of classes "RLBigDataDedup" and "RLBigDataLinkage".
The weight for a record pair $(x^{1},x^{2})$ is computed by
the formula
$$\frac{\sum_{i}w_{i}s(x^{1}_{i},x^{2}_{i})}{\sum_{i}w_{i}}$$
where $s(x^{1}_{i},x^{2}_{i})$ is the value of a string comparison of
records $x^{1}$ and $x^{2}$ in the $i$-th field and
$w_{i}$ is a weighting factor computed by
$$w_{i}=\log_{2}\frac{1-e_{i}}{f_{i}}$$
where $f_{i}$ denotes the
average frequency of values and $e_{i}$ the estimated error rate
for field $i$.
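As a rough illustration of the formula above, the following base R sketch evaluates the weight for a single record pair. The helper `epilink_weight` and the input values are hypothetical and chosen only for demonstration; this is not the package's internal implementation.

```r
# Hypothetical helper (not part of RecordLinkage): evaluates the weight
# formula above for one record pair.
#   s: per-field string comparison values in [0, 1]
#   e: per-field estimated error rates
#   f: per-field average frequencies of values
epilink_weight <- function(s, e, f) {
  w <- log2((1 - e) / f)   # weighting factor w_i for every field
  sum(w * s) / sum(w)      # weighted mean of the comparison values
}

# Made-up inputs for three fields: exact match, near match, mismatch
s <- c(1, 0.8, 0)
e <- c(0.01, 0.01, 0.01)
f <- c(0.001, 0.01, 0.5)

epilink_weight(s, e, f)
```

Fields with rare values (small $f_{i}$) receive a large $w_{i}$, so agreement on them moves the weight more than agreement on a field with few distinct values.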
String comparison values are taken from the record pairs as they were generated with compare.dedup or compare.linkage.
The use of binary patterns is possible, but in general yields poor results.
The average frequency of values is by default taken from the object rpairs. Both frequency and error rate e can be set to a single value, which will be recycled, or to a vector with distinct error rates for every field.
The error rate(s) and frequency (or frequencies) must satisfy $e_{i}\leq{}1-f_{i}$ for all $i$, otherwise the function fails. Also, some other rare combinations can result in weights with illegal values (NaN, less than 0 or greater than 1). In this case a warning is issued.
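The condition $e_{i}\leq 1-f_{i}$ is equivalent to $(1-e_{i})/f_{i}\geq 1$, which keeps every weighting factor $w_{i}$ non-negative. A minimal sketch of recycling and checking the parameters before calling the function might look as follows. The helper `check_epi_params` is hypothetical and does not reproduce the package's own validation.

```r
# Hypothetical helper (not part of RecordLinkage): recycles e and f to the
# number of fields and checks the condition e_i <= 1 - f_i.
check_epi_params <- function(e, f, n_fields) {
  e <- rep(e, length.out = n_fields)   # a single value is recycled
  f <- rep(f, length.out = n_fields)
  ok <- e <= 1 - f
  if (!all(ok)) {
    stop("e > 1 - f for field(s): ", paste(which(!ok), collapse = ", "))
  }
  log2((1 - e) / f)                    # non-negative when the condition holds
}

# One error rate for all fields, distinct frequencies per field
check_epi_params(e = 0.01, f = c(0.1, 0.5), n_fields = 2)
```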
By default, the "RLBigDataDedup" method displays a progress bar unless output is diverted by sink, e.g. when processing a Sweave file.
epiClassify for classification based on EpiLink weights.
emWeights for a different approach for weight calculation.
# generate record pairs
library(RecordLinkage)
data(RLdata500)
p <- compare.dedup(RLdata500, strcmp = TRUE, strcmpfun = levenshteinSim,
                   identity = identity.RLdata500)
# calculate weights
p <- epiWeights(p)
# classify and show results
summary(epiClassify(p, 0.6))