PPRL (version 0.3.4)

BloomFilterLinkage: Bloom Filter-based linkage using Multibit Trees

Description

Linking Bloom Filters using Multibit Trees or Union-bit Trees.

Usage

BloomFilterLinkage(IDA, dataA, IDB, dataB, blocking = NULL, similarity)

Arguments

IDA

A character vector or integer vector containing the IDs of the first data.frame.

dataA

A data.frame containing Bloom Filters in the column specified in SelectSimilarityFunctionBF.

IDB

A character vector or integer vector containing the IDs of the second data.frame.

dataB

A data.frame containing Bloom Filters in the column specified in SelectSimilarityFunctionBF.

blocking

Optional external blocking settings, as returned by SelectBlockingFunction. Defaults to NULL (no external blocking).

similarity

The similarity function and threshold settings, as returned by SelectSimilarityFunctionBF.

Value

A data.frame containing ID-pairs considered as links and their respective similarity values.

Details

Two character vectors of Bloom Filters/CLKs, together with their IDs, are compared using tree-based methods. An index tree is built from the first input (A). The records of the second input (B) are then queried against this tree one after another; every record pair whose similarity is above the set threshold is reported as a link.
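
To make the idea of threshold-based querying concrete, the following base R sketch links two small sets of bit strings by brute force using the Tanimoto similarity. It is an illustration only and not the package's implementation: no tree is built, the helper names (bits, tanimoto, brute_force_link), the toy data and the output column names are hypothetical, and treating a similarity equal to the threshold as a link is an assumption. The cardinality check shows a simple pruning bound (the Tanimoto similarity can never exceed the ratio of the smaller to the larger number of set bits); whether and how the package's trees use such a bound is not stated here.

```r
# Convert a "0101..." string into a logical vector of bits.
bits <- function(s) strsplit(s, "", fixed = TRUE)[[1]] == "1"

# Tanimoto similarity: shared set bits divided by bits set in either filter.
tanimoto <- function(a, b) {
  x <- bits(a)
  y <- bits(b)
  n_union <- sum(x | y)
  if (n_union == 0) return(0)
  sum(x & y) / n_union
}

# Brute-force stand-in for the tree query (hypothetical helper):
# every record of B is compared with every record of A.
brute_force_link <- function(idA, bfA, idB, bfB, threshold) {
  cardA <- vapply(bfA, function(s) sum(bits(s)), numeric(1))
  cardB <- vapply(bfB, function(s) sum(bits(s)), numeric(1))
  pairs <- list()
  for (j in seq_along(bfB)) {
    for (i in seq_along(bfA)) {
      hi <- max(cardA[i], cardB[j])
      lo <- min(cardA[i], cardB[j])
      # Pruning: min/max of the cardinalities is an upper bound on the similarity.
      if (hi == 0 || lo / hi < threshold) next
      s <- tanimoto(bfA[i], bfB[j])
      # Assumption: a similarity equal to the threshold counts as a link.
      if (s >= threshold) {
        pairs[[length(pairs) + 1]] <- data.frame(
          ID1 = idA[i], ID2 = idB[j], similarity = s,
          stringsAsFactors = FALSE
        )
      }
    }
  }
  if (length(pairs) == 0) {
    return(data.frame(ID1 = character(0), ID2 = character(0),
                      similarity = numeric(0)))
  }
  do.call(rbind, pairs)
}

# Toy data: 8-bit strings standing in for Bloom Filters.
idA <- c("a1", "a2", "a3")
bfA <- c("11110000", "00001111", "11111100")
idB <- c("b1", "b2")
bfB <- c("11110000", "11111000")

links <- brute_force_link(idA, bfA, idB, bfB, threshold = 0.75)
print(links)
```

A tree-based index is meant to return the same kind of result while avoiding most of the pairwise comparisons that this loop performs.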

Before calling BloomFilterLinkage, the similarity function and threshold must be selected with SelectSimilarityFunctionBF, and the result passed as similarity. To use external blocking, SelectBlockingFunction must be called as well.

References

Bachteler, T., Reiher, J., Schnell, R. (2013): Similarity Filtering with Multibit Trees for Record Linkage. German Record Linkage Center Working Paper WP-GRLC-2013-01.

Kristensen, T. G., Nielsen, J., Pedersen, C. N. (2010): A Tree-based Method for the Rapid Screening of Chemical Fingerprints. Algorithms for Molecular Biology 5(9).

Schnell, R. (2014): An efficient Privacy-Preserving Record Linkage Technique for Administrative Data and Censuses. Journal of the International Association for Official Statistics (IAOS) 30: 263-270.

See Also

PPRL, SelectBlockingFunction, SelectSimilarityFunctionBF, StandardizeString

Examples

# NOT RUN {
# load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, header = FALSE, sep = "\t", colClasses = "character")

# create Bloom Filters
testDataBF <- CreateBF(ID = testData$V1, testData$V7,
  k = 20, padding = 1, q = 2, l = 1000, password = "(H]$6Uh*-Z204q")

# define bloom filter column in data and select similarity function and threshold
lBF <- SelectSimilarityFunctionBF("CLKs","CLKs", method = "mtan",
  threshold = 0.85)

# calculate result (in this example data is linked to itself)
linked <- BloomFilterLinkage(testDataBF$ID, testDataBF, testDataBF$ID, testDataBF,
  blocking = NULL, similarity = lBF)
# }