SelectSimilarityFunctionBF: Select Similarity Function for Bloom Filter-based methods

Description

Before calling BloomFilterLinkage, a similarity function must be selected for each variable. Each element of the setup contains the two variable names and the method. Some methods accept additional parameters.

Usage

SelectSimilarityFunctionBF(variable1, variable2, method = 'mtan',
  threshold = 0.85, windowSize = 5, looseThreshold = 0.8, tightThreshold = 0.7,
  symdex = TRUE, leaflimit = 3, cores = 1)

Arguments

variable1

name of linking variable 1 in the data.frame. The column must be of type character, numeric or integer, containing the data to be merged. The data vector must have the same length as the ID vector.

variable2

name of linking variable 2 in the data.frame. The column must be of type character, numeric or integer, containing the data to be merged. The data vector must have the same length as the ID vector.

method

linking method. Possible values are:

'mtan' = Tanimoto Similarity/Jaccard Similarity for Bloom Filters/CLKs using Multibit trees
'mham' = Hamming distance for Bloom Filters/CLKs using Multibit trees
'utan' = Tanimoto Similarity/Jaccard Similarity for Bloom Filters/CLKs using union bit trees
'uham' = Hamming distance for Bloom Filters/CLKs using union bit trees
'CCtan' = Tanimoto Similarity/Jaccard Similarity for Bloom Filters/CLKs using canopy clustering
'CCtanXOR' = Tanimoto Similarity/Jaccard Similarity for Bloom Filters/CLKs using canopy clustering with XOR-filtering
'SNtan' = Tanimoto Similarity/Jaccard Similarity for Bloom Filters/CLKs using sorted nearest neighbourhood
'SNtanXOR' = Tanimoto Similarity/Jaccard Similarity for Bloom Filters/CLKs using sorted nearest neighbourhood with XOR-filtering
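The two measures named in the list above can be illustrated with a minimal base R sketch. It is not the package's implementation, which works on Bloom filters/CLKs through the search structures listed; the toy bit vectors and the helper names tanimoto and hammingNorm are hypothetical and only show what is being measured.

```r
# Two toy Bloom filters as logical bit vectors (hypothetical data)
a <- c(1, 1, 0, 1, 0, 0, 1, 0) == 1
b <- c(1, 0, 0, 1, 0, 1, 1, 0) == 1

# Tanimoto (Jaccard) similarity: bits set in both filters / bits set in either
tanimoto <- function(x, y) {
  unionBits <- sum(x | y)
  # Convention assumed for this sketch: two empty filters count as identical
  if (unionBits == 0) return(1)
  sum(x & y) / unionBits
}

# Normalised Hamming distance: share of positions where the filters differ
hammingNorm <- function(x, y) sum(xor(x, y)) / length(x)

simAB <- tanimoto(a, b)      # 3 shared bits out of 5 set in either filter
distAB <- hammingNorm(a, b)  # 2 differing positions out of 8

# With a Tanimoto threshold of 0.85 this pair would not be kept
isCandidate <- simAB >= 0.85
```

Here simAB is 0.6 and distAB is 0.25, so the pair falls below a threshold of 0.85.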

threshold

Numeric value giving the lower bound of the Tanimoto coefficient or the normalised Hamming distance to search for. Must be in the range 0 <= threshold <= 1.

symdex

Logical value controlling SymDex pre-processing for Multibit trees. The default is TRUE; set symdex = FALSE to deactivate it.

leaflimit

Optional parameter for Multibit trees specifying the maximum number of Bloom Filters/CLKs in a leaf.

cores

Optional parameter for Multibit trees specifying the number of parallel threads that shall be used to construct the search tree and perform the search within it.
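One way to choose a value for cores is to derive it from the machine at hand. The sketch below is only a suggestion and assumes that leaving one core free is desirable; the variable names available and nCores are hypothetical. The call to SelectSimilarityFunctionBF is guarded so that the snippet also runs where PPRL is not installed.

```r
# detectCores() may return NA on some platforms, so fall back to 1
available <- parallel::detectCores()
nCores <- if (is.na(available)) 1L else max(1L, available - 1L)

# Pass the value on only if the PPRL package is available
if (requireNamespace("PPRL", quietly = TRUE)) {
  lBF <- PPRL::SelectSimilarityFunctionBF("CLKs", "CLKs", method = "mtan",
    threshold = 0.85, cores = nCores)
}
```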

looseThreshold

Numeric value giving the loose threshold for canopy clustering (methods 'CCtan' and 'CCtanXOR').

tightThreshold

Numeric value giving the tight threshold for canopy clustering (methods 'CCtan' and 'CCtanXOR').

windowSize

Integer value giving the window size for sorted neighbourhood searching (methods 'SNtan' and 'SNtanXOR').
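The effect of a window size can be illustrated with the general sorted neighbourhood idea: records are sorted by a key, and only records that fall within the same sliding window are compared. The base R sketch below is an illustration under stated assumptions, not the package's implementation. How the package derives its sort order is not described here, and the keys, record names, and the variable pairs are hypothetical.

```r
# Hypothetical sort keys for six records
keys <- c(r3 = "0110", r1 = "0001", r5 = "1100",
          r2 = "0011", r6 = "1110", r4 = "1000")
windowSize <- 3
stopifnot(windowSize >= 2)  # a window of 1 would contain no pairs

# Sort the records by key; the names carry the record identifiers
sortedIds <- names(sort(keys))
n <- length(sortedIds)

# Pair each record with the next (windowSize - 1) records in sort order
pairs <- do.call(rbind, lapply(seq_len(n - 1), function(i) {
  j <- seq(i + 1, min(i + windowSize - 1, n))
  data.frame(a = sortedIds[i], b = sortedIds[j], stringsAsFactors = FALSE)
}))

nCandidate <- nrow(pairs)  # pairs compared with the window
nFull <- choose(n, 2)      # pairs compared without any windowing
```

With six records and a window of 3, the sketch compares 9 pairs instead of all 15. A larger window compares more pairs at higher cost.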

Value

Calling the function will return a short confirmation message only.

References

Bachteler, T., Reiher, J., Schnell, R. (2013): Similarity Filtering with Multibit Trees for Record Linkage. German Record Linkage Center Working Paper WP-GRLC-2013-01.

Kristensen, T. G., Nielsen, J., Pedersen, C. N. (2010): A Tree-based Method for the Rapid Screening of Chemical Fingerprints. Algorithms for Molecular Biology 5(9).

Schnell, R. (2014): An efficient Privacy-Preserving Record Linkage Technique for Administrative Data and Censuses. Journal of the International Association for Official Statistics (IAOS) 30: 263-270.

Tai, D., Fang, J. (2012): SymDex: Increasing the Efficiency of Chemical Fingerprint Similarity Searches for Comparing Large Chemical Libraries by Using Query Set Indexing. Journal of Chemical Information and Modeling 52: 1926-1935.

Examples

# NOT RUN {
# load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, header = FALSE, sep = "\t", colClasses = "character")

# create Bloom Filters
testDataBF <- CreateBF(ID = testData$V1, testData$V7,
  k = 20, padding = 1, q = 2, l = 1000, password = "(H]$6Uh*-Z204q")

# define Bloom filter column in data and select similarity function and threshold using
# multibit trees
lBF <- SelectSimilarityFunctionBF("CLKs", "CLKs", method = "mtan", threshold = 0.85,
  symdex = TRUE, leaflimit = 3, cores = 1)

# or

# define Bloom filter column in data and select similarity function and threshold using
# canopy clustering
lBF <- SelectSimilarityFunctionBF("CLKs", "CLKs", method = "CCtan", threshold = 0.85,
  looseThreshold = 0.7, tightThreshold = 0.8)

# or

# define Bloom filter column in data and select similarity function and threshold using
# sorted neighbourhood
lBF <- SelectSimilarityFunctionBF("CLKs", "CLKs", method = "SNtan", threshold = 0.85,
  windowSize = 5)

# calculate result (in this example data is linked to itself)
linked <- BloomFilterLinkage(testDataBF$ID, testDataBF, testDataBF$ID, testDataBF,
  blocking = NULL, similarity = lBF)
# }