PPRL (version 0.3.8)

ProbabilisticLinkage: Probabilistic Record Linkage

Description

Probabilistic Record Linkage of two data sets using distance-based or probabilistic methods.

Usage

ProbabilisticLinkage(IDA, dataA, IDB, dataB, blocking = NULL, similarity)

Value

A data.frame containing pairs of IDs, their corresponding similarity value and the match status as determined by the linkage procedure.

Arguments

IDA

A character vector or integer vector containing the IDs of the first data.frame.

dataA

A data.frame containing the data to be linked and all linking variables as specified in SelectBlockingFunction and SelectSimilarityFunction.

IDB

A character vector or integer vector containing the IDs of the second data.frame.

dataB

A data.frame containing the data to be linked and all linking variables as specified in SelectBlockingFunction and SelectSimilarityFunction.

blocking

Optional blocking variables. See SelectBlockingFunction.

similarity

Variables used for linking and their respective linkage methods as specified in SelectSimilarityFunction.
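As a rough illustration of why the optional blocking argument matters, the base R sketch below counts candidate pairs with and without an exact blocking key. It does not call PPRL. The data frames `a` and `b` and the column names `id` and `yob` are hypothetical, and `merge` only mimics the effect of exact blocking, not how the package implements it.

```r
# Two small hypothetical data sets; names and values are illustrative only
a <- data.frame(id = c("a1", "a2", "a3"),
                yob = c("1980", "1980", "1975"),
                stringsAsFactors = FALSE)
b <- data.frame(id = c("b1", "b2"),
                yob = c("1980", "1975"),
                stringsAsFactors = FALSE)

# Without blocking: every record in a is paired with every record in b
# (by = NULL makes merge() return the Cartesian product)
all_pairs <- merge(a, b, by = NULL, suffixes = c("_a", "_b"))

# With exact blocking on year of birth: only pairs sharing the key remain
blocked_pairs <- merge(a, b, by = "yob", suffixes = c("_a", "_b"))

nrow(all_pairs)      # 6 candidate pairs
nrow(blocked_pairs)  # 3 candidate pairs
```

Fewer candidate pairs means fewer similarity computations, at the cost of missing true matches whose blocking key differs between the two data sets.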

Details

Before calling ProbabilisticLinkage, set up the linking variables and their comparison methods with SelectSimilarityFunction. Blocking variables are optional and are set up with SelectBlockingFunction; both functions document further options. The linkage follows the Fellegi-Sunter model, with the match weights estimated by the EM algorithm (Winkler 1988).
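The Fellegi-Sunter weights mentioned above can be sketched in a few lines of base R. This is a minimal illustration of the textbook formulas, not the package's internal code. The m- and u-probabilities are made-up values that an EM step would normally estimate, and the variable names are hypothetical.

```r
# Hypothetical m- and u-probabilities for two linking variables
# (illustrative values; in practice these would be estimated, e.g. by EM)
m <- c(first_name = 0.95, last_name = 0.90)  # P(agree | true match)
u <- c(first_name = 0.05, last_name = 0.01)  # P(agree | non-match)

# Fellegi-Sunter log2 weights for agreement and disagreement
agree_weight    <- log2(m / u)
disagree_weight <- log2((1 - m) / (1 - u))

# Agreement pattern for one candidate pair:
# first name agrees, last name disagrees
gamma <- c(first_name = TRUE, last_name = FALSE)

# Total match weight: sum of the per-variable weights,
# assuming conditional independence between variables
total_weight <- sum(ifelse(gamma, agree_weight, disagree_weight))
total_weight
```

Pairs with a high total weight are more likely to be matches. Where to place the classification thresholds is a separate decision that this sketch does not cover.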

See Also

CreateBF, CreateCLK, DeterministicLinkage, SelectBlockingFunction, SelectSimilarityFunction, StandardizeString

Examples

# load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, header = FALSE, sep = "\t",
  colClasses = "character")

# define year of birth (V3) as blocking variable
bl <- SelectBlockingFunction("V3", "V3", method = "exact")

# select first name and last name as linking variables,
# to be linked using the Jaro-Winkler similarity measure (first name)
# and the Levenshtein distance (last name)
l1 <- SelectSimilarityFunction("V7", "V7", method = "jw")
l2 <- SelectSimilarityFunction("V8", "V8", method = "lv")

# Link the data as specified in bl and l1/l2
# (in this small example data is linked to itself)
res <- ProbabilisticLinkage(testData$V1, testData,
  testData$V1, testData, similarity = c(l1, l2), blocking = bl)
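
To get a feel for what a Levenshtein-based comparison measures, the sketch below uses `adist` from base R's utils package rather than any PPRL function. The normalization to a 0 to 1 similarity is one common convention and an assumption here; how the package scales its "lv" method is not stated on this page.

```r
# Generalized Levenshtein distance from base R (utils); not a PPRL call
s1 <- "kitten"
s2 <- "sitting"
d <- adist(s1, s2)[1, 1]  # number of single-character edits

# One common normalization to a similarity in [0, 1]
# (an assumption for illustration, not necessarily what PPRL does)
sim <- 1 - d / max(nchar(s1), nchar(s2))

d
sim
```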