RLBigDataDedup) or linkage of two datasets (RLBigDataLinkage).
RLBigDataDedup(dataset, identity = NA, blockfld = list(), exclude = numeric(0), strcmp = numeric(0), strcmpfun = "jarowinkler", phonetic = numeric(0), phonfun = "pho_h")
RLBigDataLinkage(dataset1, dataset2, identity1 = NA, identity2 = NA, blockfld = list(), exclude = numeric(0), strcmp = numeric(0), strcmpfun = "jarowinkler", phonetic = numeric(0), phonfun = "pho_h")
dataset[i,] and dataset[j,] are a true match if and o
FALSE, no string comparison will be used; if TRUE, string comparison will be used for all columns; if a numeric or character vector
"jarowinkler" and "levenshtein".
strcmp
"pho_h" is supported (see pho_h).
"RLBigDataDedup" or "RLBigDataLinkage", depending on the called function.
dbDriver("SQLite") and a connection established and stored in the returned object. Extension functions for phonetic code and string comparison are loaded into the database. The records in dataset or dataset1 and dataset2 are stored in tables "data" or "data1" and "data2", respectively, and indices are created on all columns involved in blocking.
"RLBigDataDedup" and "RLBigDataLinkage".
They make up the initial stage in a Record Linkage process using large data sets (>= 1,000,000 record pairs) after possibly normalizing the data. Two general scenarios are reflected by the two functions: RLBigDataDedup works on a single data set which is to be deduplicated, while RLBigDataLinkage is intended for linking two data sets together. Their usage follows the functions compare.dedup and compare.linkage, which are recommended for smaller amounts of data, e.g. training sets.
Datasets are represented as data frames or matrices (typically of type character), each row representing one record, each column representing one attribute (like first name, date of birth, ...). Row names are not retained in the record pairs. If an identifier other than row number is needed, it should be supplied as a designated column and excluded from comparison (see note on exclude below).
In case of RLBigDataLinkage, the two datasets must have the same number of columns and it is assumed that their column classes and semantics match. If present, the column names of dataset1 are assigned to dataset2 in order to enforce a matching format. Therefore, column names used in blockfld or other arguments refer to dataset1.
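The name alignment described above can be pictured with a few lines of base R. This is only a sketch of the idea, not the package's internal code; the toy data frames and the column names `fname`, `by`, `first` and `year` are assumptions made for illustration.

```r
# Toy inputs; all column names are hypothetical.
dataset1 <- data.frame(fname = c("ANNA", "CARL"), by = c(1970, 1981),
                       stringsAsFactors = FALSE)
dataset2 <- data.frame(first = "ANNA", year = 1970,
                       stringsAsFactors = FALSE)

# Both datasets must have the same number of columns.
stopifnot(ncol(dataset1) == ncol(dataset2))

# The column names of dataset1 are carried over to dataset2.
colnames(dataset2) <- colnames(dataset1)

# A blocking field given by name is therefore resolved against dataset1.
block_idx <- match("by", colnames(dataset1))
```

After this step a name such as "by" denotes the same column position in both datasets, which is why arguments like blockfld only need to refer to dataset1.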
Each element of blockfld specifies a set of columns in which two records must agree to be included in the output. Each blocking definition in the list is applied individually, and the sets obtained thereby are combined by a union operation. If blockfld is FALSE, no blocking will be performed, which leads to a large number of record pairs ($\frac{n(n-1)}{2}$ where $n$ is the number of records).
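The effect of blocking can be sketched in base R on a toy data set. The code below does not use the package or its database backend; the records, the helper `block_pairs` and the column layout are assumptions chosen for illustration. It enumerates all $\frac{n(n-1)}{2}$ pairs, applies each blocking definition separately and takes the union of the results.

```r
# Toy records; values and column names are made up.
records <- data.frame(
  fname = c("ANNA", "ANNA", "BERND", "BERND", "CARL"),
  by    = c(1970, 1970, 1970, 1981, 1981),
  bm    = c(1, 1, 5, 5, 5),
  stringsAsFactors = FALSE
)

# All unordered pairs (i < j) of row indices: n * (n - 1) / 2 of them.
n <- nrow(records)
all_pairs <- t(combn(n, 2))
colnames(all_pairs) <- c("id1", "id2")

# Hypothetical helper: keep the pairs that agree on every column in `cols`.
block_pairs <- function(data, pairs, cols) {
  agree <- rep(TRUE, nrow(pairs))
  for (col in cols) {
    agree <- agree & data[pairs[, "id1"], col] == data[pairs[, "id2"], col]
  }
  pairs[agree, , drop = FALSE]
}

# Two blocking definitions: first name, or year and month of birth.
blockfld <- list(1, c(2, 3))
per_block <- lapply(blockfld, function(cols) block_pairs(records, all_pairs, cols))

# Union of the per-definition results; duplicate pairs are counted once.
candidates <- unique(do.call(rbind, per_block))

n_all <- nrow(all_pairs)
n_blocked <- nrow(candidates)
```

Here the pair (1, 2) satisfies both definitions but appears only once in `candidates`, and the number of candidate pairs is much smaller than the number of all pairs.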
Fields can be excluded from the linkage process by supplying their column index in the vector exclude, which is especially useful for external identifiers. Excluded fields can still be used for blocking, also with phonetic code.
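As a small sketch of how an external identifier might be carried along, the following base R code copies row names into a regular column and determines the index to supply as exclude. The data, the identifiers and the column name `ext_id` are assumptions for illustration.

```r
# Toy data; names and identifiers are made up.
records <- data.frame(
  fname = c("ANNA", "ANNA", "CARL"),
  lname = c("MEYER", "MEIER", "SCHMIDT"),
  stringsAsFactors = FALSE
)
rownames(records) <- c("p17", "p42", "p99")

# Row names are not retained, so copy them into a designated column ...
records$ext_id <- rownames(records)

# ... and look up that column's index to pass it as `exclude`.
exclude <- which(colnames(records) == "ext_id")

# Columns that would still take part in the comparison.
compared <- setdiff(seq_len(ncol(records)), exclude)
```

The resulting index would then be passed along the lines of RLBigDataDedup(records, exclude = exclude).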
Phonetic codes and string similarity measures are supported for enhanced
detection of misspellings. Applying a phonetic code leads to binary
similarity values, where 1 denotes equality of the generated phonetic code.
A string comparator leads to a similarity value in the range $[0,1]$.
Using string comparison on a field for which a phonetic code
is generated is possible, but issues a warning.
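The difference between the two kinds of similarity values can be illustrated in base R. The sketch below is not the database implementation used by these functions: `lev_sim` is an assumed normalisation of the Levenshtein distance (distance divided by the length of the longer string), and `toy_code` is a deliberately crude stand-in for a phonetic code, not "pho_h".

```r
# Levenshtein-based similarity in [0, 1]; 1 means identical strings.
# Assumed normalisation: 1 - distance / length of the longer string.
lev_sim <- function(a, b) {
  d <- mapply(function(x, y) utils::adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  1 - d / pmax(nchar(a), nchar(b), 1)
}

# Hypothetical stand-in for a phonetic code: upper-case and drop vowels.
toy_code <- function(x) gsub("[AEIOUY]", "", toupper(x))

# Binary similarity: 1 if the codes are equal, 0 otherwise.
phon_sim <- function(a, b) as.numeric(toy_code(a) == toy_code(b))

lev_sim("MEYER", "MEIER")       # graded value between 0 and 1
phon_sim("MEYER", "MEIER")      # codes agree
phon_sim("SCHMIDT", "SCHMITT")  # codes differ
```

A string comparator grades near-misses, whereas a phonetic code either maps both spellings to the same code or does not.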
In contrast to the compare.* functions, phonetic coding and string comparison are not carried out in R, but by database functions. Supported functions are "pho_h" for phonetic coding and "jarowinkler" and "levenshtein" for string comparison. See the documentation for their R equivalents (phonetic functions, string comparison) for further information.
"RLBigDataDedup", "RLBigDataLinkage", compare.dedup, compare.linkage, the vignette "Classes for record linkage of big data sets".
data(RLdata500)
data(RLdata10000)
# deduplication without blocking, use string comparator on names
rpairs <- RLBigDataDedup(RLdata500, strcmp = 1:4)
# linkage with blocking on columns 1 and 7 (in RLdata500: first component of
# first name and day of birth), use phonetic code on first components of
# first and last name
rpairs <- RLBigDataLinkage(RLdata500, RLdata10000, blockfld = c(1, 7),
  phonetic = c(1, 3))
# deduplication with blocking on either column 1 (first component of first
# name) or complete date of birth, use string comparator on all fields,
# include identity information
rpairs <- RLBigDataDedup(RLdata500, identity = identity.RLdata500, strcmp = TRUE,
  blockfld = list(1, c(5, 6, 7)))