RecordLinkage (version 0.4-12.4)

RLBigDataDedup: Constructors for big data objects.

Description

These are constructors which initialize a record linkage setup for big datasets, either deduplication of one (RLBigDataDedup) or linkage of two datasets (RLBigDataLinkage).

Usage

RLBigDataDedup(dataset, identity = NA, blockfld = list(), exclude = numeric(0), 
  strcmp = numeric(0), strcmpfun = "jarowinkler", phonetic = numeric(0), 
  phonfun = "soundex")

RLBigDataLinkage(dataset1, dataset2, identity1 = NA, identity2 = NA, 
  blockfld = list(), exclude = numeric(0), strcmp = numeric(0), 
  strcmpfun = "jarowinkler", phonetic = numeric(0), phonfun = "soundex")

Value

An object of class "RLBigDataDedup" or "RLBigDataLinkage", depending on the called function.

Arguments

dataset, dataset1, dataset2

Table of records to be deduplicated or linked. Either a data frame or a matrix.

identity, identity1, identity2

Optional vectors (converted to factors) for identifying true matches and non-matches. In a deduplication process, two records dataset[i,] and dataset[j,] are a true match if and only if identity[i]==identity[j]. In a linkage process, two records dataset1[i,] and dataset2[j,] are a true match if and only if identity1[i]==identity2[j].

blockfld

Blocking field definition. A numeric or character vector or a list of several such vectors, corresponding to column numbers or names. See details and examples.

exclude

Columns to be excluded. A numeric or character vector corresponding to columns of dataset, or of dataset1 and dataset2, which should be excluded from comparison.

strcmp

Determines usage of string comparison. If FALSE, no string comparison will be used; if TRUE, string comparison will be used for all columns; if a numeric or character vector is given, the string comparison will be used for the specified columns.

strcmpfun

Character string representing the string comparison function. Possible values are "jarowinkler" and "levenshtein".

phonetic

Determines usage of phonetic code. Used in the same manner as strcmp.

phonfun

Character string representing the phonetic function. Currently, only "soundex" is supported (see soundex).
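The true-match rule stated for identity above can be illustrated in plain R. This is only a sketch with a made-up identity vector (ids is a hypothetical name, not part of the package); it lists the index pairs that would count as true matches in a deduplication setting.

```r
# Hypothetical identity vector: records 1 and 3 describe the same entity,
# as do records 2 and 5.
ids <- c("a", "b", "a", "c", "b")

# All unordered index pairs (i, j) with i < j, one pair per column
pairs <- combn(length(ids), 2)

# A pair is a true match if and only if the identities are equal
is_match <- ids[pairs[1, ]] == ids[pairs[2, ]]

# One row per true match: columns are i and j
true_matches <- t(pairs[, is_match, drop = FALSE])
print(true_matches)
```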

Side effects

The RSQLite database driver is initialized via dbDriver("SQLite"), and a connection is established and stored in the returned object. Extension functions for phonetic code and string comparison are loaded into the database. The records in dataset, or in dataset1 and dataset2, are stored in tables "data", or "data1" and "data2", respectively, and indices are created on all columns involved in blocking.

Author

Andreas Borg, Murat Sariyar

Details

These functions act as constructors for the S4 classes "RLBigDataDedup" and "RLBigDataLinkage". They make up the initial stage in a record linkage process using large data sets (>= 1,000,000 record pairs) after possibly normalizing the data. Two general scenarios are reflected by the two functions: RLBigDataDedup works on a single data set which is to be deduplicated, whereas RLBigDataLinkage is intended for linking two data sets together. Their usage follows that of the functions compare.dedup and compare.linkage, which are recommended for smaller amounts of data, e.g. training sets.

Datasets are represented as data frames or matrices (typically of type character), each row representing one record and each column representing one attribute (such as first name or date of birth). Row names are not retained in the record pairs. If an identifier other than row number is needed, it should be supplied as a designated column and excluded from comparison (see note on exclude below).

In case of RLBigDataLinkage, the two datasets must have the same number of columns and it is assumed that their column classes and semantics match. If present, the column names of dataset1 are assigned to dataset2 in order to enforce a matching format. Therefore, column names used in blockfld or other arguments refer to dataset1.
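The effect of this renaming can be mimicked in plain R. The sketch below uses two hypothetical toy tables and only illustrates the described behaviour; it is not the package's internal code.

```r
# Hypothetical toy tables with the same number of columns but different names
dataset1 <- data.frame(fname = c("ANNA", "PETER"), lname = c("MEIER", "BAUER"))
dataset2 <- data.frame(first = "ANNA", last = "MAIER")

# Both tables must have the same number of columns
stopifnot(ncol(dataset1) == ncol(dataset2))

# Mimic the described behaviour: names of dataset1 are imposed on dataset2
colnames(dataset2) <- colnames(dataset1)

# A name such as "lname" now refers to the same column position in both tables
match("lname", colnames(dataset2))
```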

Each element of blockfld specifies a set of columns in which two records must agree to be included in the output. Each blocking definition in the list is applied individually, and the sets of record pairs obtained are combined by a union operation. If blockfld is FALSE, no blocking will be performed, which leads to a large number of record pairs (for deduplication, \(\frac{n(n-1)}{2}\) where \(n\) is the number of records).
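The union semantics of several blocking definitions can be sketched in plain R on a toy table. The data and the helper block_pairs are hypothetical and serve only as an illustration; the package itself generates the pairs differently.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  fname = c("ANNA", "ANNA", "PETER", "PETRA"),
  by    = c(1980, 1975, 1975, 1975),
  bm    = c(1, 6, 6, 6),
  stringsAsFactors = FALSE
)

# Hypothetical helper: index pairs (i < j) that agree on all columns in `cols`
block_pairs <- function(data, cols) {
  idx <- combn(nrow(data), 2)
  agree <- apply(idx, 2, function(p) {
    all(unlist(data[p[1], cols, drop = FALSE]) ==
        unlist(data[p[2], cols, drop = FALSE]))
  })
  paste(idx[1, agree], idx[2, agree], sep = "-")
}

# Analogue of blockfld = list("fname", c("by", "bm")):
# each definition is applied on its own, then the results are united
candidates <- union(block_pairs(toy, "fname"), block_pairs(toy, c("by", "bm")))
print(candidates)

# Without blocking, deduplication would consider all n(n-1)/2 pairs
n <- nrow(toy)
all_pairs <- n * (n - 1) / 2
print(all_pairs)
```

In this toy case, blocking keeps four of the six possible pairs. For the 500 records of RLdata500, deduplication without blocking already yields 124750 pairs.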

Fields can be excluded from the linkage process by supplying their column index in the vector exclude, which is especially useful for external identifiers. Excluded fields can still be used for blocking, also with phonetic code.
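A minimal sketch of this use of exclude follows, assuming that RLdata500 is available and that a record identifier is prepended as a new first column. The column name rec_id is hypothetical and not part of the data set.

```r
library(RecordLinkage)
data(RLdata500)

# Prepend a hypothetical record identifier column (not part of RLdata500)
dat <- cbind(rec_id = seq_len(nrow(RLdata500)), RLdata500)

# Exclude column 1 (rec_id) from comparison; block on column 4,
# which is lname_c1 after the identifier has been prepended
rpairs <- RLBigDataDedup(dat, exclude = 1, blockfld = 4)
```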

Phonetic codes and string similarity measures are supported for enhanced detection of misspellings. Applying a phonetic code leads to binary similarity values, where 1 denotes equality of the generated phonetic code. A string comparator leads to a similarity value in the range \([0,1]\). Using string comparison on a field for which a phonetic code is generated is possible, but issues a warning.

In contrast to the compare.* functions, phonetic coding and string comparison are not carried out in R, but by database functions. Supported functions are "soundex" for phonetic coding and "jarowinkler" and "levenshtein" for string comparison. See the documentation for their R equivalents (phonetic functions, string comparison) for further information.
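To get a feeling for the two kinds of similarity values, the R equivalents exported by the package can be called directly. This sketch uses soundex, jarowinkler and levenshteinSim on two example strings chosen here for illustration; it does not call the database functions used internally by the constructors.

```r
library(RecordLinkage)

a <- "MEIER"
b <- "MAIER"

# Phonetic comparison: 1 if the Soundex codes are equal, 0 otherwise
phon_sim <- as.numeric(soundex(a) == soundex(b))

# String comparators: similarity values between 0 and 1
jw_sim  <- jarowinkler(a, b)
lev_sim <- levenshteinSim(a, b)

print(c(phonetic = phon_sim, jarowinkler = jw_sim, levenshtein = lev_sim))
```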

See Also

"RLBigDataDedup", "RLBigDataLinkage", compare.dedup, compare.linkage, the vignette "Classes for record linkage of big data sets".

Examples

library(RecordLinkage)
data(RLdata500)
data(RLdata10000)
# deduplication without blocking, use string comparator on names
rpairs <- RLBigDataDedup(RLdata500, strcmp = 1:4)
# linkage with blocking on first name (column 1) and day of birth (column 7),
# use phonetic code on first components of first and last name
rpairs <- RLBigDataLinkage(RLdata500, RLdata10000, blockfld = c(1, 7),
  phonetic = c(1, 3))
# deduplication with blocking on either first name (column 1) or complete
# date of birth (columns 5 to 7), use string comparator on all fields,
# include identity information
rpairs <- RLBigDataDedup(RLdata500, identity = identity.RLdata500, strcmp = TRUE,
  blockfld = list(1, c(5, 6, 7)))
