RecordLinkage (version 0.4-11)

getMinimalTrain: Create a minimal training set

Description

Samples a subset of the provided data (comparison patterns) so that every comparison pattern in rpairs is represented in the subset at least once.

Usage

getMinimalTrain(rpairs, nEx = 1)

Arguments

rpairs

A "RecLinkData" or "RLBigData" object. The data set from which to create a minimal training set.

nEx

The desired number of examples per comparison pattern.

Value

An object of the same class as rpairs, representing a minimal comprehensive training set. It contains the appropriate subset of comparison patterns (and weights, if present); all other components are copied unchanged.

Details

Our internal research indicates that, for record linkage with supervised classification, small training sets are often sufficient, provided they cover the whole range of comparison patterns present in the data. By default, this function creates a minimal training set: a subset of the record pairs to be classified in which every comparison pattern that occurs is represented by exactly one training example. This minimizes the clerical review needed to label the training set while keeping good classification performance.

Larger training sets can be obtained by setting nEx to a higher number. Up to nEx examples for every comparison pattern are randomly selected, limited by the total number of record pairs with that pattern.
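The selection rule described above can be sketched in base R. This is only an illustration of the idea, not the package's implementation: sample_per_pattern is a hypothetical helper, and the toy data frame holds made-up comparison patterns (1 = agreement, 0 = disagreement).

```r
# Hypothetical helper, NOT part of RecordLinkage: it only illustrates the
# "up to nEx examples per distinct comparison pattern" rule on a data frame.
sample_per_pattern <- function(patterns, nEx = 1) {
  # One key per row, built from all columns (NA becomes the string "NA")
  key <- do.call(paste, c(patterns, sep = "_"))
  # Row indices grouped by distinct pattern
  groups <- split(seq_len(nrow(patterns)), key)
  # Draw up to nEx rows per pattern, capped by the number of rows available
  picked <- lapply(groups, function(idx) {
    idx[sample.int(length(idx), min(nEx, length(idx)))]
  })
  patterns[sort(unlist(picked, use.names = FALSE)), , drop = FALSE]
}

# Made-up comparison patterns: four distinct patterns, one of them occurring
# three times (rows 1, 2 and 5)
toy <- data.frame(
  fname = c(1, 1, 0, 0, 1, 1),
  lname = c(1, 1, 0, 1, 1, 0),
  by    = c(1, 1, 0, 0, 1, 1)
)

set.seed(1)
train1 <- sample_per_pattern(toy, nEx = 1)  # one row per distinct pattern
train2 <- sample_per_pattern(toy, nEx = 2)  # up to two rows per pattern
```

With nEx = 1 the sketch keeps one row for each distinct pattern. With nEx = 2 only the pattern that occurs three times contributes a second row, because every other pattern has a single record pair to draw from.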

See Also

editMatch for manually setting the matching status of the training pairs.

Examples

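library(RecordLinkage)

data(RLdata500)
# Build comparison patterns, blocking on field 1 or on field 3
p <- compare.dedup(RLdata500, blockfld = list(1, 3), identity = identity.RLdata500)
# Keep one training example per distinct comparison pattern
train <- getMinimalTrain(p)
# Train a bagging classifier on the minimal set, then classify all pairs
classif <- trainSupv(train, method = "bagging")
summary(classifySupv(classif, newdata = p))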
