Our internal research indicates that, in record linkage with supervised
classification procedures, small training sets are often sufficient,
provided they cover the whole range of comparison patterns present in the data.
By default, this function creates a minimal training set that is
a subset of the record pairs to be classified in which every present
comparison pattern is represented by exactly one training example.
This approach minimizes the clerical review effort needed to classify
the training set while maintaining good classification
performance.
Larger training sets can be obtained by setting nEx to a higher number.
Up to nEx examples for every comparison pattern are randomly selected,
limited by the total number of record pairs with that pattern.
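
The selection rule described above can be illustrated with a short sketch. This is not the package's implementation. It is a minimal Python illustration, assuming each record pair is represented by a tuple of comparison values; the names minimal_train, patterns, n_ex and seed are hypothetical and chosen only for this example.

```python
import random
from collections import defaultdict


def minimal_train(patterns, n_ex=1, seed=None):
    """Return sorted indices of sampled record pairs.

    patterns: sequence of comparison patterns, one per record pair.
    n_ex:     maximum number of examples to draw per distinct pattern
              (plays the role of nEx in the text; hypothetical name).
    """
    rng = random.Random(seed)

    # Group record pair indices by their comparison pattern.
    groups = defaultdict(list)
    for idx, pattern in enumerate(patterns):
        groups[tuple(pattern)].append(idx)

    selected = []
    for indices in groups.values():
        # Draw at most n_ex examples, capped by how many pairs
        # actually show this pattern.
        k = min(n_ex, len(indices))
        selected.extend(rng.sample(indices, k))
    return sorted(selected)


# Six record pairs with three distinct comparison patterns.
pairs = [(1, 1, 0), (1, 1, 0), (0, 0, 0), (1, 0, 1), (0, 0, 0), (1, 1, 0)]

train_default = minimal_train(pairs, n_ex=1, seed=42)  # one pair per pattern
train_larger = minimal_train(pairs, n_ex=2, seed=42)   # up to two per pattern
```

With n_ex=1 the result contains exactly one index per distinct pattern. With n_ex=2 the pattern (1, 0, 1) still contributes only one index, because only one record pair shows it.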