RecordLinkage (version 0.2-0)

genSamples: Generate Training Set

Description

Generates training data by unsupervised classification.

Usage

genSamples(dataset, num.non, des.mprop = 0.1)

Arguments

dataset
Object of class RecLinkData. Data pairs from which to sample.
num.non
Positive integer. Number of desired non-links in the training set.
des.mprop
Real number between 0 and 1. Ratio of number of links to number of non-links in the training set.

Value

A list of RecLinkResult objects:

  • train: The sampled training data.
  • valid: All other record pairs.

Record pairs are split into the respective pairs components. The prediction components represent the clustering result. If weights are present in dataset, the corresponding fractions of Wdata are stored to train and valid. All other components are copied from dataset.
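As an illustration of the return value, the following sketch generates a training set from the RLdata500 example data shipped with the package and inspects both list elements. The blocking fields and the value of num.non are arbitrary choices for demonstration; the requireNamespace guard is only there so the snippet is skipped when the package is not installed. Because clustering and sampling are randomized, the counts will differ between runs.

```r
# Sketch only: run the example if RecordLinkage is installed.
if (requireNamespace("RecordLinkage", quietly = TRUE)) {
  library(RecordLinkage)

  # Example data shipped with the package
  data(RLdata500)

  # Build comparison patterns; the blocking fields are an arbitrary choice
  rpairs <- compare.dedup(RLdata500, blockfld = list(1, 3, 5, 6, 7))

  # Request 200 non-links, keeping the default des.mprop = 0.1
  samples <- genSamples(rpairs, num.non = 200)

  # Sizes of the sampled training set and of the remaining pairs
  print(nrow(samples$train$pairs))
  print(nrow(samples$valid$pairs))

  # The prediction component holds the clustering result
  print(table(samples$train$prediction))
}
```

The train element can then serve as training data for a supervised classifier, while valid holds the pairs that remain to be classified.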

Details

The application of supervised classifiers (via classifySupv) requires a sufficiently large training set of record pairs with known matching status. Where no such data are available, genSamples can be used to generate training data. The linkage status is determined by unsupervised clustering with bclust (from package e1071), and the desired numbers of links and non-links are then sampled from the clustered pairs. If the requested number of links or non-links is not feasible, a warning is issued and the maximum possible number is used instead.
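The fallback behaviour described above (warn, then use as many pairs as are available) can be sketched in base R. This is not the package's implementation: sample_capped is a hypothetical helper, the logical vector stands in for a real clustering result, and the number of requested links is given directly rather than derived from des.mprop.

```r
# Hypothetical helper, not part of RecordLinkage: draw up to `n` elements from
# `candidates`, warning and capping when fewer than `n` are available.
sample_capped <- function(candidates, n, label) {
  if (n > length(candidates)) {
    warning(sprintf("Only %d %s available; requested %d.",
                    length(candidates), label, n))
    n <- length(candidates)
  }
  # Sample positions rather than values, so a length-one vector is handled safely
  candidates[sample.int(length(candidates), n)]
}

set.seed(1)

# Toy stand-in for a clustering result: TRUE = pair predicted as link
predicted_link <- c(rep(TRUE, 5), rep(FALSE, 95))

num_non   <- 50  # requested non-links
num_links <- 10  # requested links, more than the 5 available

non_idx  <- sample_capped(which(!predicted_link), num_non, "non-links")
link_idx <- sample_capped(which(predicted_link), num_links, "links")  # warns

# Training set: sampled pairs; validation set: all other pairs
train_idx <- c(link_idx, non_idx)
valid_idx <- setdiff(seq_along(predicted_link), train_idx)
```

Here only five pairs are predicted as links, so the request for ten triggers a warning and all five are used; the pairs not drawn for training form the validation set.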

See Also

splitData for splitting data sets without clustering.