splitData: Split Data

Description

Splits a data set into two sets with desired proportions.

Usage

splitData(dataset, prop, keep.mprop = FALSE, num.non = 0, des.mprop = 0, 
use.pred = FALSE)

Arguments

dataset

Object of class RecLinkData. Data pairs to split.

prop

Real number between 0 and 1. Proportion of data pairs to form the training set.

keep.mprop

Logical. Whether the ratio of matches should be retained.

num.non

Positive integer. Desired number of non-matches in the training set.

des.mprop

Real number between 0 and 1. Desired proportion of matches to non-matches in the training set.

use.pred

Logical. Whether to apply match ratio to previous classification results instead of true matching status.

Value

A list of RecLinkData objects.

train

The sampled training data.

valid

All other record pairs.

The sampled data are stored in the pairs attributes of train and valid. If present, the attributes prediction and Wdata are split and the corresponding values saved. All other attributes are copied to both data sets. If the number of desired matches or non-matches is higher than the number actually present in the data, the maximum possible number is chosen and a warning is issued.

Examples

data(RLdata500)
pairs=compare.dedup(RLdata500, identity=identity.RLdata500, 
  blockfld=list(1,3,5,6,7))

# split into halves, do not enforce match ratio
l=splitData(pairs, prop=0.5)
summary(l$train)
summary(l$valid)

# split into 1/3 and 2/3, retain match ratio
l=splitData(pairs, prop=1/3, keep.mprop=TRUE)
summary(l$train)
summary(l$valid)

# generate a training set with 100 non-matches and 10 matches
l=splitData(pairs, num.non=100, des.mprop=0.1, keep.mprop=TRUE)
summary(l$train)
summary(l$valid)
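
Because the sampled record pairs are stored in the pairs component of train and valid, the result of a split can also be checked directly rather than through summary(). The following is a minimal sketch, not part of the original examples. It assumes the RecordLinkage package is installed, that pairs is a data frame with an is_match column (as produced by compare.dedup when identity is supplied), and it calls set.seed only so that repeated runs give the same split if the sampling is random.

```r
library(RecordLinkage)  # assumption: package is installed

data(RLdata500)
pairs <- compare.dedup(RLdata500, identity = identity.RLdata500,
                       blockfld = list(1, 3, 5, 6, 7))

# fix the random seed so that repeated runs produce the same split
set.seed(1)
l <- splitData(pairs, prop = 0.5)

# number of record pairs before and after the split
n_total <- nrow(pairs$pairs)
n_train <- nrow(l$train$pairs)
n_valid <- nrow(l$valid$pairs)
c(total = n_total, train = n_train, valid = n_valid)

# matching status counts in each part
# (is_match is assumed to hold the true matching status)
table(l$train$pairs$is_match)
table(l$valid$pairs$is_match)
```

Comparing the two tables shows how far an unconstrained split (keep.mprop = FALSE) lets the match ratio drift between the training and validation sets.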
