RecordLinkage (version 0.3-2)

trainSupv: Train a Classifier

Description

Trains a classifier for supervised classification of record pairs.

Usage

trainSupv(rpairs, method, use.pred = FALSE, omit.possible = TRUE, 
  convert.na = TRUE, include.data = FALSE, ...)

Arguments

rpairs
Object of class RecLinkData. Training data.
method
A character vector. The classification method to use.
use.pred
Logical. Whether to use results of an unsupervised classification instead of true matching status.
omit.possible
Logical. Whether to remove pairs labeled as possible links or with unknown status.
convert.na
Logical. Whether to convert NAs to 0 in the comparison patterns.
include.data
Logical. Whether to include training data in the result object.
...
Further arguments to the training method.

Value

An object of class RecLinkClassif with the following components:

  • train: If include.data is TRUE, a copy of rpairs; otherwise an empty data frame with the same column names.
  • model: The model returned by the underlying training function.
  • method: A copy of the argument method.

Details

The given dataset is used as training data for a supervised classification. Either the true matching status has to be known for a sufficient number of data pairs or the data must have been classified previously, e.g. by using emClassify or classifyUnsup. In the latter case, argument use.pred has to be set to TRUE. A classifying method has to be provided as a character string (factors are converted to character) through argument method. The supported classifiers are:

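  • "svm": Support vector machine, see svm.
  • "rpart": Recursive partitioning tree, see rpart.
  • "ada": Stochastic boosting model, see ada.
  • "bagging": Bagging with classification trees, see bagging.
  • "nnet": Single-hidden-layer neural network, see nnet.

Arguments in ... are passed to the corresponding function.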

Most classifiers cannot handle NAs in the data, so by default these are converted to 0 before training. With omit.possible = TRUE, possible links and pairs with unknown status are excluded from the training set. Setting this argument to FALSE allows three-class classification (links, non-links and possible links), but the results tend to be poor. Leaving include.data = FALSE saves memory; setting it to TRUE can be useful for saving the classifier while keeping track of the underlying training data.
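The use.pred workflow described above can be sketched as follows. This is a minimal illustration, not a recommended pipeline: it assumes the RLdata500 example data shipped with the package, builds comparison patterns without the true identity, and uses k-means via classifyUnsup as the preceding unsupervised step. The variable names rpairs, unsup and model are chosen here for illustration only.

```r
library(RecordLinkage)

data(RLdata500)

# Build comparison patterns without passing the true identity, so the
# matching status is unknown and must come from a prior classification.
rpairs <- compare.dedup(RLdata500, blockfld = list(1, 3, 5, 6, 7))

# Unsupervised classification by k-means clustering; the result
# carries a prediction for every record pair.
unsup <- classifyUnsup(rpairs, method = "kmeans")

# Train on the predicted status instead of the (unknown) true status.
model <- trainSupv(unsup, method = "rpart", use.pred = TRUE)

class(model)
```

Because the labels come from a clustering step, the quality of the trained model depends on how well that step separates links from non-links.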

See Also

classifySupv for classifying with the trained model, classifyUnsup for unsupervised classification

Examples

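# Train an rpart decision tree with the additional parameter minsplit
library(RecordLinkage)
data(RLdata500)
# Build comparison patterns, blocking on each of the listed fields
pairs <- compare.dedup(RLdata500, identity = identity.RLdata500,
                       blockfld = list(1, 3, 5, 6, 7))
# minsplit is passed on to the underlying training function via ...
model <- trainSupv(pairs, method = "rpart", minsplit = 5)
summary(model)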
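
As noted under See Also, a trained model is applied with classifySupv. The following sketch repeats the training step and then classifies the same record pairs. Reusing the training pairs as newdata is only for brevity and says nothing about how the model performs on unseen data; the variable names are illustrative.

```r
library(RecordLinkage)

data(RLdata500)
rpairs <- compare.dedup(RLdata500, identity = identity.RLdata500,
                        blockfld = list(1, 3, 5, 6, 7))

# Train a decision tree on pairs with known matching status
model <- trainSupv(rpairs, method = "rpart", minsplit = 5)

# Apply the trained model; here the training pairs double as newdata
result <- classifySupv(model, newdata = rpairs)

# Count the predicted classes
table(result$prediction)
```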
