errorest_cv: Calculates the Cross-Validation Error Rate for a specified classifier given a data set.

Description

For a given data matrix and its corresponding vector of labels, we calculate the cross-validation (CV) error rate for a given classifier.

Usage

errorest_cv(x, y, train, classify, num_folds = 10,
    hold_out = NULL, ...)

Arguments

x

a matrix of n observations (rows) and p features (columns)

y

a vector of n class labels

train

a function that builds the classifier. See Details.

classify

a function that classifies observations using the classifier built by train. See Details.

num_folds

the number of cross-validation folds. Ignored if hold_out is not NULL. See Details.

hold_out

the hold-out size for cross-validation. See Details.

...

additional arguments passed to the function specified in train.

Value

the calculated CV error-rate estimate

Details

To calculate the CV error rate, we partition the data set into 'folds'. For each fold, we consider the observations within the fold as a test data set, while the remaining observations are considered as a training data set. We then calculate the number of misclassified observations within the fold. The CV error rate is the proportion of misclassified observations across all folds.
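
The procedure described above can be sketched in a few lines of base R. This is a minimal illustration and not the function's actual implementation: the helper name cv_error_sketch, the random fold assignment, and the toy majority-class classifier are all assumptions made to keep the example self-contained.

```r
# Hypothetical helper (not part of the documented API): k-fold CV error rate.
cv_error_sketch <- function(x, y, train, classify, num_folds = 10) {
  n <- nrow(x)
  # Assign each observation to one of num_folds folds at random.
  fold_id <- sample(rep(seq_len(num_folds), length.out = n))
  misclassified <- 0
  for (k in seq_len(num_folds)) {
    test_idx <- which(fold_id == k)
    train_idx <- which(fold_id != k)
    # Train on the observations outside the fold, then classify the fold.
    fit <- train(x[train_idx, , drop = FALSE], y[train_idx])
    pred <- classify(fit, x[test_idx, , drop = FALSE])
    misclassified <- misclassified +
      sum(as.character(pred) != as.character(y[test_idx]))
  }
  # Proportion of misclassified observations across all folds.
  misclassified / n
}

# Toy classifier assumed for illustration: always predict the most
# frequent label seen in the training data.
majority_train <- function(x, y) names(which.max(table(y)))
majority_classify <- function(object, newdata) rep(object, nrow(newdata))

set.seed(1)
x <- matrix(rnorm(40), nrow = 20)
y <- factor(rep(c("a", "b"), times = c(15, 5)))
err <- cv_error_sketch(x, y, majority_train, majority_classify, num_folds = 5)
```

Because label "a" is always the training majority in this toy data, every "b" observation is misclassified exactly once, so err equals 5/20 regardless of how the folds are drawn.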

Rather than partitioning the observations into folds, an alternative convention is to specify the 'hold-out' size for each test data set. Note that this convention is equivalent to the notion of folds. We allow the user to specify either option with the hold_out and num_folds arguments. The num_folds argument is the default option but is ignored if the hold_out argument is specified (i.e. is not NULL).
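
To illustrate the equivalence, a hold-out size determines a fold count and vice versa. The sketch below assumes the conversion num_folds = ceiling(n / hold_out); both that rule and the helper name are illustrative assumptions, not a statement of how errorest_cv resolves the two arguments internally.

```r
# Hypothetical helper: number of folds implied by a given hold-out size.
folds_from_hold_out <- function(n, hold_out) {
  ceiling(n / hold_out)
}

n <- 150
folds_15 <- folds_from_hold_out(n, hold_out = 15)  # 10 folds of 15 observations
folds_loo <- folds_from_hold_out(n, hold_out = 1)  # 150 folds: leave-one-out CV
```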

For the given classifier, two functions must be provided: one to train the classifier and one to classify unlabeled observations. The training function is provided as train and the classification function as classify.

We expect that the first two arguments of the train function are x and y, corresponding to the data matrix and the vector of their labels, respectively. Additional arguments can be passed to the train function.

We follow the usual R convention for the classify function. We expect that this function takes two arguments: 1. an object argument, which contains the trained classifier returned from the function specified in train; and 2. a newdata argument, which contains a matrix of observations to be classified. The matrix should have rows corresponding to the individual observations and columns corresponding to the features (covariates). For an example, see lda.
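
As an illustration of these conventions, here is a minimal sketch of a train/classify pair for a nearest-centroid classifier. The function names, the center_fun argument, and the "centroid_sketch" class are hypothetical and chosen only for this example; they are not part of any package.

```r
# Hypothetical train function: x and y come first, extra arguments follow.
centroid_train <- function(x, y, center_fun = mean) {
  y <- factor(y)
  # One row of per-feature centers for each class (K x p matrix).
  centers <- do.call(rbind, lapply(levels(y), function(cl) {
    apply(x[y == cl, , drop = FALSE], 2, center_fun)
  }))
  structure(list(centers = centers, classes = levels(y)),
            class = "centroid_sketch")
}

# Hypothetical classify function: takes the trained object and a newdata
# matrix with observations in rows and features in columns.
centroid_classify <- function(object, newdata) {
  # Squared Euclidean distance from each row of newdata to each class center.
  dists <- apply(object$centers, 1, function(ctr) {
    rowSums(sweep(newdata, 2, ctr)^2)
  })
  dists <- matrix(dists, nrow = nrow(newdata))
  # Pick the class whose center is closest.
  nearest <- max.col(-dists, ties.method = "first")
  factor(object$classes[nearest], levels = object$classes)
}

iris_x <- data.matrix(iris[, -5])
iris_y <- iris[, 5]
fit <- centroid_train(iris_x, iris_y, center_fun = median)
pred <- centroid_classify(fit, iris_x)
```

Under the argument conventions described above, such a pair could presumably be supplied as train = centroid_train and classify = centroid_classify, with center_fun = median forwarded to the training function through the ... argument.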

Examples

library(MASS)
iris_x <- data.matrix(iris[, -5])
iris_y <- iris[, 5]

# Because predict() on an lda fit returns multiple objects in a list,
# we provide a wrapper function that returns only the class labels.
lda_wrapper <- function(object, newdata) { predict(object, newdata)$class }

set.seed(42)
errorest_cv(x = iris_x, y = iris_y, train = MASS::lda, classify = lda_wrapper)
# Output: 0.02666667