To calculate the BCV error rate, we sample from the data with replacement to obtain a bootstrapped training data set. We then compute a cross-validation error rate on the bootstrapped training data set with the classifier specified in train. We repeat this process num_bootstraps times to obtain a set of bootstrapped cross-validation error rates and report their average. The errorest_cv function is used to compute the cross-validation (CV) error rate estimator for each bootstrap iteration. Fu et al. (2005) note that the BCV method works well because it effectively applies bagging to the cross-validation error rate estimator.
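The following is a minimal, self-contained sketch of this procedure. The names cv_error_rate and bcv_error_rate are placeholders (the former standing in for errorest_cv); they are illustrative only and do not reproduce the package's internals.

# Placeholder for errorest_cv: computes a num_folds cross-validation error
# rate for a given train/classify pair on the data set (x, y).
cv_error_rate <- function(x, y, train, classify, num_folds = 10) {
  n <- nrow(x)
  folds <- split(sample(seq_len(n)), rep(seq_len(num_folds), length.out = n))
  fold_errors <- vapply(folds, function(test_idx) {
    fit <- train(x[-test_idx, , drop = FALSE], y[-test_idx])
    pred <- classify(fit, newdata = x[test_idx, , drop = FALSE])
    mean(as.character(pred) != as.character(y[test_idx]))
  }, numeric(1))
  mean(fold_errors)
}

# BCV: average the CV error rates computed on bootstrapped training sets.
bcv_error_rate <- function(x, y, train, classify,
                           num_bootstraps = 50, num_folds = 10) {
  n <- nrow(x)
  boot_errors <- vapply(seq_len(num_bootstraps), function(b) {
    # Sample with replacement to obtain a bootstrapped training data set.
    idx <- sample(seq_len(n), size = n, replace = TRUE)
    cv_error_rate(x[idx, , drop = FALSE], y[idx], train, classify, num_folds)
  }, numeric(1))
  mean(boot_errors)
}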
Furthermore, consider the leave-one-out (LOO) error rate estimator. For small sample sizes, the data are sparse, so the left-out observation has a high probability of being far from the remaining training observations. Hence, the LOO error rate estimator yields a large variance for small data sets.
Rather than partitioning the observations into folds, an alternative convention is to specify the 'hold-out' size for each test data set. Note that this convention is equivalent to the notion of folds: for example, a hold-out size of 1 corresponds to leave-one-out cross-validation, i.e. a number of folds equal to the sample size. We allow the user to specify either option with the hold_out and num_folds arguments. The num_folds argument is the default option but is ignored if the hold_out argument is specified (i.e. is not NULL).
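As an illustration of the correspondence, a hold-out size of m on n observations corresponds to roughly ceiling(n / m) folds. The wrapper below is hypothetical and builds on the cv_error_rate placeholder sketched earlier; it only demonstrates the relationship between the two conventions.

# Hypothetical wrapper: translate a hold-out size into an equivalent
# number of folds, giving precedence to hold_out when it is not NULL.
cv_error_rate_holdout <- function(x, y, train, classify,
                                  hold_out = NULL, num_folds = 10) {
  if (!is.null(hold_out)) {
    num_folds <- ceiling(nrow(x) / hold_out)
  }
  cv_error_rate(x, y, train, classify, num_folds = num_folds)
}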
We expect that the first two arguments of the classifier function given in train are x and y, corresponding to the data matrix and the vector of their labels. Additional arguments can be passed to the train function. The returned object should be a classifier that will be passed to the function given in the classify argument.
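For example, a train function following this convention might wrap lda from the MASS package; this wrapper is an illustrative sketch, not part of the package.

library(MASS)

# Illustrative train function: the first two arguments are x and y, any
# additional arguments are forwarded to lda, and the fitted classifier
# is returned.
lda_train <- function(x, y, ...) {
  lda(x, grouping = y, ...)
}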
We stay with the usual R convention for the classify function. We expect that this function takes two arguments: 1. an object argument, which contains the trained classifier returned from the function specified in train; and 2. a newdata argument, which contains a matrix of observations to be classified. The matrix should have rows corresponding to the individual observations and columns corresponding to the features (covariates).
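Continuing the LDA illustration, a matching classify function could wrap predict.lda and return the vector of predicted labels; again, this is a sketch rather than the package's code.

# Illustrative classify function: takes the trained object and a newdata
# matrix (rows are observations, columns are features) and returns the
# predicted class labels.
lda_classify <- function(object, newdata) {
  predict(object, newdata = newdata)$class
}

With these two wrappers, the BCV sketch given earlier could be invoked as bcv_error_rate(as.matrix(iris[, 1:4]), iris$Species, train = lda_train, classify = lda_classify).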