cvTool: Low-level function for cross-validation

Description

Basic function to estimate the prediction error of a model via (repeated) $K$-fold cross-validation. The model is thereby specified by an unevaluated function call to a model fitting function.

Usage

cvTool(call, data = NULL, x = NULL, y, cost = rmspe,
     folds, names = NULL, predictArgs = list(),
     costArgs = list(), envir = parent.frame())

Arguments

call

an unevaluated function call for fitting a model (see call).

data

a data frame containing the variables required for fitting the models. This is typically used if the model in the function call is described by a formula.

x

a numeric matrix containing the predictor variables. This is typically used if the function call for fitting the models requires the predictor matrix and the response to be supplied as separate arguments.

y

a numeric vector or matrix containing the response.

cost

a cost function measuring prediction loss. It should expect the observed values of the response to be passed as the first argument and the predicted values as the second argument, and must return a non-negative scalar value. The default is to use the root mean squared prediction error (rmspe).

folds

an object of class "cvFolds" giving the folds of the data for cross-validation (as returned by cvFolds).

names

an optional character vector giving names for the arguments containing the data to be used in the function call (see Details).

predictArgs

a list of additional arguments to be passed to the predict method of the fitted models.

costArgs

a list of additional arguments to be passed to the prediction loss function cost.

envir

the environment in which to evaluate the function call for fitting the models (see eval).
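
As a minimal sketch of what an "unevaluated function call" and envir mean in practice, the following base R code builds a call to lm without running it and evaluates it later in a chosen environment. The toy data frame d, the environment env and the choice of lm are illustrative assumptions and are not part of cvTool.

```r
# Hypothetical toy data stored in its own environment
env <- new.env()
assign("d", data.frame(x = 1:10, y = 2 * (1:10) + 1), envir = env)

# Build the call without evaluating it; `d` is only a symbol at this point
fitCall <- call("lm", formula = y ~ x, data = quote(d))
is.call(fitCall)   # the model has not been fitted yet

# Evaluate the call in `env`, where the symbol `d` can be resolved
fit <- eval(fitCall, envir = env)
coef(fit)
```

Because the call is only a language object until eval is invoked, the data argument can be swapped for a different subset before each evaluation, which is what makes this representation convenient for cross-validation.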

Value

A numeric matrix in which each column contains the respective estimated prediction errors from all replications.

Details

(Repeated) $K$-fold cross-validation is performed in the following way. The data are split into the $K$ previously obtained blocks of approximately equal size (given by folds). Each of the $K$ data blocks is left out once, the model is fitted to the remaining data, and predictions are computed for the observations in the left-out block with the predict method of the fitted model. Thus a prediction is obtained for each observation.

The response variable and the obtained predictions for all observations are then passed to the prediction loss function cost to estimate the prediction error. For repeated cross-validation (as indicated by folds), this process is replicated and the estimated prediction errors from all replications are returned.
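
The procedure described above can be sketched in base R as follows. This is not the implementation used by cvTool, only an illustration under stated assumptions: the toy data, the use of lm, the helper rmse and the random block assignment are all hypothetical, and the result is a plain vector rather than the matrix that cvTool returns.

```r
set.seed(1)

# Hypothetical toy data
n <- 50
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)

K <- 5   # number of folds
R <- 3   # number of replications

# Cost function: observed values first, predicted values second
rmse <- function(y, yHat) sqrt(mean((y - yHat)^2))

cvErrors <- sapply(seq_len(R), function(r) {
  # Randomly assign each observation to one of K blocks of similar size
  block <- sample(rep(seq_len(K), length.out = n))
  yHat <- numeric(n)
  for (k in seq_len(K)) {
    test <- block == k
    # Fit the model without block k, then predict the left-out block
    fit <- lm(y ~ x, data = d[!test, , drop = FALSE])
    yHat[test] <- predict(fit, newdata = d[test, , drop = FALSE])
  }
  # One prediction per observation, hence one error estimate per replication
  rmse(d$y, yHat)
})
cvErrors
```

Each replication uses a fresh random split, so the R estimates differ slightly; their spread gives a rough idea of how much the estimate depends on the particular partition.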

Furthermore, if the response is a vector but the predict method of the fitted models returns a matrix, the prediction error is computed for each column. A typical use case for this behavior would be if the predict method returns predictions from an initial model fit and stepwise improvements thereof.
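
A minimal sketch of the column-wise behavior, assuming a hypothetical matrix of predictions with one column per model step and a hand-written rmse helper (neither is taken from the package):

```r
# Observed response (vector) and a hypothetical matrix of predictions
y <- c(1, 2, 3, 4)
yHat <- cbind(step1 = c(1, 2, 3, 4),
              step2 = c(2, 3, 4, 5))

# Cost function: observed values first, predicted values second
rmse <- function(y, yHat) sqrt(mean((y - yHat)^2))

# Apply the cost function to each column of predictions separately
errors <- apply(yHat, 2, function(col) rmse(y, col))
errors
```

The result has one entry per column of predictions, which is how a single call can report the prediction error for every step of a sequence of fits.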

If data is supplied, all variables required for fitting the models are added as one argument to the function call, which is the typical behavior of model fitting functions with a formula interface. In this case, a character string specifying the argument name can be passed via names (the default is to use "data").

If x is supplied, on the other hand, the predictor matrix and the response are added as separate arguments to the function call. In this case, names should be a character vector of length two, with the first element specifying the argument name for the predictor matrix and the second element specifying the argument name for the response (the default is to use c("x", "y")). It should be noted that data takes precedence over x if both are supplied.
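
The two ways of adding data to the call can be sketched as below. The helper addData is hypothetical and is not the package's internal code; it only mimics the documented rules (data takes precedence over x, with default argument names "data" and c("x", "y")). The toy data and the use of lm and lm.fit are likewise assumptions for illustration.

```r
# Hypothetical helper that mimics the documented behavior
addData <- function(fitCall, data = NULL, x = NULL, y = NULL, names = NULL) {
  if (!is.null(data)) {
    # data takes precedence: everything goes into a single argument
    if (is.null(names)) names <- "data"
    fitCall[[names[1]]] <- data
  } else {
    # predictor matrix and response are added as separate arguments
    if (is.null(names)) names <- c("x", "y")
    fitCall[[names[1]]] <- x
    fitCall[[names[2]]] <- y
  }
  fitCall
}

d <- data.frame(x = 1:10, y = 2 * (1:10) + 1)

# Formula interface: the data frame becomes the argument "data"
c1 <- addData(call("lm", formula = y ~ x), data = d)
names(c1)

# Separate predictor matrix and response, here for lm.fit(x, y)
c2 <- addData(call("lm.fit"), x = cbind(1, d$x), y = d$y)
fit2 <- eval(c2)
fit2$coefficients
```

If a fitting function used other argument names, for instance a hypothetical fitter with arguments X and Y, passing names = c("X", "Y") would be the corresponding adjustment.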

Examples


library("cvTools")     # provides cvTool, cvFolds and rtmspe
library("robustbase")  # provides lmrob and the coleman data
data("coleman")
set.seed(1234)  # set seed for reproducibility

# set up function call for an MM regression model
call <- call("lmrob", formula = Y ~ .)
# set up folds for cross-validation
folds <- cvFolds(nrow(coleman), K = 5, R = 10)

# perform cross-validation
cvTool(call, data = coleman, y = coleman$Y, cost = rtmspe,
    folds = folds, costArgs = list(trim = 0.1))
