cvTuning: Cross-validation for tuning parameter selection

Description

Select tuning parameters of a model by estimating the respective prediction errors via (repeated) $K$-fold cross-validation. Either a model fitting function or an unevaluated call to a model fitting function can be supplied.

Usage

cvTuning(object, ...)

## S3 method for class 'function':
cvTuning(object, formula, data = NULL, x = NULL, y,
     tuning = list(), args = list(), cost = rmspe,
     K = 5, R = 1,
     foldType = c("random", "consecutive", "interleaved"),
     folds = NULL, names = NULL, predictArgs = list(),
     costArgs = list(), envir = parent.frame(),
     seed = NULL, ...)

## S3 method for class 'call':
cvTuning(object, data = NULL, x = NULL, y,
     tuning = list(), cost = rmspe, K = 5, R = 1,
     foldType = c("random", "consecutive", "interleaved"),
     folds = NULL, names = NULL, predictArgs = list(),
     costArgs = list(), envir = parent.frame(),
     seed = NULL, ...)

Arguments

object

a function or an unevaluated function call for fitting a model (see call for the latter).

formula

a formula describing the model.

data

a data frame containing the variables required for fitting the models. This is typically used if the model in the function call is described by a formula.

x

a numeric matrix containing the predictor variables. This is typically used if the function call for fitting the models requires the predictor matrix and the response to be supplied as separate arguments.

y

a numeric vector or matrix containing the response.

tuning

a list of arguments giving the tuning parameter values to be evaluated. The names of the list components should thereby correspond to the argument names of the tuning parameters. For each tuning parameter, a vector of values can be supplied. C

args

a list of additional arguments to be passed to the model fitting function.

cost

a cost function measuring prediction loss. It should expect the observed values of the response to be passed as the first argument and the predicted values as the second argument, and must return a non-negative scalar value. The default is to use

K

an integer giving the number of groups into which the data should be split (the default is five). Keep in mind that this should be chosen such that all groups are of approximately equal size. Setting K equal to n yields

R

an integer giving the number of replications for repeated $K$-fold cross-validation. This is ignored for leave-one-out cross-validation and other non-random splits of the data.

foldType

a character string specifying the type of folds to be generated. Possible values are "random" (the default), "consecutive" or "interleaved".

folds

an object of class "cvFolds" giving the folds of the data for cross-validation (as returned by cvFolds). If supplied, this is preferred over K and R.

names

an optional character vector giving names for the arguments containing the data to be used in the function call (see Details).

predictArgs

a list of additional arguments to be passed to the predict method of the fitted models.

costArgs

a list of additional arguments to be passed to the prediction loss function cost.

envir

the environment in which to evaluate the function call for fitting the models (see eval).

seed

optional initial seed for the random number generator (see .Random.seed).

...

additional arguments to be passed down.
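
As a small illustration of the interface described for cost above, the following sketch defines a hand-written loss function that takes the observed values first and the predictions second and returns a non-negative scalar. The name maeCost and the toy vectors are hypothetical and only serve this example.

```r
# Hypothetical cost function: mean absolute prediction error.
# Observed values come first, predicted values second.
maeCost <- function(y, yHat) mean(abs(y - yHat))

# Toy data (made up for illustration)
y <- c(1, 2, 3, 4)
yHat <- c(1.5, 2, 2, 4)

# A single non-negative number is returned
maeCost(y, yHat)
```

A function of this shape could then be passed as the cost argument, with any further arguments it needs supplied through costArgs.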

Value

If tuning is an empty list, cvFit is called to return an object of class "cv".
Otherwise an object of class "cvTuning" (which inherits from class "cvSelect") with the following components is returned:
n

an integer giving the number of observations.

K

an integer giving the number of folds.

R

an integer giving the number of replications.

tuning

a data frame containing the grid of tuning parameter values for which the prediction error was estimated.

best

an integer vector giving the indices of the optimal combinations of tuning parameters.

cv

a data frame containing the estimated prediction errors for all combinations of tuning parameter values. For repeated cross-validation, those are average values over all replications.

reps

a data frame containing the estimated prediction errors from all replications for all combinations of tuning parameter values. This is only returned for repeated cross-validation.

seed

the seed of the random number generator before cross-validation was performed.

call

the matched function call.
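
To give a feeling for what a grid of tuning parameter values looks like, the following base R sketch crosses two vectors of candidate values with expand.grid. It assumes that the grid consists of all combinations of the supplied values. The parameter names lambda and alpha are hypothetical, and the code is not taken from the package itself.

```r
# Hypothetical tuning parameters 'lambda' and 'alpha' (illustration only)
tuning <- list(lambda = c(0.1, 1, 10), alpha = c(0, 1))

# Cross all candidate values: one row per combination
grid <- expand.grid(tuning)
grid

# Number of combinations for which an error would be estimated
nrow(grid)
```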

Details

(Repeated) $K$-fold cross-validation is performed in the following way. The data are first split into $K$ previously obtained blocks of approximately equal size. Each of the $K$ data blocks is left out once to fit the model, and predictions are computed for the observations in the left-out block with the predict method of the fitted model. Thus a prediction is obtained for each observation.
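
The splitting step can be pictured with a few lines of base R. This is only a sketch of what the three fold types commonly mean, under the assumption that each observation receives a fold label between 1 and $K$; it is not the package's own fold-generating code.

```r
n <- 10  # number of observations (toy value)
K <- 3   # number of folds

# "interleaved": observations are dealt out to the folds in turn
interleaved <- rep(seq_len(K), length.out = n)

# "consecutive": the same fold sizes, but as contiguous blocks
consecutive <- sort(interleaved)

# "random": the fold labels are randomly permuted
set.seed(1234)
random <- sample(interleaved)

# All three variants give folds of approximately equal size
table(interleaved)
table(consecutive)
table(random)
```

Only the random variant changes from one replication to the next, which is why replications are meaningful for random splits but not for the other two.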

The response variable and the obtained predictions for all observations are then passed to the prediction loss function cost to estimate the prediction error. For repeated cross-validation, this process is replicated and the estimated prediction errors from all replications as well as their average are included in the returned object.
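
The procedure described so far can be sketched by hand in base R. The example below is a simplified illustration for a single replication, using lm and the built-in mtcars data as stand-ins; the model formula, the variable names and the hand-computed root mean squared prediction error are assumptions made for this sketch and do not reproduce the package's implementation.

```r
set.seed(1234)
data <- mtcars
n <- nrow(data)
K <- 5

# Randomly assign each observation to one of K folds
folds <- sample(rep(seq_len(K), length.out = n))

# Leave out each block once, fit on the rest, predict the left-out block
yHat <- numeric(n)
for (k in seq_len(K)) {
  test <- folds == k
  fit <- lm(mpg ~ wt + hp, data = data[!test, ])
  yHat[test] <- predict(fit, newdata = data[test, ])
}

# Pass response and predictions to a loss function
# (here: root mean squared prediction error, computed by hand)
cvError <- sqrt(mean((data$mpg - yHat)^2))
cvError
```

Repeating the fold assignment with different random splits and averaging the resulting errors would correspond to repeated cross-validation.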

Furthermore, if the response is a vector but the predict method of the fitted models returns a matrix, the prediction error is computed for each column. A typical use case for this behavior would be if the predict method returns predictions from an initial model fit and stepwise improvements thereof.
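
The column-wise behavior can be illustrated with a toy example in base R. The prediction matrix, its column names and the helper rmse are made up for this sketch; it only shows the idea of applying a loss function to each column of predictions in turn.

```r
# Toy response and a matrix of predictions with one column per model step
y <- c(1, 2, 3, 4)
yHat <- cbind(initial = c(1, 1, 1, 1), improved = c(1, 2, 3, 5))

# Hypothetical loss function: root mean squared prediction error
rmse <- function(y, yHat) sqrt(mean((y - yHat)^2))

# One prediction error per column of the prediction matrix
errors <- apply(yHat, 2, function(col) rmse(y, col))
errors
```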

If formula or data are supplied, all variables required for fitting the models are added as one argument to the function call, which is the typical behavior of model fitting functions with a formula interface. In this case, the accepted values for names depend on the method. For the function method, a character vector of length two should be supplied, with the first element specifying the argument name for the formula and the second element specifying the argument name for the data (the default is to use c("formula", "data")). Note that names for both arguments should be supplied even if only one is actually used. For the call method, which does not have a formula argument, a character string specifying the argument name for the data should be supplied (the default is to use "data").

If x is supplied, on the other hand, the predictor matrix and the response are added as separate arguments to the function call. In this case, names should be a character vector of length two, with the first element specifying the argument name for the predictor matrix and the second element specifying the argument name for the response (the default is to use c("x", "y")). It should be noted that the formula or data arguments take precedence over x.
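
The idea of adding the predictor matrix and the response under configurable argument names can be sketched with do.call in base R. The example uses lm.fit, whose data arguments happen to be called x and y, purely as a stand-in; the variable argNames is hypothetical, and the package's internal mechanism may differ.

```r
# Predictor matrix (with an intercept column) and response from built-in data
x <- cbind(intercept = 1, wt = mtcars$wt)
y <- mtcars$mpg

# Argument names under which the data are passed to the fitting function
argNames <- c("x", "y")
dataArgs <- setNames(list(x, y), argNames)

# Equivalent to lm.fit(x = x, y = y)
fit <- do.call(lm.fit, dataArgs)
fit$coefficients
```

For a fitting function whose arguments are named differently, only argNames would need to change.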

Examples

library("cvTools")     # provides cvTuning and rtmspe
library("robustbase")  # provides lmrob and the coleman data
data("coleman")

## evaluate MM regression models tuned for 85% and 95% efficiency
tuning <- list(tuning.psi = c(3.443689, 4.685061))

## via model fitting function
# perform cross-validation
# note that the response is extracted from 'data' in 
# this example and does not have to be supplied
cvTuning(lmrob, formula = Y ~ ., data = coleman, tuning = tuning, 
    cost = rtmspe, K = 5, R = 10, costArgs = list(trim = 0.1), 
    seed = 1234)

## via function call
# set up function call
call <- call("lmrob", formula = Y ~ .)
# perform cross-validation
cvTuning(call, data = coleman, y = coleman$Y, tuning = tuning, 
    cost = rtmspe, K = 5, R = 10, costArgs = list(trim = 0.1), 
    seed = 1234)
