cvExamples: Cross-validation for linear models

Description

Estimate the prediction error of a linear model via (repeated) $K$-fold cross-validation. Cross-validation functions are available for least squares fits computed with lm as well as for the following robust alternatives: MM-type models computed with lmrob and least trimmed squares fits computed with ltsReg.

Usage

cvLm(object, cost = rmspe, K = 5, R = 1,
     foldType = c("random", "consecutive", "interleaved"),
     folds = NULL, seed = NULL, ...)

cvLmrob(object, cost = rtmspe, K = 5, R = 1,
        foldType = c("random", "consecutive", "interleaved"),
        folds = NULL, seed = NULL, ...)

cvLts(object, cost = rtmspe, K = 5, R = 1,
      foldType = c("random", "consecutive", "interleaved"),
      folds = NULL, fit = c("reweighted", "raw", "both"),
      seed = NULL, ...)

Arguments

object

for cvLm, an object of class "lm" computed with lm. For cvLmrob, an object of class "lmrob" computed with lmrob. For cvLts, an object of class "lts" computed with ltsReg.

cost

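a cost function measuring prediction loss. It should expect the observed values of the response as the first argument and the predicted values as the second argument, and it must return a non-negative scalar value. The default is to use the root mean squared prediction error (rmspe) for cvLm and the root trimmed mean squared prediction error (rtmspe) for cvLmrob and cvLts.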

K

an integer giving the number of groups into which the data should be split (the default is five). Keep in mind that this should be chosen such that all groups are of approximately equal size. Setting K equal to n yields leave-one-out cross-validation.

R

an integer giving the number of replications for repeated $K$-fold cross-validation. This is ignored for leave-one-out cross-validation and other non-random splits of the data.

foldType

a character string specifying the type of folds to be generated. Possible values are "random" (the default), "consecutive" or "interleaved".

folds

an object of class "cvFolds" giving the folds of the data for cross-validation (as returned by cvFolds). If supplied, this is preferred over K and R.

fit

a character string specifying for which fit to estimate the prediction error. Possible values are "reweighted" (the default) for the prediction error of the reweighted fit, "raw" for the prediction error of the raw fit, or "both" for the prediction error of both fits. This argument applies to cvLts only.

seed

optional initial seed for the random number generator (see .Random.seed).

...

additional arguments to be passed to the prediction loss function cost.
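As an illustration of the interface that cost is described as expecting, here is a minimal sketch of a custom loss in base R. The function name trimmedMae and its trim argument are hypothetical and not part of the package. The only contract assumed is the one stated above: observed values first, predicted values second, a non-negative scalar returned. An extra argument such as trim would reach the function through the ... argument.

```r
# Hypothetical cost function: trimmed mean absolute prediction error.
# y    : observed response values (first argument, as cost expects)
# yHat : predicted values (second argument)
# trim : fraction of the largest absolute errors to discard (illustrative)
trimmedMae <- function(y, yHat, trim = 0.25) {
    absErr <- sort(abs(y - yHat))                 # absolute errors, ascending
    keep <- length(absErr) - floor(length(absErr) * trim)  # number retained
    mean(absErr[seq_len(keep)])                   # non-negative scalar
}

# toy values: the gross error in the last position is trimmed away
y    <- c(1, 2, 3, 4)
yHat <- c(1.5, 2, 2.5, 14)
trimmedMae(y, yHat, trim = 0.25)
```

Under these assumptions, such a function could be supplied as cost = trimmedMae together with trim = 0.1 in a call to cvLm, cvLmrob or cvLts.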

Value

An object of class "cv" with the following components:

n

an integer giving the number of observations.

K

an integer giving the number of folds.

R

an integer giving the number of replications.

cv

a numeric vector containing the estimated prediction errors. For cvLm and cvLmrob, this is a single numeric value. For cvLts, this contains one value for each of the requested fits. In the case of repeated cross-validation, those are average values over all replications.

reps

a numeric matrix containing the estimated prediction errors from all replications. For cvLm and cvLmrob, this is a matrix with one column. For cvLts, this contains one column for each of the requested fits. However, this is only returned for repeated cross-validation.

seed

the seed of the random number generator before cross-validation was performed.

call

the matched function call.

Details

(Repeated) $K$-fold cross-validation is performed in the following way. The data are first split into $K$ blocks of approximately equal size, either generated according to foldType or taken from folds if supplied. Each of the $K$ blocks is left out once: the model is refit to the remaining data, and predictions are computed for the observations in the left-out block with the predict method of the fitted model. Thus a prediction is obtained for each observation.
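The split into $K$ blocks can be pictured with a few lines of base R. This is only a sketch of one plausible reading of the three foldType values, not the package's verified implementation; the variable names and the choice n = 10, K = 3 are illustrative.

```r
n <- 10   # number of observations (illustrative)
K <- 3    # number of folds (illustrative)

# block sizes that differ by at most one observation
sizes <- tabulate(rep(seq_len(K), length.out = n), nbins = K)

# "consecutive": contiguous runs of observations share a fold
consecutive <- rep(seq_len(K), times = sizes)

# "interleaved": fold labels cycle through 1, ..., K
interleaved <- rep(seq_len(K), length.out = n)

# "random": the same block sizes, assigned in random order
set.seed(1)
random <- sample(interleaved)

rbind(consecutive, interleaved, random)
```

Only the random assignment changes from one replication to the next, which is consistent with R being ignored for non-random splits.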

The response variable and the obtained predictions for all observations are then passed to the prediction loss function cost to estimate the prediction error. For repeated cross-validation, this process is replicated and the estimated prediction errors from all replications as well as their average are included in the returned object.
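The procedure described above can be sketched by hand for a least squares fit, using base R only. This is a simplified illustration under stated assumptions, not the package's implementation: cvSketch and rmse are hypothetical names, the mtcars data and the formula are chosen only for demonstration, and the result is a plain list whose element names merely mirror the components listed under Value.

```r
# hypothetical stand-in for a cost function: root mean squared prediction error
rmse <- function(y, yHat) sqrt(mean((y - yHat)^2))

# hypothetical helper: repeated K-fold cross-validation for an lm fit
cvSketch <- function(formula, data, K = 5, R = 1, cost = rmse) {
    n <- nrow(data)
    y <- model.response(model.frame(formula, data))  # observed response
    reps <- vapply(seq_len(R), function(r) {
        # random fold labels with approximately equal block sizes
        fold <- sample(rep(seq_len(K), length.out = n))
        yHat <- numeric(n)
        for (k in seq_len(K)) {
            test <- fold == k
            # refit without block k, then predict the left-out block
            fit <- lm(formula, data = data[!test, , drop = FALSE])
            yHat[test] <- predict(fit, newdata = data[test, , drop = FALSE])
        }
        cost(y, yHat)  # one prediction error estimate per replication
    }, numeric(1))
    list(n = n, K = K, R = R, cv = mean(reps), reps = reps)
}

set.seed(1234)  # set seed for reproducibility
res <- cvSketch(mpg ~ wt + hp, data = mtcars, K = 5, R = 10)
res$cv    # average over the replications
res$reps  # estimates from the individual replications
```

Each replication draws a new random split, so res$reps holds R separate estimates and res$cv is their mean.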

Examples

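library("cvTools")     # provides cvFolds, cvLm, cvLmrob, cvLts and the cost functions
library("robustbase")  # provides lmrob, ltsReg and the coleman data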
data("coleman")
set.seed(1234)  # set seed for reproducibility

# set up folds for cross-validation
folds <- cvFolds(nrow(coleman), K = 5, R = 10)

# perform cross-validation for an LS regression model
fitLm <- lm(Y ~ ., data = coleman)
cvLm(fitLm, cost = rtmspe, folds = folds, trim = 0.1)

# perform cross-validation for an MM regression model
fitLmrob <- lmrob(Y ~ ., data = coleman)
cvLmrob(fitLmrob, cost = rtmspe, folds = folds, trim = 0.1)

# perform cross-validation for an LTS regression model
fitLts <- ltsReg(Y ~ ., data = coleman)
cvLts(fitLts, cost = rtmspe, folds = folds, trim = 0.1)
cvLts(fitLts, cost = rtmspe, folds = folds, 
    fit = "both", trim = 0.1)
