This function calculates the estimated K-fold cross-validation prediction error for generalized linear models.

`cv.glm(data, glmfit, cost, K)`

data

A matrix or data frame containing the data. The rows should be cases and the columns correspond to variables, one of which is the response.

glmfit

An object of class `"glm"` containing the results of a generalized linear model fitted to `data`.

cost

A function of two vector arguments specifying the cost function for the cross-validation. The first argument to `cost` should correspond to the observed responses and the second argument should correspond to the predicted or fitted responses from the generalized linear model. `cost` must return a non-negative scalar value. The default is the average squared error function.
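As an illustration of the `cost` contract described above (observed responses first, predictions second, a non-negative scalar returned), here is a minimal sketch of two alternative cost functions. The names `cost_mae` and `cost_misclass` and the toy vectors are hypothetical and not part of the package.

```r
# Hypothetical cost functions; the names are illustrative only.

# Mean absolute error, suitable for a continuous response.
cost_mae <- function(y, yhat) mean(abs(y - yhat))

# Misclassification rate for a 0/1 response and predicted probabilities,
# classifying at a 0.5 threshold.
cost_misclass <- function(y, p) mean(abs(y - p) > 0.5)

# Toy data: observed 0/1 responses and predicted probabilities.
y <- c(0, 1, 1, 0)
p <- c(0.2, 0.9, 0.4, 0.6)

cost_mae(y, p)
cost_misclass(y, p)
```

Either function could then be supplied as the third argument, for example `cv.glm(data, glmfit, cost = cost_mae)`.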

K

The number of groups into which the data should be split to estimate the cross-validation prediction error. The value of `K` must be such that all groups are of approximately equal size. If the supplied value of `K` does not satisfy this criterion, then it will be set to the closest integer which does and a warning is generated specifying the value of `K` used. The default is to set `K` equal to the number of observations in `data`, which gives the usual leave-one-out cross-validation.
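The "closest integer" rule can be pictured with a small sketch. It assumes that the acceptable values of `K` are the rounded ratios of the number of observations to a whole group size. That is an assumption made for illustration, and the helper name `closest_k` is hypothetical, so the value the package actually uses should be read from the warning or from the returned `K`.

```r
# Hypothetical helper: snap a requested K to a value giving near-equal groups.
# Assumption: acceptable K values are round(n / m) for group sizes m = 1, 2, ...
closest_k <- function(n, K) {
  stopifnot(K > 1, K <= n)
  K <- round(K)
  # Candidate fold counts obtained by splitting n rows into groups of size m.
  candidates <- unique(round(n / seq_len(floor(n / 2))))
  # Keep the candidate nearest to the request (the first one on ties).
  candidates[which.min(abs(candidates - K))]
}

closest_k(32, 5)   # 5 is itself a candidate, so it is kept
closest_k(32, 7)   # 7 is not a candidate, so a neighbouring value is returned
```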

The returned value is a list with the following components.

call

The original call to `cv.glm`.

K

The value of `K` used for the K-fold cross-validation.

delta

A vector of length two. The first component is the raw cross-validation estimate of prediction error. The second component is the adjusted cross-validation estimate. The adjustment is designed to compensate for the bias introduced by not using leave-one-out cross-validation.

seed

The value of `.Random.seed` when `cv.glm` was called.

As a side effect, the value of `.Random.seed` is updated.
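A short sketch of inspecting the returned list and of making the random split reproducible follows. It assumes the boot package is installed and uses the built-in `mtcars` data purely for illustration. The model and the choice `K = 4` are not taken from the original examples.

```r
library(boot)  # cv.glm is provided by the boot package

# Gaussian GLM on a built-in data set (32 rows, so K = 4 gives equal groups).
fit <- glm(mpg ~ wt, data = mtcars)

# Setting the seed before each call fixes the random partition.
set.seed(1)
res1 <- cv.glm(mtcars, fit, K = 4)
set.seed(1)
res2 <- cv.glm(mtcars, fit, K = 4)

names(res1)                        # components of the returned list
res1$K                             # number of groups actually used
res1$delta                         # raw and adjusted estimates
identical(res1$delta, res2$delta)  # same seed, same split, same estimates
```

Because the call consumes random numbers, two calls without resetting the seed will generally use different partitions when `K` is less than the number of observations.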

The data is divided randomly into `K` groups. For each group the generalized linear model is fit to `data` omitting that group, then the function `cost` is applied to the observed responses in the group that was omitted from the fit and the predictions made by the fitted model for those observations.
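The procedure just described can be sketched by hand in base R. This is a simplified illustration rather than the package's implementation: the `mtcars` data, the formula, and the variable names are assumptions, and weighting each group's cost by its share of the data is one reasonable way to combine the per-group values.

```r
set.seed(42)                                   # make the random split reproducible
dat  <- mtcars
K    <- 4
n    <- nrow(dat)
cost <- function(y, yhat) mean((y - yhat)^2)   # average squared error

# Randomly assign each row to one of K groups of near-equal size.
fold <- sample(rep(seq_len(K), length.out = n))

cv_raw <- 0
for (k in seq_len(K)) {
  held_out <- fold == k
  # Refit the model with group k omitted.
  fit_k <- glm(mpg ~ wt, data = dat[!held_out, , drop = FALSE])
  # Predict on the response scale for the omitted rows.
  pred_k <- predict(fit_k, newdata = dat[held_out, , drop = FALSE],
                    type = "response")
  # Assumption: weight each group's cost by its share of the observations.
  cv_raw <- cv_raw + (sum(held_out) / n) * cost(dat$mpg[held_out], pred_k)
}
cv_raw
```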

When `K` is the number of observations, leave-one-out cross-validation is used and all the possible splits of the data are used. When `K` is less than the number of observations, the `K` splits to be used are found by randomly partitioning the data into `K` groups of approximately equal size. In this latter case a certain amount of bias is introduced. This can be reduced by using a simple adjustment (see equation 6.48 in Davison and Hinkley, 1997). The second value returned in `delta` is the estimate adjusted by this method.
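The sketch below shows one common form of such an adjustment: the raw estimate is corrected by the difference between the cost of the full-data fit and the weighted cost of each reduced fit evaluated on all of the data. This is an assumption about the shape of the correction. The function name `kfold_delta` is hypothetical, the Gaussian family is assumed, and Davison and Hinkley (1997) remains the authoritative reference for the exact formula.

```r
# Hypothetical helper returning a raw and an adjusted K-fold estimate.
# Assumes a Gaussian GLM and data without missing values.
kfold_delta <- function(formula, data, K,
                        cost = function(y, yhat) mean((y - yhat)^2)) {
  n    <- nrow(data)
  full <- glm(formula, data = data)
  y    <- full$y
  fold <- sample(rep(seq_len(K), length.out = n))

  raw <- 0
  adj <- cost(y, fitted(full))          # cost of the fit to all of the data
  for (k in seq_len(K)) {
    out   <- fold == k
    p_k   <- sum(out) / n               # share of observations in group k
    fit_k <- glm(formula, data = data[!out, , drop = FALSE])
    # Raw part: cost on the omitted group only.
    raw <- raw + p_k * cost(y[out],
                            predict(fit_k, data[out, , drop = FALSE],
                                    type = "response"))
    # Correction part: cost of the reduced fit on the whole data set.
    adj <- adj - p_k * cost(y, predict(fit_k, data, type = "response"))
  }
  c(raw = raw, adjusted = raw + adj)
}

set.seed(1)
d <- kfold_delta(mpg ~ wt, mtcars, K = 4)
d
```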

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984)
*Classification and Regression Trees*. Wadsworth.

Burman, P. (1989) A comparative study of ordinary cross-validation,
*v*-fold cross-validation and repeated learning-testing methods.
*Biometrika*, **76**, 503--514.

Davison, A.C. and Hinkley, D.V. (1997)
*Bootstrap Methods and Their Application*. Cambridge University Press.

Efron, B. (1986) How biased is the apparent error rate of a prediction rule?
*Journal of the American Statistical Association*, **81**, 461--470.

Stone, M. (1974) Cross-validation choice and assessment of statistical
predictions (with Discussion).
*Journal of the Royal Statistical Society, B*, **36**, 111--147.

    library(boot)  # provides cv.glm, glm.diag and the nodal data set

    # leave-one-out and 6-fold cross-validation prediction error for
    # the mammals data set.
    data(mammals, package="MASS")
    mammals.glm <- glm(log(brain) ~ log(body), data = mammals)
    (cv.err <- cv.glm(mammals, mammals.glm)$delta)
    (cv.err.6 <- cv.glm(mammals, mammals.glm, K = 6)$delta)

    # As this is a linear model we could calculate the leave-one-out
    # cross-validation estimate without any extra model-fitting.
    muhat <- fitted(mammals.glm)
    mammals.diag <- glm.diag(mammals.glm)
    (cv.err <- mean((mammals.glm$y - muhat)^2/(1 - mammals.diag$h)^2))

    # leave-one-out and 11-fold cross-validation prediction error for
    # the nodal data set. Since the response is a binary variable an
    # appropriate cost function is
    cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)

    nodal.glm <- glm(r ~ stage+xray+acid, binomial, data = nodal)
    (cv.err <- cv.glm(nodal, nodal.glm, cost, K = nrow(nodal))$delta)
    (cv.11.err <- cv.glm(nodal, nodal.glm, cost, K = 11)$delta)
