cv.svyglm: CV for `svyglm` objects

Description

Wrapper function that takes a svyglm object (which itself contains a svydesign object) and passes it through cv.svydesign to cv.svy. Chooses linear or logistic regression based on the family of the svyglm object. Returns the survey CV estimate of the mean loss for the model (MSE for linear models, or binary cross-entropy for logistic models).

Usage

cv.svyglm(glm_object, nfolds = 5, na.rm = FALSE)

Arguments

glm_object

A svyglm object created with the survey package

nfolds

Number of folds to use during cross-validation; defaults to 5

na.rm

Logical; whether to drop cases with missing values when taking `svymean` of the test losses. Defaults to FALSE

Value

Object of class svystat: a named vector holding the survey CV estimate of the mean loss (MSE for linear models, or binary cross-entropy for logistic models) for the model in the svyglm object passed as glm_object, with a var attribute giving the variance of that estimate. See surveysummary for details.

Details

If you have created a svydesign object and want to compare several svyglm models, you may prefer the function cv.svydesign.
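As a rough sketch of that alternative, the call below compares three candidate linear models on one stratified design. It assumes cv.svydesign accepts a design_object argument and a formulae argument given as a character vector of model formulas; check ?cv.svydesign for the exact signature in your installed version of surveyCV.

```r
library(survey)
library(surveyCV)
data("api", package = "survey")

# Stratified design, as in the Examples section
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)

# Assumed interface: one character string per candidate model formula
set.seed(1)  # fold assignment is random
cv_compare <- cv.svydesign(design_object = dstrat,
                           formulae = c("api00 ~ ell",
                                        "api00 ~ ell + meals",
                                        "api00 ~ ell + meals + mobility"),
                           nfolds = 5)
cv_compare
```

Because all models are evaluated in a single call, they share the same fold assignment, which makes their CV losses easier to compare than separate cv.svyglm calls would.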

For models other than linear or logistic regression, you can use folds.svy or folds.svydesign to generate CV fold IDs that respect any stratification or clustering in the survey design. You can then carry out K-fold CV as usual, taking care to also use the survey design features and survey weights when fitting models in each training set and also when evaluating models against each test set.
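The manual procedure described above might look roughly like the following. This is a sketch under stated assumptions, not the package's internal implementation: it assumes folds.svy takes the data frame, nfolds, and a strataID column name; the Gamma model with a log link and the squared-error loss are illustrative choices; and the .foldID and .test_loss column names are hypothetical.

```r
library(survey)
library(surveyCV)
data("api", package = "survey")

set.seed(1)  # fold assignment is random
nfolds <- 5

# Fold IDs that respect the stratification by school type
# (assumed signature: folds.svy(Data, nfolds, strataID, clusterID))
apistrat$.foldID <- folds.svy(apistrat, nfolds = nfolds, strataID = "stype")
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)

apistrat$.test_loss <- NA_real_
for (fold in seq_len(nfolds)) {
  test_rows <- which(apistrat$.foldID == fold)
  # Training design keeps the survey weights and design features
  train_des <- subset(dstrat, .foldID != fold)
  # Illustrative model outside the linear/logistic cases
  fit <- svyglm(api00 ~ ell + meals, design = train_des,
                family = Gamma(link = "log"))
  pred <- predict(fit, newdata = apistrat[test_rows, ], type = "response")
  # Squared-error loss on the response scale for each test case
  apistrat$.test_loss[test_rows] <-
    (apistrat$api00[test_rows] - as.numeric(pred))^2
}

# Rebuild the design so it contains the per-case losses, then take a
# design-based mean of the test losses with its standard error
dstrat_loss <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                         data = apistrat, fpc = ~fpc)
cv_est <- svymean(~.test_loss, dstrat_loss)
cv_est
```

Averaging the losses with svymean rather than mean is what lets the final estimate and its SE reflect the weights and stratification of the full sample.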

Examples


# NOT RUN {
# Calculate CV MSE and its SE under one `svyglm` linear model
# for a stratified sample and a one-stage cluster sample,
# using data from the `survey` package
library(survey)
data("api", package = "survey")
# stratified sample
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat,
                    fpc = ~fpc)
glmstrat <- svyglm(api00 ~ ell+meals+mobility, design = dstrat)
cv.svyglm(glmstrat, nfolds = 5)
# one-stage cluster sample
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
glmclus1 <- svyglm(api00 ~ ell+meals+mobility, design = dclus1)
cv.svyglm(glmclus1, nfolds = 5)

# Calculate CV MSE and its SE under one `svyglm` linear model
# for a stratified cluster sample with clusters nested within strata
data(NSFG_data)
library(splines)
NSFG.svydes <- svydesign(id = ~SECU, strata = ~strata, nest = TRUE,
                         weights = ~wgt, data = NSFG_data)
NSFG.svyglm <- svyglm(income ~ ns(age, df = 3), design = NSFG.svydes)
cv.svyglm(glm_object = NSFG.svyglm, nfolds = 4)

# Logistic regression example, using the same stratified cluster sample;
# instead of CV MSE, we calculate CV binary cross-entropy loss,
# where (as with MSE) lower values indicate better fitting models
# (NOTE: na.rm=TRUE is not usually ideal;
#  it's used below purely for convenience, to keep the example short,
#  but a thorough analysis would look for better ways to handle the missing data)
NSFG.svyglm.logreg <- svyglm(KnowPreg ~ ns(age, df = 2),
                             design = NSFG.svydes, family = quasibinomial())
cv.svyglm(glm_object = NSFG.svyglm.logreg, nfolds = 4, na.rm = TRUE)
# }
