folds.svydesign: Creating CV folds based on the `svydesign` object

Description

Wrapper function that takes a svydesign object and the desired number of CV folds, and passes them to folds.svy. It returns a vector of fold IDs, which in most cases you will want to append to your svydesign object using update.svydesign (see Examples below). The fold IDs respect any stratification or clustering in the survey design. You can then carry out K-fold CV as usual, taking care to use the survey design features and survey weights both when fitting models on each training set and when evaluating them on each test set.
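To make "respecting stratification and clustering" concrete, here is a minimal base-R sketch of one way such fold IDs could be built. It is an illustration under stated assumptions, not the code used by folds.svy: the helper make_fold_ids and the toy data are hypothetical. The idea shown is that folds are assigned to whole clusters (PSUs) rather than to individual rows, and that this is done separately within each stratum.

```r
# Illustrative only: NOT the surveyCV implementation.
# make_fold_ids() is a hypothetical helper that assigns fold IDs
# to whole PSUs, separately within each stratum.
make_fold_ids <- function(cluster, strata = rep(1, length(cluster)), nfolds) {
  fold <- integer(length(cluster))
  for (s in unique(strata)) {
    in_stratum <- strata == s
    psus <- unique(cluster[in_stratum])
    # Build a balanced set of fold labels, then shuffle them across the PSUs
    psu_fold <- rep_len(seq_len(nfolds), length(psus))[sample.int(length(psus))]
    # Every row inherits the fold of its PSU
    fold[in_stratum] <- psu_fold[match(cluster[in_stratum], psus)]
  }
  fold
}

set.seed(1)
# Toy data: 2 strata, 4 PSUs per stratum, 3 observations per PSU
toy <- data.frame(stratum = rep(c("A", "B"), each = 12),
                  psu     = rep(1:8, each = 3))
toy$fold <- make_fold_ids(toy$psu, toy$stratum, nfolds = 2)

# Each PSU appears in exactly one fold
table(toy$psu, toy$fold)
# Each fold contains observations from every stratum
table(toy$stratum, toy$fold)
```

Assigning folds at the PSU level keeps correlated observations from the same cluster out of both the training and test sets at once, which is the reason plain row-wise random folds are not appropriate for clustered designs.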

Usage

folds.svydesign(design_object, nfolds)

Arguments

design_object

A svydesign object created using the survey package. The ids and strata arguments (if used) must have been specified as formulas, e.g. svydesign(ids = ~MyPSUs, ...).

nfolds

Number of folds to use during cross-validation.

Value

Integer vector of fold IDs, one per observation in design_object (that is, of length nrow(design_object$variables)). Most likely you will want to append the returned vector to the svydesign object, for instance with update.svydesign (see Examples below).

Details

For the special cases of linear or logistic GLMs, use cv.svydesign or cv.svyglm instead; they automate the whole CV process for you.
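As a rough sketch of that automated route, the call below compares two candidate models for api00 on the stratified API sample. The argument names formulae, design_object and nfolds are used as I understand them from the surveyCV documentation; treat them as an assumption and check ?cv.svydesign before relying on this.

```r
library(survey)
library(surveyCV)
data("api", package = "survey")

# Stratified sample from the `survey` package
dstrat <- svydesign(ids = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)

# Assumed interface: formulas are passed as character strings,
# and folds are generated internally from the design
set.seed(1)
cv_results <- cv.svydesign(formulae = c("api00 ~ ell", "api00 ~ ell + meals"),
                           design_object = dstrat, nfolds = 5)
cv_results
```

With this route there is no need to call folds.svydesign yourself; the fold IDs only need to be created by hand when the model is not one that these wrappers support.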

Examples


# NOT RUN {
# Set up CV folds for a stratified sample and a one-stage cluster sample,
# using data from the `survey` package
library(survey)
data("api", package = "survey")
# stratified sample
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat,
                    fpc = ~fpc)
dstrat <- update(dstrat, .foldID = folds.svydesign(dstrat, nfolds = 5))
# Each fold will have observations from every stratum
with(dstrat$variables, table(stype, .foldID))
# Fold sizes should be roughly equal
table(dstrat$variables$.foldID)
#
# one-stage cluster sample
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
dclus1 <- update(dclus1, .foldID = folds.svydesign(dclus1, nfolds = 5))
# For any given cluster, all its observations will be in the same fold;
# and each fold should contain roughly the same number of clusters
with(dclus1$variables, table(dnum, .foldID))
# But if cluster sizes are unequal,
# the number of individuals per fold will also vary
table(dclus1$variables$.foldID)
# See the end of `intro` vignette for an example of using such folds
# as part of a custom loop over CV folds
# to tune parameters in a design-consistent random forest model
# }
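The vignette mentioned above covers a random forest; as a simpler stand-in, the following sketch shows what a hand-written loop over such folds might look like for a survey-weighted linear model. It is a minimal illustration under assumptions, not the vignette's code: the model formula, the seed, and the choice of a weighted mean of squared errors as the per-fold test loss are all choices made here for illustration.

```r
library(survey)
library(surveyCV)
data("api", package = "survey")

nfolds <- 5
set.seed(2024)

# One-stage cluster sample with design-based fold IDs attached
dclus1 <- svydesign(ids = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
dclus1 <- update(dclus1, .foldID = folds.svydesign(dclus1, nfolds = nfolds))

fold_mse <- numeric(nfolds)
for (k in seq_len(nfolds)) {
  # Training set: keep the design structure by subsetting the design object
  train <- subset(dclus1, .foldID != k)

  # Test set: rows in fold k, together with their survey weights
  test_rows <- dclus1$variables$.foldID == k
  test_data <- dclus1$variables[test_rows, ]
  test_wts  <- weights(dclus1)[test_rows]

  # Fit with the design and weights, then predict on the held-out fold
  fit  <- svyglm(api00 ~ ell + meals, design = train)
  pred <- as.numeric(predict(fit, newdata = test_data))

  # Survey-weighted mean squared error on the test fold
  fold_mse[k] <- weighted.mean((test_data$api00 - pred)^2, w = test_wts)
}
fold_mse
```

Because every cluster sits entirely in one fold, no school district contributes to both the fit and the evaluation within a given iteration.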
