This function performs cross-validation for estimating risk over a sequence of tuning parameters (tau_seq) by fitting a Generalized Linear Model (GLM) to the data.
It evaluates model performance by splitting the dataset into multiple folds, training
the model on a subset of the data, and testing it on the remaining portion.
cross_validation(
  formula,
  cat_init,
  tau_seq,
  discrepancy_method,
  cross_validation_fold_num,
  ...
)
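For illustration, a call might look like the sketch below. The formula, the fold count, and the mean-squared-error discrepancy function are hypothetical stand-ins chosen for this example, and cat_init is assumed to be an object already produced by cat_glm_initialization; the discrepancy is passed as a plain R function, following the argument description below.

risk_estimates <- cross_validation(
  formula = y ~ x1 + x2,                                          # hypothetical GLM formula
  cat_init = cat_init,                                            # assumed output of cat_glm_initialization()
  tau_seq = seq(0.5, 5, by = 0.5),                                # candidate tuning-parameter values
  discrepancy_method = function(pred, obs) mean((obs - pred)^2),  # example discrepancy: MSE
  cross_validation_fold_num = 5
)

# the tau with the smallest averaged risk can then be selected
best_tau <- seq(0.5, 5, by = 0.5)[which.min(risk_estimates)]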
A numeric vector of averaged risk estimates, one for each value of tau in tau_seq.
A formula specifying the GLM. It should at least include the response variable.
A list generated from cat_glm_initialization.
A sequence of tuning parameter values (tau) over which cross-validation will be performed. Each value of tau is used to weight the synthetic data during model fitting.
A function used to calculate the discrepancy (error) between model predictions and actual values.
The number of folds to use in cross-validation. The dataset will be randomly split into this number of subsets, and the model will be trained and tested on different combinations of these subsets.
Additional arguments passed to internal functions.
Randomization of the Data: The data is randomly shuffled into cross_validation_fold_num subsets to ensure that the model is evaluated across different splits of the dataset.
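A minimal sketch of this shuffling step (illustrative only; n_obs, fold_num, and fold_id are not the package's internal names):

n_obs    <- 100
fold_num <- 5
# assign each observation a random fold label, keeping fold sizes as balanced as possible
fold_id <- sample(rep(seq_len(fold_num), length.out = n_obs))
table(fold_id)   # roughly n_obs / fold_num observations per fold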
Model Training and Prediction: For each fold, a training set is used to fit a GLM with varying values of tau (from tau_seq), and the model is evaluated on a test set. The training data consists of both the observed and synthetic data, with the synthetic data weighted by tau.
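The weighting of the synthetic data can be pictured with the following self-contained sketch; the toy data frames and the helper name fit_weighted_glm are assumptions made for illustration, not the package's internal code.

set.seed(1)
obs_train   <- data.frame(x = rnorm(40))
obs_train$y <- 1 + 2 * obs_train$x + rnorm(40)
syn_data    <- data.frame(x = rnorm(20), y = rnorm(20))   # stand-in for synthetic data

fit_weighted_glm <- function(formula, obs, syn, tau, fam = gaussian()) {
  combined    <- rbind(obs, syn)                            # stack observed and synthetic rows
  combined$.w <- c(rep(1, nrow(obs)), rep(tau, nrow(syn)))  # observed weight 1, synthetic weight tau
  glm(formula, family = fam, data = combined, weights = .w)
}

fit <- fit_weighted_glm(y ~ x, obs_train, syn_data, tau = 0.5)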
Risk Estimation: After fitting the model, the discrepancy_method is used to calculate the prediction error for each combination of fold and tau. These errors are accumulated for each tau.
Average Risk Estimate: After completing all folds, the accumulated prediction errors are averaged over the number of folds to provide a final risk estimate for each value of tau.
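Putting the pieces together, an end-to-end sketch of the procedure described above might look as follows. Everything here (the simulated observed and synthetic data, the MSE discrepancy, and the plain glm() call with weights) is an illustrative assumption rather than the package's internal implementation.

set.seed(42)
obs_data   <- data.frame(x = rnorm(60))
obs_data$y <- 1 + 2 * obs_data$x + rnorm(60)             # simulated observed data
syn_data   <- data.frame(x = rnorm(30), y = rnorm(30))   # stand-in for synthetic data

tau_seq  <- c(0.25, 0.5, 1, 2, 4)
fold_num <- 5
fold_id  <- sample(rep(seq_len(fold_num), length.out = nrow(obs_data)))
mse      <- function(pred, obs) mean((obs - pred)^2)     # example discrepancy function

risk_sums <- numeric(length(tau_seq))
for (k in seq_len(fold_num)) {
  test  <- obs_data[fold_id == k, ]
  train <- obs_data[fold_id != k, ]
  for (j in seq_along(tau_seq)) {
    combined    <- rbind(train, syn_data)                # observed training rows plus synthetic rows
    combined$.w <- c(rep(1, nrow(train)), rep(tau_seq[j], nrow(syn_data)))
    fit  <- glm(y ~ x, family = gaussian(), data = combined, weights = .w)
    pred <- predict(fit, newdata = test, type = "response")
    risk_sums[j] <- risk_sums[j] + mse(pred, test$y)     # accumulate prediction error per tau
  }
}

risk_estimates <- risk_sums / fold_num                   # averaged risk, one value per tau
best_tau <- tau_seq[which.min(risk_estimates)]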