
SVEMnet (version 3.2.0)

glmnet_with_cv: Fit a glmnet Model with Repeated Cross-Validation

Description

Repeated K-fold cross-validation over a per-alpha lambda path, with a combined 1-SE rule across repeats. Preserves fields expected by predict.svem_model() and internal prediction helpers. Optionally uses glmnet's built-in relaxed elastic net for both the warm-start path and each CV fit. When relaxed = TRUE, the final coefficients are taken from a cv.glmnet() object at the chosen lambda so that the returned model reflects the relaxed solution (including its chosen gamma).

Usage

glmnet_with_cv(
  formula,
  data,
  glmnet_alpha = c(0.5, 1),
  standardize = TRUE,
  nfolds = 10,
  repeats = 5,
  choose_rule = c("min", "1se"),
  seed = NULL,
  exclude = NULL,
  relaxed = FALSE,
  relax_gamma = NULL,
  family = c("gaussian", "binomial"),
  ...
)

Value

A list of class c("svem_cv","svem_model") with elements:

  • parms Named numeric vector of coefficients (including "(Intercept)").

  • glmnet_alpha Numeric vector of alphas searched.

  • best_alpha Numeric; winning alpha.

  • best_lambda Numeric; winning lambda.

  • y_pred In-sample predictions from the returned coefficients (fitted values for Gaussian; probabilities for binomial).

  • debias_fit For Gaussian, an optional lm(y ~ y_pred) calibration model; NULL otherwise.

  • y_pred_debiased If debias_fit exists, its fitted values; otherwise NULL.

  • cv_summary Named list (one element per alpha) of data frames with columns lambda, mean_cvm, sd_cvm, se_combined, n_repeats, idx_min, idx_1se.

  • formula Original modeling formula.

  • terms Training terms object with environment set to baseenv().

  • training_X Training design matrix (without intercept column).

  • actual_y Training response vector used for glmnet: numeric y for Gaussian, or 0/1 numeric y for binomial.

  • xlevels Factor and character levels seen during training (for safe prediction).

  • contrasts Contrasts used for factor predictors during training.

  • schema List of the form list(feature_names, terms_str, xlevels, contrasts, terms_hash), used for deterministic prediction.

  • note Character vector of notes (for example, dropped rows, intercept-only path, ridge fallback, relaxed-coefficient source).

  • meta List with fields such as nfolds, repeats, rule, family, relaxed, relax_cv_fallbacks, and cv_object (the final cv.glmnet() object when relaxed = TRUE and keep = TRUE, otherwise NULL).

  • diagnostics List of simple diagnostics for the selected model, currently including:

    • k_final: number of coefficients estimated as nonzero including the intercept.

    • k_final_no_intercept: number of nonzero slope coefficients (excludes the intercept).

  • family Character scalar giving the resolved family ("gaussian" or "binomial"), mirroring meta$family.

Arguments

formula

Model formula.

data

Data frame containing the variables in the model.

glmnet_alpha

Numeric vector of Elastic Net mixing parameters (alphas) in [0,1]; default c(0.5, 1). When relaxed = TRUE, any alpha = 0 (ridge) is dropped with a warning.

standardize

Logical passed to glmnet() and cv.glmnet() (default TRUE).

nfolds

Requested number of CV folds (default 10). Internally constrained so that each fold contains at least about 3 observations and, when possible, at least 5 folds are used.
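One plausible reading of this constraint, as a base-R sketch (illustrative only; the helper name and the exact internal rule are assumptions, not the package's code):

```r
# Hypothetical helper illustrating the fold-count constraint described above.
effective_nfolds <- function(nfolds, n) {
  max_folds <- floor(n / 3)          # keep roughly >= 3 observations per fold
  k <- min(nfolds, max_folds)
  if (max_folds >= 5) k <- max(k, 5) # prefer at least 5 folds when possible
  max(k, 2)                          # cv.glmnet() requires at least 2 folds
}

effective_nfolds(10, 100)  # 10: enough data for the requested folds
effective_nfolds(10, 18)   # 6: capped by the 3-per-fold rule
```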

repeats

Number of independent CV repeats (default 5). Each repeat reuses the same folds across all alphas for paired comparisons.

choose_rule

Character; how to choose lambda within each alpha:

  • "min": lambda minimizing the cross-validated criterion.

  • "1se": largest lambda within 1 combined SE of the minimum, where the SE includes both within- and between-repeat variability.

Default is "min". In small-mixture simulations, the 1-SE rule tended to increase RMSE on held-out data, so "min" is used as the default here.

seed

Optional integer seed for reproducible fold IDs (and the ridge fallback, if used).

exclude

Optional vector or function for glmnet's exclude= argument. If a function, cv.glmnet() applies it inside each training fold (requires glmnet >= 4.1-2).
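As a sketch, a filter of this kind might drop near-constant columns inside each training fold. The filtering logic below is self-contained base R; the three-argument signature follows glmnet's documented exclude-as-function interface, but check ?glmnet::glmnet for the exact contract before relying on it:

```r
# Sketch: a candidate exclude= filter that flags near-constant columns.
# glmnet (>= 4.1-2) calls such a function inside each training fold and
# expects the indices of columns to exclude from the penalized fit.
drop_near_constant <- function(x, y, weights) {
  which(apply(x, 2, sd) < 1e-8)
}

# Pure filtering logic, runnable without glmnet:
set.seed(42)
X_demo <- cbind(a = rnorm(20), b = rep(1, 20), c = rnorm(20))
drop_near_constant(X_demo, NULL, NULL)  # column 2 (the constant column "b")
```

It would then be passed as exclude = drop_near_constant in the glmnet_with_cv() call.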

relaxed

Logical; if TRUE, call glmnet() and cv.glmnet() with relax = TRUE and optionally a gamma path (default FALSE). If cv.glmnet(relax = TRUE) fails for a particular repeat/alpha, the function retries that fit without relaxation; the number of such fallbacks is recorded in meta$relax_cv_fallbacks.

relax_gamma

Optional numeric vector passed as gamma= to glmnet() and cv.glmnet() when relaxed = TRUE. If NULL, glmnet's internal default gamma grid is used.

family

Model family: either "gaussian" or "binomial", or the corresponding stats::gaussian() or stats::binomial() family objects with canonical links. For Gaussian, y must be numeric. For binomial, y must be 0/1 numeric, logical, or a factor with exactly 2 levels (the second level is treated as 1). Non-canonical links are not supported.
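The accepted binomial encodings can be sketched as a small coercion helper (illustrative; the function name is hypothetical and the package's internal checks may differ in detail):

```r
# Sketch of the binomial response coercion described above.
as_binomial_y <- function(y) {
  if (is.logical(y)) return(as.numeric(y))
  if (is.factor(y)) {
    if (nlevels(y) != 2) stop("binomial y must have exactly 2 levels")
    return(as.numeric(y == levels(y)[2]))  # second level treated as 1
  }
  if (is.numeric(y) && all(y %in% c(0, 1))) return(y)
  stop("binomial y must be 0/1 numeric, logical, or a 2-level factor")
}

as_binomial_y(factor(c("no", "yes", "yes")))  # 0 1 1
as_binomial_y(c(TRUE, FALSE))                 # 1 0
```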

...

Additional arguments forwarded to both cv.glmnet() and glmnet(), for example: weights, parallel, type.measure, intercept, maxit, lower.limits, upper.limits, penalty.factor, offset, standardize.response, keep, and so on. If family is supplied here, it is ignored in favor of the explicit family argument.

Acknowledgments

OpenAI's GPT models (o1-preview through GPT-5 Pro) were used to assist with coding and roxygen documentation; all content was reviewed and finalized by the author.

Details

This function is a convenience wrapper around glmnet() and cv.glmnet() that returns an object in the same structural format as SVEMnet() (class "svem_model"). It is intended for:

  • direct comparison of standard cross-validated glmnet fits to SVEMnet models using the same prediction and schema tools, or

  • users who want a repeated-cv.glmnet() workflow without any SVEM weighting or bootstrap ensembling.

It is not called internally by the SVEM bootstrap routines.

The basic workflow is:

  1. For each repeat, generate a set of CV fold IDs; within a repeat, the same folds are reused across all alphas so that alpha comparisons are paired.

  2. For that alpha, run repeats independent cv.glmnet() fits, align the lambda paths, and aggregate the CV curves.

  3. At each lambda, compute a combined SE that accounts for both within-repeat and between-repeat variability.

  4. Apply choose_rule ("min" or "1se") to select lambda for that alpha, then choose the best alpha by comparing these per-alpha scores.
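Steps 3-4 can be sketched in base R with one plausible combined-SE construction (illustrative only; the package's exact formula may differ):

```r
# cvm:  repeats x lambda matrix of per-repeat CV means
# cvsd: matching matrix of per-repeat CV standard errors
select_lambda <- function(lambda, cvm, cvsd, rule = c("min", "1se")) {
  rule <- match.arg(rule)
  mean_cvm <- colMeans(cvm)
  # within-repeat uncertainty (average squared SE) plus the
  # between-repeat spread of the repeat means:
  se_combined <- sqrt(colMeans(cvsd^2) + apply(cvm, 2, var) / nrow(cvm))
  idx_min <- which.min(mean_cvm)
  if (rule == "min") return(lambda[idx_min])
  thresh <- mean_cvm[idx_min] + se_combined[idx_min]
  # largest lambda whose mean CV error is within one combined SE of the min
  max(lambda[mean_cvm <= thresh])
}

lam  <- c(1, 0.5, 0.1)
cvm  <- rbind(c(3.0, 2.2, 2.0), c(3.2, 2.3, 2.1))
cvsd <- rbind(c(0.3, 0.2, 0.2), c(0.3, 0.2, 0.2))
select_lambda(lam, cvm, cvsd, "min")  # 0.1
select_lambda(lam, cvm, cvsd, "1se")  # 0.5 (a sparser, more regularized fit)
```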

Special cases and fallbacks:

  • If there are no predictors after model.matrix() (an intercept-only model), the function returns an intercept-only fit without calling glmnet(), along with a minimal schema for safe prediction.

  • If all cv.glmnet() attempts fail for every alpha (a rare edge case), the function falls back to a manual ridge (alpha = 0) CV search over a fixed lambda grid and returns the best ridge solution. For Gaussian models this search uses a mean-squared-error criterion; for binomial models it uses a negative log-likelihood (deviance-equivalent) criterion.

Family-specific behavior:

  • For the Gaussian family, an optional calibration lm(y ~ y_pred) is fit on the training data (when there is sufficient variation), and both y_pred and y_pred_debiased are stored.

  • For the binomial family, y_pred is always on the probability (response) scale and debiasing is not applied. Both the primary cross-validation and any ridge fallback use deviance-style criteria (binomial negative log-likelihood) rather than squared error.
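The Gaussian calibration step amounts to a simple linear refit of the response on the in-sample predictions, as in this self-contained sketch (synthetic data; not the package's code):

```r
# Sketch of the Gaussian debiasing step: lm(y ~ y_pred) on training data.
set.seed(1)
y      <- rnorm(50, mean = 5)
y_pred <- 0.8 * y + 0.5 + rnorm(50, sd = 0.1)  # a slightly biased predictor

debias_fit      <- lm(y ~ y_pred)
y_pred_debiased <- fitted(debias_fit)

# With an intercept, the calibrated predictions match the training mean:
mean(y) - mean(y_pred_debiased)  # ~ 0
```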

Design-matrix schema and contrasts:

  • The training terms are stored with environment set to baseenv().

  • Factor and character levels are recorded in xlevels for safe prediction.

  • Per-factor contrasts are stored in contrasts, normalized so that any contrasts recorded as character names are converted back to contrast functions at prediction time.

The returned object inherits classes "svem_cv" and "svem_model" and is designed to be compatible with SVEMnet prediction and schema utilities. It is a standalone, standard glmnet CV workflow that does not use SVEM-style bootstrap weighting or ensembling.

References

Gotwalt, C., & Ramsey, P. (2018). Model Validation Strategies for Designed Experiments Using Bootstrapping Techniques With Applications to Biopharmaceuticals. JMP Discovery Conference. https://community.jmp.com/t5/Abstracts/Model-Validation-Strategies-for-Designed-Experiments-Using/ev-p/849873/redirect_from_archived_page/true

Karl, A. T. (2024). A randomized permutation whole-model test heuristic for Self-Validated Ensemble Models (SVEM). Chemometrics and Intelligent Laboratory Systems, 249, 105122. doi:10.1016/j.chemolab.2024.105122

Karl, A., Wisnowski, J., & Rushing, H. (2022). JMP Pro 17 Remedies for Practical Struggles with Mixture Experiments. JMP Discovery Conference. doi:10.13140/RG.2.2.34598.40003/1

Lemkus, T., Gotwalt, C., Ramsey, P., & Weese, M. L. (2021). Self-Validated Ensemble Models for Design of Experiments. Chemometrics and Intelligent Laboratory Systems, 219, 104439. doi:10.1016/j.chemolab.2021.104439

Xu, L., Gotwalt, C., Hong, Y., King, C. B., & Meeker, W. Q. (2020). Applications of the Fractional-Random-Weight Bootstrap. The American Statistician, 74(4), 345–358. doi:10.1080/00031305.2020.1731599

Ramsey, P., Gaudard, M., & Levin, W. (2021). Accelerating Innovation with Space Filling Mixture Designs, Neural Networks and SVEM. JMP Discovery Conference. https://community.jmp.com/t5/Abstracts/Accelerating-Innovation-with-Space-Filling-Mixture-Designs/ev-p/756841

Ramsey, P., & Gotwalt, C. (2018). Model Validation Strategies for Designed Experiments Using Bootstrapping Techniques With Applications to Biopharmaceuticals. JMP Discovery Conference - Europe. https://community.jmp.com/t5/Abstracts/Model-Validation-Strategies-for-Designed-Experiments-Using/ev-p/849647/redirect_from_archived_page/true

Ramsey, P., Levin, W., Lemkus, T., & Gotwalt, C. (2021). SVEM: A Paradigm Shift in Design and Analysis of Experiments. JMP Discovery Conference - Europe. https://community.jmp.com/t5/Abstracts/SVEM-A-Paradigm-Shift-in-Design-and-Analysis-of-Experiments-2021/ev-p/756634

Ramsey, P., & McNeill, P. (2023). CMC, SVEM, Neural Networks, DOE, and Complexity: It's All About Prediction. JMP Discovery Conference.

Friedman, J. H., Hastie, T., & Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1–22.

Meinshausen, N. (2007). Relaxed Lasso. Computational Statistics & Data Analysis, 52(1), 374–393.

Kish, L. (1965). Survey Sampling. Wiley.

Lumley, T. (2004). Analysis of complex survey samples. Journal of Statistical Software, 9(1), 1–19.

Lumley, T., & Scott, A. (2015). AIC and BIC for modelling with complex survey data. Journal of Survey Statistics and Methodology, 3(1), 1–18.

Examples

set.seed(123)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- c(1, -1, rep(0, p - 2))
y <- as.numeric(X %*% beta + rnorm(n))
df_ex <- data.frame(y = y, X)
colnames(df_ex) <- c("y", paste0("x", 1:p))

# Gaussian example, v1-like behavior: choose_rule = "min"
fit_min <- glmnet_with_cv(
  y ~ ., df_ex,
  glmnet_alpha = 1,
  nfolds = 5,
  repeats = 1,
  choose_rule = "min",
  seed = 42,
  family = "gaussian"
)

# Gaussian example, relaxed path with gamma search
fit_relax <- glmnet_with_cv(
  y ~ ., df_ex,
  glmnet_alpha = 1,
  nfolds = 5,
  repeats = 1,
  relaxed = TRUE,
  seed = 42,
  family = "gaussian"
)

# Binomial example (numeric 0/1 response)
set.seed(456)
n2 <- 150; p2 <- 8
X2 <- matrix(rnorm(n2 * p2), n2, p2)
beta2 <- c(1.0, -1.5, rep(0, p2 - 2))
linpred <- as.numeric(X2 %*% beta2)
prob <- plogis(linpred)
y_bin <- rbinom(n2, size = 1, prob = prob)
df_bin <- data.frame(y = y_bin, X2)
colnames(df_bin) <- c("y", paste0("x", 1:p2))

fit_bin <- glmnet_with_cv(
  y ~ ., df_bin,
  glmnet_alpha = c(0.5, 1),
  nfolds = 5,
  repeats = 2,
  seed = 99,
  family = "binomial"
)
