Repeated K-fold cross-validation over a per-alpha lambda path, with a
combined 1-SE rule across repeats. Preserves fields expected by
predict.svem_model() and internal prediction helpers. Optionally uses
glmnet's built-in relaxed elastic net for both the warm-start path and
each CV fit. When relaxed = TRUE, the final coefficients are taken
from a cv.glmnet() object at the chosen lambda so that the returned
model reflects the relaxed solution (including its chosen gamma).
Usage

glmnet_with_cv(
  formula,
  data,
  glmnet_alpha = c(0.5, 1),
  standardize = TRUE,
  nfolds = 10,
  repeats = 5,
  choose_rule = c("min", "1se"),
  seed = NULL,
  exclude = NULL,
  relaxed = FALSE,
  relax_gamma = NULL,
  family = c("gaussian", "binomial"),
  ...
)

Value

A list of class c("svem_cv", "svem_model") with elements:
parms: Named numeric vector of coefficients (including "(Intercept)").

glmnet_alpha: Numeric vector of alphas searched.

best_alpha: Numeric; winning alpha.

best_lambda: Numeric; winning lambda.

y_pred: In-sample predictions from the returned coefficients (fitted values for Gaussian; probabilities for binomial).

debias_fit: For Gaussian, an optional lm(y ~ y_pred) calibration model; NULL otherwise.

y_pred_debiased: If debias_fit exists, its fitted values; otherwise NULL.

cv_summary: Named list (one element per alpha) of data frames with columns lambda, mean_cvm, sd_cvm, se_combined, n_repeats, idx_min, idx_1se.

formula: Original modeling formula.

terms: Training terms object with environment set to baseenv().

training_X: Training design matrix (without intercept column).

actual_y: Training response vector used for glmnet: numeric y for Gaussian, or 0/1 numeric y for binomial.

xlevels: Factor and character levels seen during training (for safe prediction).

contrasts: Contrasts used for factor predictors during training.

schema: List list(feature_names, terms_str, xlevels, contrasts, terms_hash) for deterministic prediction.

note: Character vector of notes (for example, dropped rows, intercept-only path, ridge fallback, relaxed-coefficient source).

meta: List with fields such as nfolds, repeats, rule, family, relaxed, relax_cv_fallbacks, and cv_object (the final cv.glmnet() object when relaxed = TRUE and keep = TRUE, otherwise NULL).

diagnostics: List of simple diagnostics for the selected model, currently including k_final (the number of coefficients estimated as nonzero, including the intercept) and k_final_no_intercept (the number of nonzero slope coefficients, excluding the intercept).

family: Character scalar giving the resolved family ("gaussian" or "binomial"), mirroring meta$family.
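The parms, training_X, and y_pred elements fit together as a simple linear-predictor computation. A minimal base-R sketch (coefficient values and column names here are illustrative, not package internals):

```r
# Hypothetical fitted coefficients, as stored in `parms`
parms <- c("(Intercept)" = 0.5, x1 = 1.2, x2 = -0.7)

# `training_X` carries no intercept column; the intercept lives in `parms`
X <- cbind(x1 = c(1, 2), x2 = c(0, 1))

# Linear predictor: intercept plus slopes times the matching columns
eta <- drop(parms["(Intercept)"] + X %*% parms[colnames(X)])

# Gaussian: y_pred is eta itself; binomial: y_pred is plogis(eta)
```

This is the same arithmetic predict.svem_model() performs after rebuilding the design matrix from the stored schema.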
Arguments

formula: Model formula.

data: Data frame containing the variables in the model.

glmnet_alpha: Numeric vector of elastic net mixing parameters (alphas) in [0, 1]; default c(0.5, 1). When relaxed = TRUE, any alpha = 0 (ridge) is dropped with a warning.

standardize: Logical passed to glmnet() and cv.glmnet() (default TRUE).

nfolds: Requested number of CV folds (default 10). Internally constrained so that there are at least about 3 observations per fold and at least 5 folds when possible.

repeats: Number of independent CV repeats (default 5). Each repeat reuses the same folds across all alphas for paired comparisons.

choose_rule: Character; how to choose lambda within each alpha. "min" selects the lambda minimizing the cross-validated criterion; "1se" selects the largest lambda within one combined SE of the minimum, where the SE includes both within- and between-repeat variability. Default is "min": in small-mixture simulations, the 1-SE rule tended to increase RMSE on held-out data.

seed: Optional integer seed for reproducible fold IDs (and for the ridge fallback, if used).

exclude: Optional vector or function for glmnet's exclude= argument. If a function, cv.glmnet() applies it inside each training fold (requires glmnet >= 4.1-2).

relaxed: Logical; if TRUE, call glmnet() and cv.glmnet() with relax = TRUE and optionally a gamma path (default FALSE). If cv.glmnet(relax = TRUE) fails for a particular repeat/alpha, the function retries that fit without relaxation; the number of such fallbacks is recorded in meta$relax_cv_fallbacks.

relax_gamma: Optional numeric vector passed as gamma= to glmnet() and cv.glmnet() when relaxed = TRUE. If NULL, glmnet's internal default gamma grid is used.

family: Model family: either "gaussian" or "binomial", or the corresponding stats::gaussian() or stats::binomial() family objects with canonical links. For Gaussian, y must be numeric. For binomial, y must be 0/1 numeric, logical, or a factor with exactly 2 levels (the second level is treated as 1). Non-canonical links are not supported.

...: Additional arguments forwarded to both cv.glmnet() and glmnet(), for example weights, parallel, type.measure, intercept, maxit, lower.limits, upper.limits, penalty.factor, offset, standardize.response, and keep. If family is supplied here, it is ignored in favor of the explicit family argument.
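The binomial response coercion described for family can be sketched in a few lines of base R; the helper name to_binary is hypothetical, not a package function:

```r
# Coerce a response to 0/1 numeric, mirroring the documented rules:
# 0/1 numeric passes through, logical maps TRUE -> 1, and a two-level
# factor maps its second level to 1.
to_binary <- function(y) {
  if (is.factor(y)) {
    stopifnot(nlevels(y) == 2)
    as.numeric(y == levels(y)[2])   # second level treated as 1
  } else if (is.logical(y)) {
    as.numeric(y)
  } else {
    stopifnot(all(y %in% c(0, 1)))  # reject anything outside {0, 1}
    as.numeric(y)
  }
}

to_binary(factor(c("no", "yes", "no")))  # second level "yes" becomes 1
```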
OpenAI's GPT models (o1-preview through GPT-5 Pro) were used to assist with coding and roxygen documentation; all content was reviewed and finalized by the author.
Details

This function is a convenience wrapper around glmnet() and cv.glmnet() that returns an object in the same structural format as SVEMnet() (class "svem_model"). It is intended for:

  * direct comparison of standard cross-validated glmnet fits to SVEMnet models using the same prediction and schema tools, or
  * users who want a repeated-cv.glmnet() workflow without any SVEM weighting or bootstrap ensembling.

It is not called internally by the SVEM bootstrap routines.
The basic workflow is:

  1. For each alpha in glmnet_alpha, generate a set of CV fold IDs (shared across alphas and repeats).
  2. For that alpha, run repeats independent cv.glmnet() fits, align the lambda paths, and aggregate the CV curves.
  3. At each lambda, compute a combined SE that accounts for both within-repeat and between-repeat variability.
  4. Apply choose_rule ("min" or "1se") to select lambda for that alpha, then choose the best alpha by comparing these per-alpha scores.
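The combined-SE aggregation in steps 3 and 4 can be illustrated with a base-R sketch on a synthetic CV surface (one row per repeat, one column per lambda on glmnet's decreasing path); the variable names are illustrative, not the package internals:

```r
set.seed(1)
R <- 5; L <- 20
# Simulated mean CV error per repeat/lambda, trending down then flat,
# plus the within-repeat SEs that cv.glmnet() would report as cvsd
cvm  <- matrix(rnorm(R * L, mean = seq(2, 1, length.out = L)), R, L, byrow = TRUE)
cvsd <- matrix(runif(R * L, 0.05, 0.15), R, L)

mean_cvm    <- colMeans(cvm)            # averaged CV curve across repeats
within_var  <- colMeans(cvsd^2)         # mean within-repeat variance
between_var <- apply(cvm, 2, var) / R   # variance of the repeat means
se_combined <- sqrt(within_var + between_var)

idx_min <- which.min(mean_cvm)
# 1-SE rule: smallest column index (largest lambda on the decreasing path)
# whose mean CV error is within one combined SE of the minimum
idx_1se <- min(which(mean_cvm <= mean_cvm[idx_min] + se_combined[idx_min]))
```

Because the between-repeat term enters the SE, a noisier set of repeats widens the 1-SE band and pushes the "1se" choice toward heavier penalization.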
Special cases and fallbacks:

  * If there are no predictors after model.matrix() (an intercept-only model), the function returns an intercept-only fit without calling glmnet(), along with a minimal schema for safe prediction.
  * If all cv.glmnet() attempts fail for every alpha (a rare edge case), the function falls back to a manual ridge (alpha = 0) CV search over a fixed lambda grid and returns the best ridge solution. For Gaussian models this search uses a mean-squared-error criterion; for binomial models it uses a negative log-likelihood (deviance-equivalent) criterion.
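The two fallback criteria mentioned above are standard and easy to state explicitly; this is a hedged base-R sketch with hypothetical function names, not the package's internal code:

```r
# Gaussian fallback criterion: mean squared error of the fold predictions
mse_criterion <- function(y, mu) mean((y - mu)^2)

# Binomial fallback criterion: mean negative log-likelihood, which is
# the deviance up to a factor of 2n; probabilities are clamped away
# from 0 and 1 so the log terms stay finite
binom_nll <- function(y, p, eps = 1e-12) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}
```

Minimizing binom_nll over the lambda grid is equivalent to minimizing binomial deviance, which is why the documentation calls it a deviance-equivalent criterion.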
Family-specific behavior:

  * For the Gaussian family, an optional calibration lm(y ~ y_pred) is fit on the training data (when there is sufficient variation), and both y_pred and y_pred_debiased are stored.
  * For the binomial family, y_pred is always on the probability (response) scale and debiasing is not applied. Both the primary cross-validation and any ridge fallback use deviance-style criteria (binomial negative log-likelihood) rather than squared error.
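The Gaussian calibration step amounts to a one-line refit of the response on the in-sample predictions; a minimal sketch, with stand-in data rather than an actual glmnet fit:

```r
set.seed(2)
y <- rnorm(50)
y_pred <- 0.8 * y + rnorm(50, sd = 0.3)  # stand-in for the model's fitted values

# Fit the calibration model only when the predictions actually vary,
# mirroring the "sufficient variation" guard described above
debias_fit <- if (sd(y_pred) > .Machine$double.eps) lm(y ~ y_pred) else NULL
y_pred_debiased <- if (!is.null(debias_fit)) fitted(debias_fit) else NULL
```

Shrinkage biases fitted values toward the mean, so the calibration slope is typically above 1 and the debiased predictions stretch the fits back toward the observed scale.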
Design-matrix schema and contrasts:

  * The training terms are stored with environment set to baseenv().
  * Factor and character levels are recorded in xlevels for safe prediction.
  * Per-factor contrasts are stored in contrasts, normalized so that any contrasts recorded as character names are converted back to contrast functions at prediction time.
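How a stored schema supports safe prediction can be shown with base R alone; the data and names below are illustrative, but the terms/xlevels/contrasts mechanics are the standard stats-package ones the documentation refers to:

```r
# "Training" data with a factor predictor
train <- data.frame(y = rnorm(6), f = factor(c("a", "b", "c", "a", "b", "c")))

# Schema pieces as the text describes them: terms with baseenv(),
# recorded factor levels, and contrasts stored by character name
tt <- delete.response(terms(y ~ f, data = train))
environment(tt) <- baseenv()
xlev <- list(f = levels(train$f))
ctr  <- list(f = "contr.treatment")

# New data seen at prediction time; imposing xlev keeps the level set
# (and hence the dummy columns) identical to training
new <- data.frame(f = factor(c("c", "a"), levels = xlev$f))
mf  <- model.frame(tt, new, xlev = xlev)
X   <- model.matrix(tt, mf, contrasts.arg = ctr)  # same columns as training
```

Without the stored xlevels, a new-data factor missing a training level would silently produce a narrower design matrix and misaligned coefficients.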
The returned object inherits classes "svem_cv" and "svem_model"
and is designed to be compatible with SVEMnet prediction and schema
utilities. It is a standalone, standard glmnet CV workflow that does not use
SVEM-style bootstrap weighting or ensembling.
References

Gotwalt, C., & Ramsey, P. (2018). Model Validation Strategies for Designed Experiments Using Bootstrapping Techniques With Applications to Biopharmaceuticals. JMP Discovery Conference. https://community.jmp.com/t5/Abstracts/Model-Validation-Strategies-for-Designed-Experiments-Using/ev-p/849873/redirect_from_archived_page/true

Karl, A. T. (2024). A randomized permutation whole-model test heuristic for Self-Validated Ensemble Models (SVEM). Chemometrics and Intelligent Laboratory Systems, 249, 105122. doi:10.1016/j.chemolab.2024.105122

Karl, A., Wisnowski, J., & Rushing, H. (2022). JMP Pro 17 Remedies for Practical Struggles with Mixture Experiments. JMP Discovery Conference. doi:10.13140/RG.2.2.34598.40003/1

Lemkus, T., Gotwalt, C., Ramsey, P., & Weese, M. L. (2021). Self-Validated Ensemble Models for Design of Experiments. Chemometrics and Intelligent Laboratory Systems, 219, 104439. doi:10.1016/j.chemolab.2021.104439

Xu, L., Gotwalt, C., Hong, Y., King, C. B., & Meeker, W. Q. (2020). Applications of the Fractional-Random-Weight Bootstrap. The American Statistician, 74(4), 345-358. doi:10.1080/00031305.2020.1731599

Ramsey, P., Gaudard, M., & Levin, W. (2021). Accelerating Innovation with Space Filling Mixture Designs, Neural Networks and SVEM. JMP Discovery Conference. https://community.jmp.com/t5/Abstracts/Accelerating-Innovation-with-Space-Filling-Mixture-Designs/ev-p/756841

Ramsey, P., & Gotwalt, C. (2018). Model Validation Strategies for Designed Experiments Using Bootstrapping Techniques With Applications to Biopharmaceuticals. JMP Discovery Conference - Europe. https://community.jmp.com/t5/Abstracts/Model-Validation-Strategies-for-Designed-Experiments-Using/ev-p/849647/redirect_from_archived_page/true

Ramsey, P., Levin, W., Lemkus, T., & Gotwalt, C. (2021). SVEM: A Paradigm Shift in Design and Analysis of Experiments. JMP Discovery Conference - Europe. https://community.jmp.com/t5/Abstracts/SVEM-A-Paradigm-Shift-in-Design-and-Analysis-of-Experiments-2021/ev-p/756634

Ramsey, P., & McNeill, P. (2023). CMC, SVEM, Neural Networks, DOE, and Complexity: It's All About Prediction. JMP Discovery Conference.

Friedman, J. H., Hastie, T., & Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22.

Meinshausen, N. (2007). Relaxed Lasso. Computational Statistics & Data Analysis, 52(1), 374-393.

Kish, L. (1965). Survey Sampling. Wiley.

Lumley, T. (2004). Analysis of complex survey samples. Journal of Statistical Software, 9(1), 1-19.

Lumley, T., & Scott, A. (2015). AIC and BIC for modelling with complex survey data. Journal of Survey Statistics and Methodology, 3(1), 1-18.
Examples

library(SVEMnet)

set.seed(123)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- c(1, -1, rep(0, p - 2))
y <- as.numeric(X %*% beta + rnorm(n))
df_ex <- data.frame(y = y, X)
colnames(df_ex) <- c("y", paste0("x", 1:p))
# Gaussian example, v1-like behavior: choose_rule = "min"
fit_min <- glmnet_with_cv(
y ~ ., df_ex,
glmnet_alpha = 1,
nfolds = 5,
repeats = 1,
choose_rule = "min",
seed = 42,
family = "gaussian"
)
# Gaussian example, relaxed path with gamma search
fit_relax <- glmnet_with_cv(
y ~ ., df_ex,
glmnet_alpha = 1,
nfolds = 5,
repeats = 1,
relaxed = TRUE,
seed = 42,
family = "gaussian"
)
# Binomial example (numeric 0/1 response)
set.seed(456)
n2 <- 150; p2 <- 8
X2 <- matrix(rnorm(n2 * p2), n2, p2)
beta2 <- c(1.0, -1.5, rep(0, p2 - 2))
linpred <- as.numeric(X2 %*% beta2)
prob <- plogis(linpred)
y_bin <- rbinom(n2, size = 1, prob = prob)
df_bin <- data.frame(y = y_bin, X2)
colnames(df_bin) <- c("y", paste0("x", 1:p2))
fit_bin <- glmnet_with_cv(
y ~ ., df_bin,
glmnet_alpha = c(0.5, 1),
nfolds = 5,
repeats = 2,
seed = 99,
family = "binomial"
)