Run the search part and the evaluation part for a projection predictive variable selection. The search part determines the solution path, i.e., the best submodel for each submodel size (number of predictor terms). The evaluation part determines the predictive performance of the submodels along the solution path. In contrast to varsel(), cv_varsel() performs a cross-validation (CV) by running the search part with the training data of each CV fold separately (an exception is explained in section "Note" below) and running the evaluation part on the corresponding test set of each CV fold.
cv_varsel(object, ...)

# S3 method for default
cv_varsel(object, ...)

# S3 method for refmodel
cv_varsel(
  object,
  method = NULL,
  cv_method = if (!inherits(object, "datafit")) "LOO" else "kfold",
  ndraws = NULL,
  nclusters = 20,
  ndraws_pred = 400,
  nclusters_pred = NULL,
  refit_prj = !inherits(object, "datafit"),
  nterms_max = NULL,
  penalty = NULL,
  verbose = TRUE,
  nloo = NULL,
  K = if (!inherits(object, "datafit")) 5 else 10,
  lambda_min_ratio = 1e-05,
  nlambda = 150,
  thresh = 1e-06,
  regul = 1e-04,
  validate_search = TRUE,
  seed = sample.int(.Machine$integer.max, 1),
  search_terms = NULL,
  ...
)
Value

An object of class vsel. The elements of this object are not meant to be accessed directly but instead via helper functions (see the main vignette and projpred-package).
Arguments

object: An object of class refmodel (returned by get_refmodel() or init_refmodel()) or an object that can be passed to argument object of get_refmodel().
...: Arguments passed to get_refmodel() as well as to the divergence minimizer (during a forward search and also during the evaluation part, but the latter only if refit_prj is TRUE).
method: The method for the search part. Possible options are "L1" for L1 search and "forward" for forward search. If NULL, then internally, "L1" is used, except if (i) the reference model has multilevel or additive terms, (ii) !is.null(search_terms), or (iii) the augmented-data projection is used. See also section "Details" below.
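For example, to request a forward search explicitly instead of relying on the internal default (a minimal sketch; fit denotes a hypothetical reference model fit):

cvvs <- cv_varsel(fit, method = "forward")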
cv_method: The CV method, either "LOO" or "kfold". In the "LOO" case, a Pareto-smoothed importance sampling leave-one-out CV (PSIS-LOO CV; see Vehtari et al. 2017, 2022 in section "References" below) is performed, which avoids refitting the reference model nloo times (in contrast to a standard LOO CV). In the "kfold" case, a K-fold CV is performed.
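For illustration (a sketch; fit denotes a hypothetical reference model fit):

# PSIS-LOO CV (the default for genuine reference models):
cvvs_loo <- cv_varsel(fit, cv_method = "LOO")
# K-fold CV with 10 folds instead:
cvvs_kfold <- cv_varsel(fit, cv_method = "kfold", K = 10)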
ndraws: Number of posterior draws used in the search part. Ignored if nclusters is not NULL or in case of L1 search (because L1 search always uses a single cluster). If both nclusters and ndraws are NULL, the number of posterior draws from the reference model is used for ndraws. See also section "Details" below.
nclusters: Number of clusters of posterior draws used in the search part. Ignored in case of L1 search (because L1 search always uses a single cluster). For the meaning of NULL, see argument ndraws. See also section "Details" below.
ndraws_pred: Only relevant if refit_prj is TRUE. Number of posterior draws used in the evaluation part. Ignored if nclusters_pred is not NULL. If both nclusters_pred and ndraws_pred are NULL, the number of posterior draws from the reference model is used for ndraws_pred. See also section "Details" below.
nclusters_pred: Only relevant if refit_prj is TRUE. Number of clusters of posterior draws used in the evaluation part. For the meaning of NULL, see argument ndraws_pred. See also section "Details" below.
refit_prj: A single logical value indicating whether to fit the submodels along the solution path again (TRUE) or to retrieve their fits from the search part (FALSE) before using those (re-)fits in the evaluation part.
nterms_max: Maximum number of predictor terms up to which the search is continued. If NULL, then min(19, D) is used, where D is the number of terms in the reference model (or in search_terms, if supplied). Note that nterms_max does not count the intercept, so use nterms_max = 0 for the intercept-only model. (Correspondingly, D above does not count the intercept.)
penalty: Only relevant for L1 search. A numeric vector determining the relative penalties or costs for the predictors. A value of 0 means that those predictors have no cost and will therefore be selected first, whereas Inf means those predictors will never be selected. If NULL, then 1 is used for each predictor.
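For example, with three predictors whose penalties are assumed to be matched positionally (an illustrative assumption; fit denotes a hypothetical reference model fit), the first predictor could be made cost-free and the third one excluded:

cvvs <- cv_varsel(fit, method = "L1", penalty = c(0, 1, Inf))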
verbose: A single logical value indicating whether to print out additional information during the computations.
nloo: Caution: Still experimental. Only relevant if cv_method = "LOO". Number of subsampled LOO CV folds (Magnusson et al. 2019), i.e., the number of observations used for the LOO CV (anything between 1 and the original number of observations). Smaller values lead to faster computation but higher uncertainty in the evaluation part. If NULL, all observations are used, but for faster experimentation, this can be set to a smaller value.
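For example, to subsample 25 observations for a quick experimental run (a sketch; fit denotes a hypothetical reference model fit):

cvvs <- cv_varsel(fit, cv_method = "LOO", nloo = 25)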
Only relevant if cv_method = "kfold"
and if the reference model
was created with cvfits
being NULL
(which is the case for
get_refmodel.stanreg()
and brms::get_refmodel.brmsfit()
). Number of
folds in \(K\)-fold CV.
lambda_min_ratio: Only relevant for L1 search. Ratio between the smallest and largest lambda in the L1-penalized search. This parameter essentially determines how long the search is carried out, i.e., how large the explored submodels are. There is no need to change this unless the program gives a warning about it.

nlambda: Only relevant for L1 search. Number of values in the lambda grid for the L1-penalized search. There is no need to change this unless the program gives a warning about it.

thresh: Only relevant for L1 search. Convergence threshold when computing the L1 path. Usually, there is no need to change this.

regul: A number giving the amount of ridge regularization when projecting onto (i.e., fitting) submodels which are GLMs. Usually, there is no need for regularization, but sometimes regularization is needed to avoid numerical problems.
Only relevant if cv_method = "LOO"
. A single logical
value indicating whether to cross-validate also the search part, i.e.,
whether to run the search separately for each CV fold (TRUE
) or not
(FALSE
). We strongly do not recommend setting this to FALSE
, because
this is known to bias the predictive performance estimates of the selected
submodels. However, setting this to FALSE
can sometimes be useful because
comparing the results to the case where this argument is TRUE
gives an
idea of how strongly the variable selection is (over-)fitted to the data
(the difference corresponds to the search degrees of freedom or the
effective number of parameters introduced by the search).
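A sketch of such a comparison (fit denotes a hypothetical reference model fit; a fixed seed is used so that both runs cluster the posterior draws identically):

cvvs_full <- cv_varsel(fit, validate_search = TRUE, seed = 1234)
cvvs_nosearch <- cv_varsel(fit, validate_search = FALSE, seed = 1234)
# Comparing the performance estimates of the two objects (e.g., via their
# summary() or plot() methods) hints at how strongly the search overfits.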
seed: Pseudorandom number generation (PRNG) seed by which the same results can be obtained again if needed. Passed to argument seed of set.seed(), but can also be NA to not call set.seed() at all. Here, this seed is used for clustering the reference model's posterior draws (if !is.null(nclusters) or !is.null(nclusters_pred)), for subsampling LOO CV folds (if nloo is smaller than the number of observations), for sampling the folds in K-fold CV, and for drawing new group-level effects when predicting from a multilevel submodel (however, not yet in the case of a GAMM).
search_terms: Only relevant for forward search. A custom character vector of predictor term blocks to consider for the search. Section "Details" below describes more precisely what "predictor term block" means. The intercept ("1") is always included internally via union(), so there is no difference between including it explicitly and omitting it. The default search_terms considers all the terms in the reference model's formula.
Details

Arguments ndraws, nclusters, nclusters_pred, and ndraws_pred are automatically truncated at the number of posterior draws in the reference model (which is 1 for datafits). Using fewer draws or clusters in ndraws, nclusters, nclusters_pred, or ndraws_pred than posterior draws in the reference model may result in slightly inaccurate projection performance. Increasing these arguments affects the computation time linearly.

For argument method, there are some restrictions: For a reference model with multilevel or additive formula terms or a reference model set up for the augmented-data projection, only the forward search is available. Furthermore, argument search_terms requires a forward search to take effect.
L1 search is faster than forward search, but forward search may be more accurate. Furthermore, forward search may find a sparser model with comparable performance to that found by L1 search, but it may also start overfitting when more predictors are added.
An L1 search may select interaction terms before the corresponding main terms are selected. If this is undesired, choose the forward search instead.
The elements of the search_terms character vector don't need to be individual predictor terms. Instead, they can be building blocks consisting of several predictor terms connected by the + symbol. To understand how these building blocks work, it is important to know how projpred's forward search works: It starts with an empty vector chosen which will later contain the already selected predictor terms. Then, the search iterates over model sizes \(j \in \{1, ..., J\}\). The candidate models at model size \(j\) are constructed from those elements of search_terms which yield model size \(j\) when combined with the chosen predictor terms. Note that sometimes, there may be no candidate models for model size \(j\). Also note that internally, search_terms is expanded to include the intercept ("1"), so the first step of the search (model size 1) always consists of the intercept-only model as the only candidate. This candidate construction is illustrated by the sketch below.
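The following standalone sketch (not projpred's actual implementation) mimics this candidate construction for the search_terms example discussed below; note that, e.g., y ~ 1 + x1 denotes the same model as y ~ x1:

# Hypothetical search_terms forcing x1 into all non-empty candidate models:
search_terms <- c("x1", "x1 + x2", "x1 + x3", "x1 + x2 + x3")
search_terms <- union("1", search_terms)  # the intercept is added internally
chosen <- character(0)  # terms selected so far ("1" denotes the intercept)
for (j in 1:4) {
  # Combine each search_terms element with the already chosen terms (every
  # model contains the intercept) and keep the combinations of size j:
  models <- lapply(search_terms, function(s) {
    union(c("1", chosen), trimws(strsplit(s, "+", fixed = TRUE)[[1]]))
  })
  candidates <- unique(Filter(function(m) length(m) == j, models))
  for (m in candidates) {
    cat("size", j, "candidate: y ~", paste(m, collapse = " + "), "\n")
  }
  # The real search fits all candidates and keeps the best one; this sketch
  # simply keeps the first candidate in order to advance the loop:
  if (length(candidates) > 0) chosen <- candidates[[1]]
}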
As a search_terms example, consider a reference model with formula y ~ x1 + x2 + x3. Then, to ensure that x1 is always included in the candidate models, specify search_terms = c("x1", "x1 + x2", "x1 + x3", "x1 + x2 + x3"). This search would start with y ~ 1 as the only candidate at model size 1. At model size 2, y ~ x1 would be the only candidate. At model size 3, y ~ x1 + x2 and y ~ x1 + x3 would be the two candidates. At the last model size of 4, y ~ x1 + x2 + x3 would be the only candidate. As another example, to exclude x1 from the search, specify search_terms = c("x2", "x3", "x2 + x3").
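A corresponding call could look as follows (a sketch; fit denotes a hypothetical reference model fit; recall that search_terms requires a forward search):

cvvs <- cv_varsel(
  fit, method = "forward",
  search_terms = c("x1", "x1 + x2", "x1 + x3", "x1 + x2 + x3")
)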
References

Magnusson, Måns, Michael Andersen, Johan Jonasson, and Aki Vehtari. 2019. "Bayesian Leave-One-Out Cross-Validation for Large Data." In Proceedings of the 36th International Conference on Machine Learning, edited by Kamalika Chaudhuri and Ruslan Salakhutdinov, 97:4244-53. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v97/magnusson19a.html.

Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. "Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC." Statistics and Computing 27 (5): 1413-32. https://doi.org/10.1007/s11222-016-9696-4.

Vehtari, Aki, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah Gabry. 2022. "Pareto Smoothed Importance Sampling." arXiv. https://doi.org/10.48550/arXiv.1507.02646.
See Also

varsel()
Examples

if (FALSE) { # identical(Sys.getenv("RUN_EX"), "true")
# Note: The code from this example is not executed when called via example().
# To execute it, you have to copy and paste it manually to the console.
if (requireNamespace("rstanarm", quietly = TRUE)) {
  # Data:
  dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)

  # The "stanreg" fit which will be used as the reference model (with small
  # values for `chains` and `iter`, but only for technical reasons in this
  # example; this is not recommended in general):
  fit <- rstanarm::stan_glm(
    y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
    QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
  )

  # Variable selection with cross-validation (with small values
  # for `nterms_max`, `nclusters`, and `nclusters_pred`, but only for the
  # sake of speed in this example; this is not recommended in general):
  cvvs <- cv_varsel(fit, nterms_max = 3, nclusters = 5, nclusters_pred = 10,
                    seed = 5555)

  # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`,
  # and `?solution_terms.vsel` for possible post-processing functions.
}
}