plot.vsel: Plot summary statistics of a variable selection

Description

This is the plot() method for vsel objects (returned by varsel() or cv_varsel()).

Usage

# S3 method for vsel
plot(
  x,
  nterms_max = NULL,
  stats = "elpd",
  deltas = FALSE,
  alpha = 2 * pnorm(-1),
  baseline = if (!inherits(x$refmodel, "datafit")) "ref" else "best",
  thres_elpd = NA,
  resp_oscale = TRUE,
  ...
)

Arguments

x

An object of class vsel (returned by varsel() or cv_varsel()).

nterms_max

Maximum submodel size for which the statistics are calculated. Using NULL is effectively the same as using length(solution_terms(object)). Note that nterms_max does not count the intercept, so use nterms_max = 0 for the intercept-only model. For plot.vsel(), nterms_max must be at least 1.

stats

One or more character strings determining which performance statistics (i.e., utilities or losses) to calculate. Available statistics are:

"elpd": (expected) sum of log predictive densities.
"mlpd": mean log predictive density, that is, "elpd" divided by the number of observations.
"mse": mean squared error (only available in the situations mentioned in section "Details" below).
"rmse": root mean squared error (only available in the situations mentioned in section "Details" below). For the corresponding standard error and lower and upper confidence interval bounds, bootstrapping is used.
"acc" (or its alias, "pctcorr"): classification accuracy (only available in the situations mentioned in section "Details" below).
"auc": area under the ROC curve (only available in the situations mentioned in section "Details" below). For the corresponding standard error and lower and upper confidence interval bounds, bootstrapping is used.

deltas

If TRUE, the submodel statistics are estimated as differences from the baseline model (see argument baseline). With a "difference from the baseline model", we mean to take the submodel statistic minus the baseline model statistic (not the other way round).

alpha

A number determining the (nominal) coverage 1 - alpha of the normal-approximation (or bootstrap; see argument stats) confidence intervals. For example, in case of the normal approximation, alpha = 2 * pnorm(-1) corresponds to a confidence interval stretching by one standard error on either side of the point estimate.

baseline

For summary.vsel(): Only relevant if deltas is TRUE. For plot.vsel(): Always relevant. Either "ref" or "best", indicating whether the baseline is the reference model or the best submodel found (in terms of stats[1]), respectively.

thres_elpd

Only relevant if any(stats %in% c("elpd", "mlpd")). The threshold for the ELPD difference (taking the submodel's ELPD minus the baseline model's ELPD) above which the submodel's ELPD is considered to be close enough to the baseline model's ELPD. An equivalent rule is applied in case of the MLPD. See suggest_size() for a formalization. Supplying NA deactivates this.

resp_oscale

Only relevant for the latent projection. A single logical value indicating whether to calculate the performance statistics on response scale (TRUE) or on latent scale (FALSE).

...

Arguments passed to the internal function which is used for bootstrapping (if applicable; see argument stats). Currently, relevant arguments are B (the number of bootstrap samples, defaulting to 2000) and seed (see set.seed(), defaulting to sample.int(.Machine$integer.max, 1), but can also be NA to not call set.seed() at all).

Horizontal lines

As long as the reference model's performance is computable, it is always shown in the plot as a dashed red horizontal line. If baseline = "best", the baseline model's performance is shown as a dotted black horizontal line. If !is.na(thres_elpd) and any(stats %in% c("elpd", "mlpd")), the value supplied to thres_elpd (which is automatically adapted internally in case of the MLPD or deltas = FALSE) is shown as a dot-dashed gray horizontal line for the reference model and, if baseline = "best", as a long-dashed green horizontal line for the baseline model.

Details

The stats options "mse" and "rmse" are only available for:

the traditional projection,
the latent projection with resp_oscale = FALSE,
the latent projection with resp_oscale = TRUE in combination with <refmodel>$family$cats being NULL.

The stats option "acc" (= "pctcorr") is only available for:

the binomial() family in case of the traditional projection,
all families in case of the augmented-data projection,
the binomial() family (on the original response scale) in case of the latent projection with resp_oscale = TRUE in combination with <refmodel>$family$cats being NULL,
all families (on the original response scale) in case of the latent projection with resp_oscale = TRUE in combination with <refmodel>$family$cats being not NULL.

The stats option "auc" is only available for:

the binomial() family in case of the traditional projection,
the binomial() family (on the original response scale) in case of the latent projection with resp_oscale = TRUE in combination with <refmodel>$family$cats being NULL.

Examples

Run this code

if (requireNamespace("rstanarm", quietly = TRUE)) {
  # Data:
  dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)

  # The "stanreg" fit which will be used as the reference model (with small
  # values for `chains` and `iter`, but only for technical reasons in this
  # example; this is not recommended in general):
  fit <- rstanarm::stan_glm(
    y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
    QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
  )

  # Variable selection (here without cross-validation and with small values
  # for `nterms_max`, `nclusters`, and `nclusters_pred`, but only for the
  # sake of speed in this example; this is not recommended in general):
  vs <- varsel(fit, nterms_max = 3, nclusters = 5, nclusters_pred = 10,
               seed = 5555)
  print(plot(vs))
}

Run the code above in your browser using DataLab