variable_selection_bsw: Variable Selection (Forward or Backward) for models of `BSW()`

Description

Performs forward or backward variable selection based on Wald test p-values for models estimated using bsw(). In each step, a new model is fitted using bsw(), and variables are added or removed based on the significance level defined by alpha.
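
The backward direction of the procedure described above can be sketched in base R. This is an illustration only, not the package's implementation: stats::glm() with a logistic link stands in for bsw(), the helper name backward_wald is hypothetical, and the sketch assumes numeric predictors so that term labels match coefficient names.

```r
# Illustrative backward elimination driven by Wald p-values.
# Assumption: glm() is a stand-in; variable_selection_bsw() refits with bsw().
backward_wald <- function(formula, data, alpha = 0.157) {
  response <- all.vars(formula)[1]
  terms_in <- attr(terms(formula, data = data), "term.labels")
  repeat {
    rhs <- if (length(terms_in) > 0) paste(terms_in, collapse = " + ") else "1"
    fit <- glm(as.formula(paste(response, "~", rhs)),
               family = binomial(), data = data)
    if (length(terms_in) == 0) break
    # Wald p-values of the terms currently in the model
    pvals <- summary(fit)$coefficients[terms_in, "Pr(>|z|)"]
    if (max(pvals) <= alpha) break
    # Drop the least significant term and refit
    terms_in <- terms_in[-which.max(pvals)]
  }
  fit
}

set.seed(42)
sim <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
sim$y <- rbinom(300, 1, plogis(-0.5 + 1.5 * sim$x1))
fit_sel <- backward_wald(y ~ x1 + x2, data = sim)
names(coef(fit_sel))
```

Forward selection follows the same pattern in reverse: start from the intercept-only model and add the candidate with the smallest p-value while it stays below alpha.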

Usage

variable_selection_bsw(model, selection = c("backward", "forward"), alpha = 0.157,
                              print_models = FALSE, maxit = NULL, conswitch = NULL)

Value

An object of class "bsw_selection", which is a list containing:

final_model: An object of class bsw representing the final model selected through the variable selection process.
model_list: A list of intermediate bsw model objects fitted during each step of the selection.
skipped_models: A named list of models that failed to converge and were skipped during the selection. Each entry includes the attempted formula.
final_formula: The final model formula used in the last step.
EPV: Estimated events-per-variable (EPV) of the final model, used as a diagnostic for model stability.
warnings: Optional warning messages about convergence issues or model stability (e.g., low EPV or skipped variables).
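
As a rough illustration of the EPV diagnostic, the sketch below divides the number of events by the number of predictors. The helper epv is hypothetical, and the exact definition used by the package (for example, which variables are counted) is an assumption here.

```r
# Assumption: EPV = number of events / number of non-intercept predictors.
epv <- function(y, n_predictors) {
  sum(y == 1) / n_predictors
}

y_demo <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 0)  # 4 events among 10 observations
epv(y_demo, n_predictors = 2)              # 4 / 2 = 2
```

Low values indicate that the model has few events relative to its number of predictors, which is why EPV is reported as a stability diagnostic.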

Arguments

model

A model object from bsw() with full data and formula.

selection

Character string, either "backward" or "forward". Determines the direction of model selection. If not specified, backward elimination is performed by default.

alpha

P-value threshold for variable inclusion (forward) or exclusion (backward). Defaults to 0.157, as recommended by Heinze, G., Wallisch, C., & Dunkler, D. (2018).
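
The value 0.157 corresponds to the significance level implied by AIC-based selection for a term with one degree of freedom: such a term lowers the AIC when its chi-square statistic exceeds 2. A short check in base R:

```r
# Tail probability of a chi-square(1) statistic at the AIC cutoff of 2
alpha_aic <- pchisq(2, df = 1, lower.tail = FALSE)
round(alpha_aic, 3)  # 0.157
```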

print_models

Logical; whether to print each model during selection. Defaults to FALSE.

maxit

Maximum number of iterations for the bsw() algorithm. If NULL, defaults to 200L or to the value from the original model call.

conswitch

Specifies how the constraint matrix is constructed:

1 (default): Generates all possible combinations of minimum and maximum values for the predictors (excluding the intercept), resulting in \(2^{m-1}\) constraints. This formulation constrains model predictions within the observed data range, making it suitable for both risk factor identification and prediction (prognosis).

0: Uses the raw design matrix x as the constraint matrix, resulting in \(n\) constraints. This is primarily suitable for identifying risk factors, but not for prediction tasks, as predictions are not bounded to realistic ranges.
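
To make the difference concrete, the sketch below builds a min/max constraint matrix in base R. It is not the package's internal code: the helper minmax_constraints is hypothetical, and it assumes that \(m\) counts the columns of the design matrix including the intercept.

```r
# Hypothetical helper, not part of the BSW package.
# x: design matrix whose first column is the intercept.
minmax_constraints <- function(x) {
  preds <- x[, -1, drop = FALSE]
  # Observed minimum and maximum of every predictor
  ranges <- lapply(seq_len(ncol(preds)), function(j) range(preds[, j]))
  names(ranges) <- colnames(preds)
  # All combinations of the per-predictor extremes
  corners <- as.matrix(expand.grid(ranges))
  cbind("(Intercept)" = 1, corners)
}

set.seed(1)
demo <- data.frame(x1 = rnorm(20), x2 = rnorm(20), x3 = rnorm(20))
x <- model.matrix(~ x1 + x2 + x3, data = demo)  # n = 20 rows, m = 4 columns
A <- minmax_constraints(x)
dim(A)  # 8 rows (2^(4 - 1)) and 4 columns
dim(x)  # 20 rows: with conswitch = 0 the design matrix itself is used
```

The number of min/max constraints doubles with each additional predictor, whereas the raw design matrix always contributes \(n\) constraints.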

Author

Julius Johannes Weise, Thomas Wolf, Stefan Wagenpfeil

References

Heinze, G., Wallisch, C., & Dunkler, D. (2018). Variable selection – A review and recommendations for the practicing statistician. Biometrical Journal, 60(3), 431–449.

Examples

library(BSW)

# Simulate four predictors; only x1 and x3 influence the outcome
set.seed(123)
x1 <- rnorm(500, 50, 10)
x2 <- rnorm(500, 30, 5)
x3 <- rnorm(500, 40, 8)
x4 <- rnorm(500, 60, 12)
logit <- (-4 + x1 * 0.04 + x3 * 0.04)
p <- 1 / (1 + exp(-logit))
y <- rbinom(500, 1, p)
df <- data.frame(y, x1, x2, x3, x4)

# Fit the full model, then run forward selection at alpha = 0.1
fit <- bsw(formula = y ~ x1 + x2 + x3 + x4, data = df)
result <- variable_selection_bsw(fit, selection = "forward", alpha = 0.1)
print(result)
