Performs stepwise regression model selection using various strategies and selection criteria. Supports multiple regression types including linear, logistic, Cox, Poisson, Gamma, and negative binomial regression.
stepwise(
formula,
data,
type = c("linear", "logit", "cox", "poisson", "gamma", "negbin"),
strategy = c("forward", "backward", "bidirection", "subset"),
metric = c("AIC", "AICc", "BIC", "CP", "HQ", "adjRsq", "SL", "SBC", "IC(3/2)", "IC(1)"),
sle = 0.15,
sls = 0.15,
include = NULL,
test_method_linear = c("Pillai", "Wilks", "Hotelling-Lawley", "Roy"),
test_method_glm = c("Rao", "LRT"),
test_method_cox = c("efron", "breslow", "exact"),
tolerance = 1e-07,
weight = NULL,
best_n = 3,
test_ratio = 0,
feature_ratio = 1,
seed = 123,
num_digits = 6
)
A StepReg class object, which is a structured list containing both the input specifications and the outcomes of the stepwise regression analysis. The key components of this object are detailed below.
argument A data.frame containing the user-specified settings and parameters used in the analysis, including the initial formula, regression type, selection strategy, chosen metrics, significance levels (sle/sls), tolerance threshold, test method, and other control parameters.
variable A data.frame containing information about all variables in the model, including variable names, data types (numeric, factor, etc.), and their roles (Dependent/Independent) in the model.
performance A data.frame providing detailed performance metrics for the selected models across different strategies and metrics. For both training and test datasets (when test_ratio > 0), the output includes model-specific performance indicators:
For linear, poisson, gamma, and negative binomial regression:
adj_r2_train/adj_r2_test: Adjusted R-squared measures the proportion of variance explained by the model, adjusted for the number of predictors. It is at most 1 (and can be negative for very poor fits), with higher values indicating better model fit. A good model should have high adjusted R-squared on both training and test data, with minimal difference between them; large differences suggest overfitting.
mse_train/mse_test: Mean Squared Error measures the average squared difference between predicted and actual values. Lower values indicate better model performance. The test MSE should be close to training MSE; significantly higher test MSE suggests overfitting.
mae_train/mae_test: Mean Absolute Error measures the average absolute difference between predicted and actual values. Lower values indicate better model performance. Like MSE, test MAE should be close to training MAE to avoid overfitting.
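The two error metrics above can be reproduced by hand; a minimal base-R sketch with made-up numbers:

```r
# Toy vectors (made-up numbers) illustrating the error metrics
# reported in the performance component
actual <- c(3.1, 2.5, 4.0, 3.6)
pred   <- c(3.0, 2.7, 3.8, 3.9)

mse <- mean((actual - pred)^2)   # mean squared error -> 0.045
mae <- mean(abs(actual - pred))  # mean absolute error -> 0.2
```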
For logistic regression:
accuracy_train/accuracy_test: Accuracy measures the proportion of correct predictions (true positives + true negatives) / total predictions. Values range from 0 to 1, with higher values indicating better classification performance. Test accuracy should be close to training accuracy; large differences suggest overfitting.
auc_train/auc_test: Area Under the Curve measures the model's ability to distinguish between classes. Values range from 0.5 (random) to 1.0 (perfect discrimination). AUC > 0.7 is considered acceptable, > 0.8 is good, > 0.9 is excellent. Test AUC should be close to training AUC to avoid overfitting.
log_loss_train/log_loss_test: Log Loss (logarithmic loss) penalizes confident wrong predictions more heavily. Lower values indicate better model performance. Values close to 0 are ideal. Test log loss should be close to training log loss; higher test log loss suggests overfitting.
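Accuracy and log loss are likewise easy to illustrate on made-up predicted probabilities:

```r
# Toy probabilities (made-up numbers) illustrating the classification metrics
p <- c(0.9, 0.2, 0.7, 0.4)   # predicted probabilities
y <- c(1,   0,   1,   1)     # observed classes

accuracy <- mean(as.integer(p > 0.5) == y)            # 0.75
log_loss <- -mean(y * log(p) + (1 - y) * log(1 - p))  # ~0.4004
```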
For Cox regression:
c-index_train/c-index_test: Concordance Index (C-index) measures the model's ability to correctly rank survival times. Values range from 0.5 (random) to 1.0 (perfect ranking). C-index > 0.7 is considered acceptable, > 0.8 is good, > 0.9 is excellent. Test C-index should be close to training C-index to avoid overfitting.
auc_hc: Harrell's C-index for time-dependent AUC, measuring discrimination at specific time points. Higher values indicate better discrimination ability.
auc_uno: Uno's C-index for time-dependent AUC, providing an alternative measure of discrimination that may be more robust to censoring patterns.
auc_sh: Schemper and Henderson's C-index for time-dependent AUC, offering another perspective on model discrimination performance.
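As a rough illustration of what the concordance index measures, here is a base-R sketch on made-up, uncensored data (a real C-index computation also has to handle censoring):

```r
# Made-up survival times and risk scores; higher risk should pair
# with shorter survival time
time <- c(5, 8, 3, 9)
risk <- c(0.9, 0.4, 0.8, 0.2)

pairs <- combn(length(time), 2)
concordant <- apply(pairs, 2, function(p)
  (risk[p[1]] - risk[p[2]]) * (time[p[1]] - time[p[2]]) < 0)
c_index <- mean(concordant)   # 5 of 6 pairs ranked correctly -> ~0.833
```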
overview A nested list organized by strategy and metric, containing step-by-step summaries of the model-building process. Each element shows which variables were entered or removed at each step along with the corresponding metric values (e.g., AIC, BIC, SBC).
detail A nested list organized by strategy and metric, providing granular information about each candidate step. This includes which variables were tested, their evaluation statistics, p-values, and whether they were ultimately selected or rejected.
fitted model object within the strategy-specific list A nested list whose first layer is the selection strategy (e.g., forward, backward, bidirection, subset) and whose second layer is the metric (e.g., AIC, BIC, SBC). For each strategy-metric combination, the function returns a fitted model object that can be further analyzed with S3 generic functions such as summary(), anova(), or coefficients(); these dispatch on the model class (e.g., coxph, lm, glm). Specific statistics can be retrieved directly with the $ operator, such as result$forward$AIC$coefficients. The level of detail depends on the model type: the survival package enriches coxph objects with detailed statistics including hazard ratios, standard errors, z-statistics, p-values, and likelihood ratio tests, whereas base R objects from lm and glm print only coefficients by default, requiring summary() or anova() to reveal standard errors, t-values, p-values, and R-squared values.
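A short sketch of this access pattern (assumes the StepReg package is installed):

```r
library(StepReg)
data(mtcars)

result <- stepwise(mpg ~ ., data = mtcars, type = "linear",
                   strategy = "forward", metric = "AIC")

fit <- result$forward$AIC   # fitted model for the forward/AIC combination
summary(fit)                # standard errors, t-values, p-values, R-squared
coefficients(fit)           # coefficient estimates only
```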
A formula object specifying the model structure:
Response variable(s) on left side of ~
Predictor variable(s) on right side of ~
Use + to separate multiple predictors
Use * for main effect and interaction terms
Use : for a continuous variable nested within a class variable; make sure the class variable is a factor, e.g., X:A or A:X means a continuous variable X nested within a factor variable A
Use . to include all variables
Use cbind() for multiple responses
Use 0 or -1 to exclude intercept
Use strata() to include strata variable for Cox regression
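The conventions above written out in plain R (the Surv() and strata() terms assume the survival package):

```r
library(survival)  # for Surv() and strata() in Cox formulas

f1 <- mpg ~ hp + wt                           # two predictors
f2 <- mpg ~ hp * wt                           # main effects plus interaction
f3 <- mpg ~ am + wt:am                        # wt nested within factor am
f4 <- cbind(mpg, drat) ~ .                    # multiple responses, all predictors
f5 <- mpg ~ . + 0                             # all predictors, no intercept
f6 <- Surv(time, status) ~ age + strata(sex)  # Cox model with a stratum
```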
A data frame containing the variables in the model
The type of regression model to fit:
"linear" - Linear regression (default)
"logit" - Logistic regression
"poisson" - Poisson regression
"cox" - Cox proportional hazards regression
"gamma" - Gamma regression
"negbin" - Negative binomial regression
The model selection strategy:
"forward" - Forward selection (default)
"backward" - Backward elimination
"bidirection" - Bidirectional elimination
"subset" - Best subset selection
The model selection criterion:
"AIC" - Akaike Information Criterion (default)
"AICc" - Corrected AIC
"BIC" - Bayesian Information Criterion
"CP" - Mallows' Cp
"HQ" - Hannan-Quinn criterion
"adjRsq" - Adjusted R-squared
"SL" - Significance Level
"SBC" - Schwarz Bayesian Criterion
"IC(3/2)" - Information Criterion with penalty 3/2
"IC(1)" - Information Criterion with penalty 1
Significance Level to Enter (default: 0.15). A predictor must have p-value < sle to enter the model.
Significance Level to Stay (default: 0.15). A predictor must have p-value < sls to remain in the model.
Character vector of predictor variables that must be included in all models.
Test method for multivariate linear regression:
"Pillai" (default)
"Wilks"
"Hotelling-Lawley"
"Roy"
For univariate (single-response) linear regression, the F-test is used.
Test method for GLM models:
"Rao" (default)
"LRT"
Only "Rao" is available for the subset strategy.
Test method for Cox regression:
"efron" (default)
"breslow"
"exact"
Threshold for detecting multicollinearity (default: 1e-07). Lower values are more strict.
Optional numeric vector of observation weights. Values are coerced to [0,1].
Maximum number of models to retain for each variable count (default: 3)
Proportion of the dataset allocated for testing (default: 0; e.g., 0.3 means 30% of the dataset is used for testing), with the remainder reserved for training, enabling train-test validation.
Proportion of candidate features sampled uniformly at random during forward selection (default = 1). This randomized selection helps identify the best variables while reducing the risk of overfitting, and is only valid when strategy is "forward".
Seed for random number generation (default: 123); only used when test_ratio or feature_ratio is specified.
Number of decimal places to round results (default: 6)
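A hedged sketch combining several of these arguments (assumes StepReg is installed; the choice of "wt" as the forced-in variable is purely illustrative):

```r
library(StepReg)
data(mtcars)

res <- stepwise(mpg ~ ., data = mtcars, type = "linear",
                strategy = "forward", metric = "AIC",
                include    = "wt",   # wt is kept in every candidate model
                test_ratio = 0.3,    # 30% held out for testing, 70% for training
                seed       = 123)    # reproducible train-test split

res$performance   # compare *_train vs *_test metrics to check for overfitting
```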
Junhui Li, Kai Hu, Xiaohuan Lu
Alsubaihi et al. (2002) Variable selection in multivariable regression using SAS/IML
Darlington (1968) Multiple regression in psychological research and practice
Dharmawansa et al. (2014) Roy's largest root under rank-one alternatives
Hannan & Quinn (1979) The determination of the order of an autoregression
Hotelling (1992) The generalization of Student's ratio
Hocking (1976) The analysis and selection of variables in linear regression
Hurvich & Tsai (1989) Regression and time series model selection in small samples
Judge (1985) The theory and practice of econometrics
Mallows (1973) Some comments on Cp
Mardia et al. (1979) Multivariate analysis
McKeon (1974) F approximations to the distribution of Hotelling's T^2_0
McQuarrie & Tsai (1998) Regression and time series model selection
Pillai (1955) Some new test criteria in multivariate analysis
Sparks et al. (1985) On variable selection in multivariate regression
Sawa (1978) Information criteria for discriminating among alternative regression models
Schwarz (1978) Estimating the dimension of a model
# Multivariate linear regression with bidirectional selection
data(mtcars)
formula <- cbind(mpg, drat) ~ . + 0
result1 <- stepwise(
formula = formula,
data = mtcars,
type = "linear",
strategy = "bidirection",
metric = "AIC"
)
summary(result1$bidirection$AIC)
anova(result1$bidirection$AIC)
coefficients(result1$bidirection$AIC)
# Linear regression with multiple strategies and metrics
formula <- mpg ~ . + 1
result2 <- stepwise(
formula = formula,
data = mtcars,
type = "linear",
strategy = c("forward", "bidirection"),
metric = c("AIC", "SBC", "SL", "AICc", "BIC", "HQ")
)
summary(result2$forward$AIC)
anova(result2$forward$AIC)
coefficients(result2$forward$AIC)
# Logistic regression with significance level criteria
data(remission)
formula <- remiss ~ .
result3 <- stepwise(
formula = formula,
data = remission,
type = "logit",
strategy = "forward",
metric = "SL",
sle = 0.05,
sls = 0.05
)
summary(result3$forward$SL)
anova(result3$forward$SL)
coefficients(result3$forward$SL)
# Linear regression with continuous-nested-within-class effects
mtcars$am <- factor(mtcars$am)
formula <- mpg ~ am + cyl + wt:am + disp:am + hp:am
result4 <- stepwise(
formula = formula,
data = mtcars,
type = "linear",
strategy = "bidirection",
metric = "AIC"
)
summary(result4$bidirection$AIC)
anova(result4$bidirection$AIC)
coefficients(result4$bidirection$AIC)
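The examples above do not cover Cox regression; a hedged sketch using the survival package's lung dataset (the variable subset chosen here is purely illustrative):

```r
# Cox regression with forward selection (illustrative sketch)
library(StepReg)
library(survival)

lung2 <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog", "ph.karno")])
formula <- Surv(time, status) ~ .
result5 <- stepwise(
  formula = formula,
  data = lung2,
  type = "cox",
  strategy = "forward",
  metric = "AIC"
)
summary(result5$forward$AIC)
```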