regression_opt: Stepwise Multiple Regression Model Search based on Information Criteria

Description

Performs stepwise model selection for multiple regression using information criteria to identify the optimal regression model.

Usage

regression_opt(
  data = NULL,
  n = NULL,
  mat = NULL,
  dep_ind,
  n_calc = "individual",
  ic_type = "bic",
  ordered = FALSE,
  missing_handling = "stacked-mi",
  nimp = 20,
  imp_method = "pmm",
  ...
)

Value

A list with the following elements:

regression: Named vector of regression coefficients for the dependent variable.
R2: R-squared value of the regression model.
n: Sample size used in the regression model.
args: List of settings used in the regression model.

Arguments

data

Raw data matrix or data frame containing the variables to be included in the regression models. May include missing values. If data is NULL, a covariance or correlation matrix must be supplied in mat.

n

Numeric value specifying the sample size used in calculating the information criteria. If not provided, it is derived from data. When mat is supplied instead of raw data, n must be provided.

mat

Optional covariance or correlation matrix for the variables to be included in the regression. Used only when data is NULL.
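
A correlation matrix alone is enough to recover standardized regression coefficients and R-squared, which is why mat can stand in for raw data. The following is a minimal base R sketch of that relationship on simulated toy data; the variable names are hypothetical, and it is not taken from the internals of regression_opt.

```r
# Toy complete data (hypothetical); column 1 plays the role of the dependent variable.
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
y  <- 0.5 * x1 - 0.3 * x2 + rnorm(200)
dat <- data.frame(y = y, x1 = x1, x2 = x2)

R   <- cor(dat)   # correlation matrix, of the kind that could be supplied via mat
dep <- 1          # position of the dependent variable

# Standardized coefficients solve R_xx %*% beta = r_xy
beta <- solve(R[-dep, -dep], R[-dep, dep])

# R-squared is the inner product of r_xy and beta
r2 <- sum(R[-dep, dep] * beta)
```

Because no raw data are involved in this route, the sample size cannot be inferred and has to be supplied separately, as described for n.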

dep_ind

Index of the column in data to be used as the dependent variable in the regression model.

n_calc

Character string specifying how the sample size is calculated when n is not provided. Possible values are:

"individual": Uses the number of non-missing observations for the variable used as the dependent variable.

"average": Uses the average number of non-missing observations across all variables.

"max": Uses the maximum number of non-missing observations across all variables.

"total": Uses the total number of rows in data.
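
The four options can be illustrated with a small base R sketch on a hypothetical data frame. This only mirrors the verbal definitions above; details such as whether the package rounds the average are assumptions not confirmed by this page.

```r
# Toy data with missing values (hypothetical, for illustration only)
dat <- data.frame(
  y  = c(1, 2, NA, 4, 5, 6),
  x1 = c(NA, 2, 3, 4, NA, 6),
  x2 = c(1, 2, 3, 4, 5, 6)
)
dep_ind <- 1

# Non-missing observations per variable: y = 5, x1 = 4, x2 = 6
n_obs <- colSums(!is.na(dat))

n_individual <- unname(n_obs[dep_ind])  # non-missing count of the dependent variable
n_average    <- mean(n_obs)             # average non-missing count across variables
n_max        <- max(n_obs)              # largest non-missing count across variables
n_total      <- nrow(dat)               # number of rows, missing or not
```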

ic_type

Type of information criterion to compute for model selection. Options are "bic" (default), "aic", and "aicc".

ordered

Logical vector indicating whether each variable in data should be treated as ordered categorical when computing the correlation matrix. If a single logical value is supplied, it is recycled to all variables. Only used when data is provided.

missing_handling

Character string specifying how the correlation matrix is estimated from data in the presence of missing values. Possible values are:

"two-step-em": Uses a classical EM algorithm to estimate the correlation matrix from data.

"stacked-mi": Uses stacked multiple imputation to estimate the correlation matrix from data.

"pairwise": Uses pairwise deletion to compute correlations from data.

"listwise": Uses listwise deletion to compute correlations from data.
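
The two deletion-based options correspond to familiar base R behavior, sketched below with stats::cor on hypothetical data. This is an illustration of the concepts only; how regression_opt computes these matrices internally is not documented here, and the EM and multiple-imputation options are not shown.

```r
# Toy data with missing values (hypothetical)
dat <- data.frame(
  y  = c(1.0, 2.1, NA, 3.9, 5.2, 6.1, 6.8, 8.3),
  x1 = c(NA, 1.5, 2.9, 4.2, NA, 5.8, 7.1, 7.9),
  x2 = c(2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0)
)

# Pairwise deletion: each correlation uses all rows observed on that pair
r_pairwise <- cor(dat, use = "pairwise.complete.obs")

# Listwise deletion: only rows observed on every variable are used
r_listwise <- cor(dat, use = "complete.obs")
n_listwise <- sum(complete.cases(dat))
```

Pairwise deletion keeps more observations per correlation, but each entry of the matrix may then rest on a different subset of rows.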

nimp

Number of imputations (default: 20) to be used when missing_handling = "stacked-mi".

imp_method

Character string specifying the imputation method to be used when missing_handling = "stacked-mi" (default: "pmm" - predictive mean matching).

...

Further arguments passed to internal functions.

Details

This function performs stepwise model selection for multiple regression using information criteria. It was originally developed as a component of the neighborhood selection framework for network estimation [nehler.2024mantar], where each node-wise regression model is selected individually. However, the procedure can also be used as a standalone tool for exploratory regression model search, particularly in settings with missing data. Unlike standard stepwise regression functions, this implementation explicitly supports missing-data handling strategies, making it suitable for situations in which classical methods fail or produce biased results.
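
For comparison, a standard complete-data stepwise search of the kind contrasted above can be run with stats::step in base R. The sketch uses the built-in mtcars data and an arbitrary set of predictors; it is not the algorithm used by regression_opt, and step() itself offers no missing-data handling beyond what lm() does by default.

```r
# Complete-data comparison using base R only (built-in mtcars data).
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
n <- nrow(mtcars)

# k = log(n) makes step() penalize like BIC; k = 2 (the default) gives AIC.
selected <- step(full, direction = "both", k = log(n), trace = 0)

formula(selected)  # predictors retained by the search
```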

The argument ic_type specifies which information criterion is computed. All criteria are computed based on the log-likelihood of the maximum likelihood estimated regression model, where the residual variance determines the likelihood. The following options are available:

"aic": Akaike Information Criterion [akaike.1974mantar]; defined as \(\mathrm{AIC} = -2\ell + 2k\), where \(\ell\) is the log-likelihood of the model and \(k\) is the number of estimated parameters (including the intercept).
"bic": Bayesian Information Criterion [schwarz.1978mantar]; defined as \(\mathrm{BIC} = -2\ell + k \log(n)\), where \(\ell\) is the log-likelihood of the model, \(k\) is the number of estimated parameters (including the intercept) and \(n\) is the sample size.
"aicc": Corrected Akaike Information Criterion [hurvich.1989mantar]; particularly useful in small samples where AIC tends to be biased. Defined as \(\mathrm{AIC}_c = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}\), where \(k\) is the number of estimated parameters (including the intercept) and \(n\) is the sample size.

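The three criteria can be computed by hand from the residual variance of a fitted model, as in the base R sketch below. It uses the built-in mtcars data and, as an assumption, counts the residual variance as an estimated parameter in k (the convention of stats::logLik for lm objects); whether regression_opt counts k in exactly this way is not stated on this page.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nobs(fit)

# Gaussian log-likelihood evaluated at the ML estimate of the residual variance
rss       <- sum(residuals(fit)^2)
sigma2_ml <- rss / n
loglik    <- -n / 2 * (log(2 * pi) + log(sigma2_ml) + 1)

# ASSUMPTION: k = coefficients (including intercept) + residual variance
k <- length(coef(fit)) + 1

aic  <- -2 * loglik + 2 * k
bic  <- -2 * loglik + k * log(n)
aicc <- aic + (2 * k * (k + 1)) / (n - k - 1)
```

Under this parameter count, aic and bic agree with stats::AIC(fit) and stats::BIC(fit); the aicc correction term grows as k approaches n, which is why it matters mainly in small samples.
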
References

Examples

# For full data using AIC
# First variable of the data set as dependent variable
result <- regression_opt(
  data = mantar_dummy_full_cont,
  dep_ind = 1,
  ic_type = "aic"
)

# View regression coefficients and R-squared
result$regression
result$R2

# For data with missingness using BIC
# Second variable of the data set as dependent variable
# Using individual sample size of the dependent variable and the two-step EM algorithm

result_mis <- regression_opt(
  data = mantar_dummy_mis_cont,
  dep_ind = 2,
  n_calc = "individual",
  missing_handling = "two-step-em",
  ic_type = "bic"
)

# View regression coefficients and R-squared
result_mis$regression
result_mis$R2