auto_cor: Multicollinearity reduction via Pearson correlation

Description

Reduces multicollinearity by filtering predictors on their pairwise Pearson correlations. Predictors are ranked by user preference (or by column order) and evaluated one at a time: each candidate is added to the selected pool only if its maximum absolute correlation with the predictors already selected does not exceed cor.threshold.

Usage

auto_cor(
  x = NULL,
  preference.order = NULL,
  cor.threshold = 0.5,
  verbose = TRUE
)

Value

List with class variable_selection containing:

cor: Correlation matrix of selected variables (only if 2+ variables selected).
selected.variables: Character vector of selected variable names.
selected.variables.df: Data frame containing selected variables.

Arguments

x: Data frame with predictors, or a variable_selection object from auto_vif(). Default: NULL.
preference.order: Character vector specifying variable preference order. Does not need to include all variables in x. If NULL, column order is used. Default: NULL.
cor.threshold: Numeric between 0 and 1 (recommended: 0.5 to 0.9). Maximum allowed absolute Pearson correlation between selected variables. Default: 0.5.
verbose: Logical. If TRUE, prints messages about operations and removed variables. Default: TRUE.

Details

The algorithm follows these steps:

1. Rank predictors by preference.order (or by column order if preference.order is NULL).
2. Initialize the selection pool with the first-ranked predictor.
3. For each remaining candidate, in rank order:
   - Compute its maximum absolute correlation with the predictors already selected.
   - If this maximum is equal to or lower than cor.threshold, add the candidate to the selected pool.
   - Otherwise, skip the candidate.
4. Return the selected predictors.
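
The steps above can be sketched in base R as follows. This is only an illustration of the described logic, not the package's own source: greedy_cor_filter() and the toy data frame are hypothetical names introduced here, and the sketch assumes that all columns are numeric, contain no NA values, and have non-zero variance.

```r
# Hypothetical helper, not part of the package: a sketch of the sequential filter.
greedy_cor_filter <- function(x, preference.order = NULL, cor.threshold = 0.5) {
  # Preferred variables come first, then the remaining columns in column order.
  ranked <- unique(c(intersect(preference.order, colnames(x)), colnames(x)))

  # Absolute Pearson correlation between all pairs of predictors.
  cor.abs <- abs(stats::cor(x[, ranked, drop = FALSE], method = "pearson"))

  # The first-ranked predictor seeds the selection pool.
  selected <- ranked[1]

  for (candidate in ranked[-1]) {
    # Keep the candidate only if its maximum absolute correlation with the
    # predictors already selected does not exceed the threshold.
    if (max(cor.abs[candidate, selected]) <= cor.threshold) {
      selected <- c(selected, candidate)
    }
  }

  selected
}

# Toy data (an assumption for illustration): b is nearly a copy of a,
# c is generated independently of both.
set.seed(1)
a <- rnorm(200)
toy <- data.frame(
  a = a,
  b = a + rnorm(200, sd = 0.1),
  c = rnorm(200)
)

greedy_cor_filter(toy)
greedy_cor_filter(toy, preference.order = "b")
```

With this toy data, the first call should keep a and drop b, while putting b in preference.order should keep b and drop a instead. The result depends on the ranking because the filter never revisits a predictor once it has been selected.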

Data cleaning: Names in preference.order that are not found in colnames(x) are silently dropped from preference.order. Non-numeric columns are removed with a warning. Rows with NA values are removed via na.omit(). Zero-variance columns trigger a warning but are not removed.
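
A rough base R sketch of these cleaning rules is shown below. It is an assumption about how the rules could be implemented, not the package's internal code: clean_predictors(), the warning texts, and the raw data frame are hypothetical and exist only for illustration.

```r
# Hypothetical helper, not the package's internal code.
clean_predictors <- function(x, preference.order = NULL) {
  # Drop preferred names that are not columns of x, without any message.
  preference.order <- preference.order[preference.order %in% colnames(x)]

  # Remove non-numeric columns and warn about them.
  is.num <- vapply(x, is.numeric, logical(1))
  if (any(!is.num)) {
    warning(
      "Removing non-numeric columns: ",
      paste(colnames(x)[!is.num], collapse = ", ")
    )
    x <- x[, is.num, drop = FALSE]
  }

  # Remove rows with at least one NA.
  x <- stats::na.omit(x)

  # Warn about zero-variance columns, but keep them.
  zero.var <- vapply(x, function(v) isTRUE(stats::var(v) == 0), logical(1))
  if (any(zero.var)) {
    warning(
      "Zero-variance columns: ",
      paste(colnames(x)[zero.var], collapse = ", ")
    )
  }

  list(x = x, preference.order = preference.order)
}

# Toy input: one NA in a, a constant column b, and a character column g.
raw <- data.frame(
  a = c(1, 2, NA, 4),
  b = c(5, 5, 5, 5),
  g = c("u", "v", "w", "z")
)

cleaned <- suppressWarnings(
  clean_predictors(raw, preference.order = c("b", "missing"))
)

dim(cleaned$x)
cleaned$preference.order
```

In this sketch the character column g and the row containing NA are removed, the constant column b is kept, and the name "missing" is dropped from the preference order.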

This function can be chained with auto_vif() through pipes (see examples).

Examples

data(
  plants_df,
  plants_predictors
)

y <- auto_cor(
  x = plants_df[, plants_predictors]
)

y$selected.variables
y$cor
head(y$selected.variables.df)