Learn R Programming

collinear Automated Multicollinearity Management

Summary

The R package collinear provides a comprehensive toolkit for smart multicollinearity management in datasets with mixed variable types. The main function, collinear(), integrates five core components:

  • Target Encoding (function target_encoding_lab()): Transparently converts categorical predictors to numeric when required, enabling VIF and correlation analysis across mixed data types.

  • Intelligent Predictor Ranking (function preference_order()): Prioritizes predictors by their univariate association with the response to ensure that the most relevant ones are retained during filtering.

  • Unified Correlation Framework (function cor_df()): Computes pairwise correlations between any variable types using Pearson correlation (numeric-numeric), target encoding (numeric-categorical), and Cramer’s V (categorical-categorical) within a single, consistent workflow.

  • Adaptive Filtering Thresholds (function collinear()): Automatically configures correlation and VIF thresholds based on each dataset’s correlation structure, eliminating guesswork while allowing manual override.

  • Dual Filtering Strategy (function collinear_select()): Combines pairwise correlation and Variance Inflation Factor filtering while considering predictor rankings to manage multicollinearity while maximizing the predictive power of the resulting selection of predictors.

These methods, except target encoding are also fully integrated into the tidymodels implementation step_collinear().

The package also provides diagnostic functions (cor_df(), vif_df(), collinear_stats()) to help users assess multicollinearity in their data before and after filtering.

Install

The package collinear can be installed from CRAN.

install.packages("collinear")

The development version can be installed from GitHub.

remotes::install_github(
  repo = "blasbenito/collinear", 
  ref = "development"
  )

Previous versions are in the archive_xxx branches of the GitHub repository.

remotes::install_github(
  repo = "blasbenito/collinear", 
  ref = "archive_v2.0.0"
  )

Getting Started

Setup

Most functions in the collinear package support parallelization and progress bars via future and progressr. Parallelization, however, is only advantageous for large datasets including categorical predictors.

library(collinear)
library(future)
library(progressr)

future::plan(
  future::multisession,
  workers = future::availableCores() - 1
  )

#does not work in Rmarkdown
#progressr::handlers(global = TRUE)

Example Data

The example dataframe vi_smol has several response variables and a large set of predictors. Here we focus on the numeric response vi_numeric, and the complete set of numeric and categorical predictors, stored in the vector vi_predictors.

data(vi_smol, vi_predictors)
nrow(vi_smol)
#> [1] 610
length(vi_predictors)
#> [1] 58

Multicollinearity Analysis

The package provides several functions to assess multicollinearity.

The functions cor_df() and vif_df() generate dataframes with the pairwise correlations and VIF scores of the predictors, while the functions cor_stats(), vif_stats() (used below) and collinear_stats() generate descriptive multicollinearity statistics. Notice that the function takes a while to execute because computing correlations for the categorical variables in vi_predictors is computationally expensive.

collinear::vif_stats(
  df = vi_smol,
  predictors = vi_predictors,
  quiet = TRUE
)
#> Warning: 
#> collinear::vif_stats()
#> └── collinear::vif_df()
#>     └── collinear::cor_matrix()
#>         └── collinear::cor_df(): 11 categorical predictors have cardinality > 2 and may bias the multicollinearity analysis. Applying target encoding to convert them to numeric will solve this issue.
#>   method     statistic     value
#> 1    vif             n   58.0000
#> 2    vif       minimum -194.5080
#> 3    vif quantile_0.05  -11.2993
#> 4    vif quantile_0.25    0.4528
#> 5    vif          mean   67.8998
#> 6    vif        median   25.2999
#> 7    vif quantile_0.75   90.3970
#> 8    vif quantile_0.95  316.6229
#> 9    vif       maximum  629.6362

The quantile 0.75 returned by vif_stats() indicates that 25% of the predictors have a VIF score higher than 6.5. This suggests substantial redundancy in the set of predictors.

Multicollinearity Filtering

To reduce multicollinearity in vi_smol we apply collinear() with a minimal setup.

x <- collinear::collinear(
  df = vi_smol,
  response = "vi_numeric",
  predictors = vi_predictors
)
#> 
#> collinear::collinear()
#> └── collinear::validate_arg_df(): converted the following character columns to factor:
#>  - koppen_zone
#>  - koppen_group
#>  - koppen_description
#>  - biogeo_ecoregion
#>  - biogeo_biome
#>  - biogeo_realm
#>  - country_name
#>  - continent
#>  - region
#>  - subregion
#> 
#> collinear::collinear()
#> └── collinear::cor_df(): 11 categorical predictors have cardinality > 2 and may bias the multicollinearity analysis. Applying target encoding to convert them to numeric will solve this issue.
#> 
#> collinear::collinear(): setting 'max_cor' to 0.6842.
#> 
#> collinear::collinear(): setting 'max_vif' to 6.5269.
#> 
#> collinear::collinear()
#> └── collinear::preference_order()
#>     └── collinear::f_auto(): selected function 'f_numeric_rf()' to compute preference order.
#> 
#> collinear::collinear()
#> └── collinear::collinear_select()
#>     └── collinear::vif(): some VIF values exceeded 1M and were set to Inf.
#> 
#> collinear::collinear(): selected predictors: 
#>  - rainfall_mean
#>  - swi_mean
#>  - evapotranspiration_max
#>  - evapotranspiration_range
#>  - swi_min
#>  - soil_soc
#>  - humidity_range
#>  - topo_elevation
#>  - cloud_cover_range
#>  - continent
#>  - soil_sand
#>  - topo_diversity
#>  - topo_slope

The function returns an object of class collinear_output, that has its own print() and summary() methods.

x
#> Result
#> ===================
#>  - response: vi_numeric
#>    --------------------
#> 
#>  + df:
#>    - rows: 610
#>    - cols: 14
#> 
#>  + preference order:
#>    + df:
#>      - rows: 58
#>      - cols: 6
#>    + preference:
#>      - rainfall_mean
#>      - growing_season_rainfall
#>      - growing_season_length
#>      - swi_mean
#>      - aridity_index
#>    + f: f_numeric_rf
#> 
#>  + selection:
#>    - rainfall_mean
#>    - swi_mean
#>    - evapotranspiration_max
#>    - evapotranspiration_range
#>    - swi_min
#>    - ... (8 ommited)
#> 
#>  + formulas:
#>    - linear: vi_numeric ~ rainfall_mean + swi_mean + evapotranspiration_max + evapotranspiration_range + swi_min + ... (8 terms omitted) 
#>    - smooth: vi_numeric ~ s(rainfall_mean) + s(swi_mean) + s(evapotranspiration_max) + s(evapotranspiration_range) + s(swi_min) + ... (8 terms omitted)

The object df contains the response and the selected predictors.

colnames(x$vi_numeric$df)
#>  [1] "vi_numeric"               "rainfall_mean"           
#>  [3] "swi_mean"                 "evapotranspiration_max"  
#>  [5] "evapotranspiration_range" "swi_min"                 
#>  [7] "soil_soc"                 "humidity_range"          
#>  [9] "topo_elevation"           "cloud_cover_range"       
#> [11] "continent"                "soil_sand"               
#> [13] "topo_diversity"           "topo_slope"

The object preference_order contains the ranking of predictors. It is computed by the function preference_order() by assessing the association between the response and the predictors. In this case, it fits univariate models between the response and each predictor using f_numeric_rf, and returns the R-squared of the observations vs the model predictions. This ranking ensures that the most important predictors are protected during multicollinearity filtering.

head(x$vi_numeric$preference_order)
#>     response               predictor            f    metric  score rank
#> 1 vi_numeric           rainfall_mean f_numeric_rf R-squared 0.8835    1
#> 2 vi_numeric growing_season_rainfall f_numeric_rf R-squared 0.8731    2
#> 3 vi_numeric   growing_season_length f_numeric_rf R-squared 0.8644    3
#> 4 vi_numeric                swi_mean f_numeric_rf R-squared 0.8333    4
#> 5 vi_numeric           aridity_index f_numeric_rf R-squared 0.8279    5
#> 6 vi_numeric             koppen_zone f_numeric_rf R-squared 0.8174    6

The object selection contains non-collinear predictors chosen by taking into account their pairwise correlation and VIF against all other predictors, and their position in the ranking above.

x$vi_numeric$selection
#>  [1] "rainfall_mean"            "swi_mean"                
#>  [3] "evapotranspiration_max"   "evapotranspiration_range"
#>  [5] "swi_min"                  "soil_soc"                
#>  [7] "humidity_range"           "topo_elevation"          
#>  [9] "cloud_cover_range"        "continent"               
#> [11] "soil_sand"                "topo_diversity"          
#> [13] "topo_slope"              
#> attr(,"validated")
#> [1] TRUE

We can check that this selection of predictors shows low multicollinearity by running the function vif_df() on them.

collinear::vif_df(
  df = x$vi_numeric$df,
  predictors = x$vi_numeric$selection
)
#>        vif                predictor
#> 1   4.7203        cloud_cover_range
#> 2   3.3447                continent
#> 3   3.1112   evapotranspiration_max
#> 4   3.0627 evapotranspiration_range
#> 5   2.3856           humidity_range
#> 6   2.0747            rainfall_mean
#> 7   1.9928                soil_sand
#> 8   1.7690                 soil_soc
#> 9   1.6346                 swi_mean
#> 10  1.5984                  swi_min
#> 11  1.3531           topo_diversity
#> 12  0.9320           topo_elevation
#> 13 -0.2235               topo_slope

All VIF scores are below 2.5, collinear() did a good job here!

Finally, collinear() also returns modeling formulas to help kick start exploratory modelling.

x$vi_numeric$formulas
#> $linear
#> vi_numeric ~ rainfall_mean + swi_mean + evapotranspiration_max + 
#>     evapotranspiration_range + swi_min + soil_soc + humidity_range + 
#>     topo_elevation + cloud_cover_range + continent + soil_sand + 
#>     topo_diversity + topo_slope
#> <environment: 0x5adb1977e4b0>
#> 
#> $smooth
#> vi_numeric ~ s(rainfall_mean) + s(swi_mean) + s(evapotranspiration_max) + 
#>     s(evapotranspiration_range) + s(swi_min) + s(soil_soc) + 
#>     s(humidity_range) + s(topo_elevation) + s(cloud_cover_range) + 
#>     continent + s(soil_sand) + s(topo_diversity) + s(topo_slope)
#> <environment: 0x5adb1977e4b0>

The function returns linear formulas for numeric outcomes, and classification formulas for categorical outcomes.

Model Fitting

The output of collinear() can be used right away to fit exploratory models, as shown below.

m <- lm(
  formula = x$vi_numeric$formulas$linear, 
  data = x$vi_numeric$df,
  na.action = na.omit
  )

summary(m)
#> 
#> Call:
#> lm(formula = x$vi_numeric$formulas$linear, data = x$vi_numeric$df, 
#>     na.action = na.omit)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.42639 -0.04274  0.00107  0.04804  0.24228 
#> 
#> Coefficients:
#>                            Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)               3.027e-01  4.557e-02   6.644 6.97e-11 ***
#> rainfall_mean             4.183e-05  7.152e-06   5.849 8.22e-09 ***
#> swi_mean                  6.997e-03  4.415e-04  15.850  < 2e-16 ***
#> evapotranspiration_max   -1.206e-03  1.694e-04  -7.117 3.22e-12 ***
#> evapotranspiration_range -1.763e-04  1.396e-04  -1.263 0.207122    
#> swi_min                  -2.473e-03  6.162e-04  -4.013 6.77e-05 ***
#> soil_soc                 -4.845e-04  1.944e-04  -2.492 0.012964 *  
#> humidity_range           -1.042e-03  6.301e-04  -1.654 0.098731 .  
#> topo_elevation           -3.991e-05  7.280e-06  -5.481 6.27e-08 ***
#> cloud_cover_range         5.109e-04  3.588e-04   1.424 0.155005    
#> continentAsia            -2.847e-02  1.227e-02  -2.320 0.020687 *  
#> continentEurope           3.893e-02  1.895e-02   2.054 0.040412 *  
#> continentNorth America    6.304e-02  1.525e-02   4.134 4.08e-05 ***
#> continentOceania          4.555e-02  1.478e-02   3.083 0.002146 ** 
#> continentSouth America    5.547e-02  1.242e-02   4.466 9.54e-06 ***
#> soil_sand                 5.123e-04  2.821e-04   1.816 0.069887 .  
#> topo_diversity            3.318e-03  8.973e-04   3.698 0.000238 ***
#> topo_slope                4.209e-03  1.392e-03   3.024 0.002600 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.08237 on 589 degrees of freedom
#>   (3 observations deleted due to missingness)
#> Multiple R-squared:  0.8515, Adjusted R-squared:  0.8473 
#> F-statistic: 198.7 on 17 and 589 DF,  p-value: < 2.2e-16

Integration with tidymodels

The function step_collinear() wraps collinear() to facilitate its usage in tidymodels recipes. Please notice that step_collinear() does not perform target encoding, as combining this functionality with multicollinearity filtering does not fit well with how recipes works.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(recipes)
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
library(parsnip)
library(workflows)

# model formula
vi_formula <- collinear::model_formula(
  df = vi_smol,
  response = "vi_numeric",
  predictors = vi_predictors
)

# recipe
vi_recipe <- recipes::recipe(
  formula = vi_formula,
  data = vi_smol
  ) |>
  #multicollinearity filtering
  collinear::step_collinear(
    recipes::all_predictors()
  )

#linear model
vi_model <- parsnip::rand_forest() |>
  parsnip::set_engine("ranger", importance = "permutation") |>
  parsnip::set_mode("regression")

#create and fit workflow
vi_model <- parsnip::linear_reg() |>
  parsnip::set_engine("lm")

# create and fit workflow
vi_workflow <- workflows::workflow() |>
  workflows::add_recipe(vi_recipe) |>
  workflows::add_model(vi_model) |>
  workflows::fit(data = vi_smol)
#> Warning: 
#> collinear::collinear()
#> └── collinear::cor_df(): 11 categorical predictors have cardinality > 2 and may bias the multicollinearity analysis. Applying target encoding to convert them to numeric will solve this issue.

vi_workflow
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#> 
#> • step_collinear()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> 
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#> 
#> Coefficients:
#>              (Intercept)             rainfall_mean                  swi_mean  
#>                3.027e-01                 4.183e-05                 6.997e-03  
#>   evapotranspiration_max  evapotranspiration_range                   swi_min  
#>               -1.206e-03                -1.763e-04                -2.473e-03  
#>                 soil_soc            humidity_range            topo_elevation  
#>               -4.845e-04                -1.042e-03                -3.991e-05  
#>        cloud_cover_range             continentAsia           continentEurope  
#>                5.109e-04                -2.847e-02                 3.893e-02  
#>   continentNorth America          continentOceania    continentSouth America  
#>                6.304e-02                 4.555e-02                 5.547e-02  
#>                soil_sand            topo_diversity                topo_slope  
#>                5.123e-04                 3.318e-03                 4.209e-03

Citation

If you find this package useful, please cite it as:

*Blas M. Benito (2025). collinear: R Package for Automated Multicollinearity Management. Version 3.0.0. doi: 10.5281/zenodo.17843578 *

Getting help

If you encounter bugs or issues with the documentation, please file a issue on GitHub.

Copy Link

Version

Install

install.packages('collinear')

Monthly Downloads

348

Version

3.0.0

License

MIT + file LICENSE

Maintainer

Blas M. Benito

Last Published

December 8th, 2025

Functions in collinear (3.0.0)

cor_cramer

Quantify association between categorical variables
f_binomial_gam

Area under the curve of binomial GAM predictions vs. observations
experiment_cor_vs_vif

Dataframe with results of experiment comparing correlation and VIF thresholds
experiment_adaptive_thresholds

Dataframe resulting from experiment to test the automatic selection of multicollinearity thresholds
f_count_gam

R-squared of Poisson GAM predictions vs. observations
f_categorical_rf

Cramer's V of Categorical Random Forest predictions vs. observations
drop_geometry_column

Removes geometry Column From sf Dataframes
f_auto_rules

Decision rules for f_auto()
f_binomial_rf

Area Under the Curve of Binomial Random Forest predictions vs. observations
f_binomial_glm

Area Under the Curve of Binomial GLM predictions vs. observations
f_auto

Automatic selection of predictor scoring method
f_numeric_gam

R-squared of Gaussian GAM predictions vs. observations
f_functions

List predictor scoring functions
f_count_rf

R-squared of Random Forest predictions vs. observations
f_count_glm

R-squared of Poisson GLM predictions vs. observations
identify_response_type

Detect response variable type for model selection
f_numeric_rf

R-squared of Random Forest predictions vs. observations
f_numeric_glm

R-squared of Gaussian GLM predictions vs. observations
score_cramer

Compute Cramer's V between categorical observations and predictions
score_auc

Compute area under the ROC curve between binomial observations and probabilistic predictions
identify_logical_variables

Find logical variables in a dataframe
score_r2

Compute R-squared between numeric observations and predictions
identify_categorical_variables

Find valid categorical variables in a dataframe
preference_order

Rank predictors by importance or multicollinearity
identify_numeric_variables

Find valid numeric variables in a dataframe
validate_arg_df

Check and prepare argument df
prediction_cor_to_vif

Prediction of the model gam_cor_to_vif across correlation values
gam_cor_to_vif

GAM describing the relationship between correlation and VIF thresholds
toy

Toy dataframe with varying levels of multicollinearity
identify_valid_variables

Find valid numeric, categorical, and logical variables in a dataframe
validate_arg_df_not_null

Ensure that argument df is not NULL
identify_zero_variance_variables

Find near-zero variance variables in a dataframe
validate_arg_max_vif

Check and constrain argument max_vif
step_collinear

Tidymodels recipe step for multicollinearity filtering
print.collinear_selection

Print single selection results from collinear
validate_arg_responses

Check and validate arguments response and responses
target_encoding_loo

Encode categories as response means
target_encoding_lab

Convert categorical predictors to numeric via target encoding
validate_arg_function_name

Build hierarchical function names for messages
print.collinear_output

Print all collinear selection results of collinear()
model_formula

Build model formulas from response and predictors
summary.collinear_selection

Summarize single response selection results of collinear
validate_arg_max_cor

Check and constrain argument max_cor
vi

Large example dataframe
summary.collinear_output

Summarize all results of collinear()
validate_arg_quiet

Check and validate argument quiet
validate_arg_f

Check and validate argument f
validate_arg_encoding_method

Check and validate argument encoding_method
validate_arg_predictors

Check and validate argument predictors
vif_df

Compute variance inflation factors dataframe
vi_predictors

Vector of all predictor names in vi and vi_smol
validate_arg_preference_order

Check and complete argument preference_order
vif_select

Multicollinearity filtering by variance inflation factor threshold
vif_stats

VIF Statistics
vif

Compute variance inflation factors from a correlation matrix
vi_smol

Small example dataframe
vi_responses

Vector of response names in vi and vi_smol
vi_predictors_categorical

Vector of categorical predictors in vi and vi_smol
vi_predictors_numeric

Vector of numeric predictor names in vi and vi_smol
cor_clusters

Group predictors by hierarchical correlation clustering
collinear_stats

Compute summary statistics for correlation and VIF
collinear_select

Dual multicollinearity filtering algorithm
cor_matrix

Signed pairwise correlation matrix
cor_df

Compute signed pairwise correlations dataframe
cor_stats

Compute summary statistics for absolute pairwise correlations
case_weights

Generate sample weights for imbalanced responses
cor_select

Multicollinearity filtering by pairwise correlation threshold
collinear

Smart multicollinearity management