There's a newer version (3.0.0) of this package.

collinear: R Package for Seamless Multicollinearity Management

Warning

Version 2.0.0 of collinear includes changes that may disrupt existing workflows, and results from previous versions may not be reproducible due to enhancements in the automated selection algorithms. Please refer to the Changelog for details.

Summary

Multicollinearity hinders the interpretability of linear and machine learning models.

The collinear package combines four methods for easy management of multicollinearity in modelling data frames with numeric and categorical variables:

  • Target Encoding: Transforms categorical predictors to numeric using a numeric response as reference.
  • Preference Order: Ranks predictors by their association with a response variable to preserve important ones in multicollinearity filtering.
  • Pairwise Correlation Filtering: Automated multicollinearity filtering of numeric and categorical predictors based on pairwise correlations.
  • Variance Inflation Factor Filtering: Automated multicollinearity filtering of numeric predictors based on Variance Inflation Factors.
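
As a rough illustration of the quantities behind the last two methods, the sketch below computes pairwise correlations and Variance Inflation Factors in base R on the built-in mtcars data. It is not the package's implementation; the data set and the chosen columns are assumptions made only for this example.

```r
# Base R sketch of the two filtering criteria (illustrative only,
# not the code used by collinear). Data and columns are assumptions.
predictors <- c("disp", "hp", "wt", "drat")
x <- mtcars[, predictors]

# Pairwise Pearson correlations between predictors
cor_matrix <- stats::cor(x)

# VIF of each predictor: 1 / (1 - R^2), where R^2 comes from
# regressing that predictor on all the other predictors
vif <- vapply(
  predictors,
  function(p) {
    f <- stats::reformulate(setdiff(predictors, p), response = p)
    r2 <- summary(stats::lm(f, data = x))$r.squared
    1 / (1 - r2)
  },
  numeric(1)
)

# Equivalent shortcut: diagonal of the inverse correlation matrix
vif_inverse <- diag(solve(cor_matrix))
```

A VIF of 5 means that 80% of a predictor's variance is explained by the other predictors, which is why thresholds around that value are common.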

These methods are combined in the function collinear(), which serves as a single entry point for most of the functionality in the package. The article How It Works explains how collinear() works in detail.
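
One way to picture how a preference order and a correlation threshold can interact is a simple greedy filter, sketched below in base R. This is a simplified stand-in written for illustration, not the algorithm used by collinear(); the mtcars data, the squared-correlation ranking, and the variable names are all assumptions.

```r
# Greedy correlation filter guided by a preference order
# (illustrative sketch, not the collinear() algorithm).
response <- "mpg"
predictors <- c("disp", "hp", "wt", "drat", "qsec")
max_cor <- 0.75

# Preference order: predictors ranked by their squared
# correlation with the response, strongest first
r2 <- vapply(
  predictors,
  function(p) stats::cor(mtcars[[p]], mtcars[[response]])^2,
  numeric(1)
)
preference <- names(sort(r2, decreasing = TRUE))

# Absolute pairwise correlations between predictors
cor_abs <- abs(stats::cor(mtcars[, predictors]))

# Visit predictors in preference order and keep each one only if
# it is not too correlated with any predictor already kept
selected <- character(0)
for (p in preference) {
  if (all(cor_abs[p, selected] <= max_cor)) {
    selected <- c(selected, p)
  }
}

selected
```

Because predictors are visited from most to least preferred, the one with the strongest association with the response is always retained, and weaker predictors are dropped only when they are redundant with one already kept.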

Citation

If you find this package useful, please cite it as:

Blas M. Benito (2024). collinear: R Package for Seamless Multicollinearity Management. Version 2.0.0. doi: 10.5281/zenodo.10039489

Main Improvements in Version 2.0.0

  1. Expanded Functionality: Functions collinear() and preference_order() support both categorical and numeric responses and predictors, and can handle several responses at once.
  2. Robust Selection Algorithms: Enhanced selection in vif_select() and cor_select().
  3. Enhanced Functionality to Rank Predictors: New functions to compute association between response and predictors covering most use-cases, and automated function selection depending on data features.
  4. Simplified Target Encoding: Streamlined and parallelized for better efficiency; the new default method is “loo” (leave-one-out).
  5. Parallelization and Progress Bars: Utilizes future and progressr for enhanced performance and user experience.
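
To make the leave-one-out idea concrete, here is a minimal base R sketch: each row's category is replaced by the mean of the response over the other rows in the same category. The toy data and the fallback to the global mean for single-row groups are assumptions of this example, not a description of how the package handles those cases.

```r
# Leave-one-out target encoding in base R (illustrative sketch).
# Toy data with hypothetical values.
df <- data.frame(
  y = c(1, 2, 3, 10, 20, 5),
  group = c("a", "a", "a", "b", "b", "c")
)

# Per-group sum and count of the response, aligned with rows of df
group_sum <- stats::ave(df$y, df$group, FUN = sum)
group_n <- stats::ave(df$y, df$group, FUN = length)

# Leave-one-out mean: group mean computed without the current row
df$group_loo <- (group_sum - df$y) / (group_n - 1)

# Assumption: groups with a single row have no other cases,
# so they fall back to the global mean of the response
single <- group_n == 1
df$group_loo[single] <- mean(df$y)

df
```

Excluding the current row keeps each encoded value from containing its own response, which limits the leakage that plain group means introduce.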

Install

The package collinear can be installed from CRAN.

install.packages("collinear")

The development version can be installed from GitHub.

remotes::install_github(
  repo = "blasbenito/collinear", 
  ref = "development"
  )

Previous versions are in the “archive_xxx” branches of the GitHub repository.

remotes::install_github(
  repo = "blasbenito/collinear", 
  ref = "archive_v1.1.1"
  )

Getting Started

The function collinear() provides all tools required for a fully fledged multicollinearity filtering workflow. The code below shows a small example workflow.

#parallelization setup
future::plan(
  future::multisession,
  workers = parallelly::availableCores() - 1
  )

#progress bar (does not work in Rmarkdown)
#progressr::handlers(global = TRUE)

#example data frame
df <- collinear::vi[1:5000, ]

#there are many NA cases in this data frame
sum(is.na(df))
#> [1] 3391

#numeric and categorical predictors
predictors <- collinear::vi_predictors

collinear::identify_predictors(
  df = df,
  predictors = predictors
)
#> $numeric
#>  [1] "topo_slope"                 "topo_diversity"            
#>  [3] "topo_elevation"             "swi_mean"                  
#>  [5] "swi_max"                    "swi_min"                   
#>  [7] "swi_range"                  "soil_temperature_mean"     
#>  [9] "soil_temperature_max"       "soil_temperature_min"      
#> [11] "soil_temperature_range"     "soil_sand"                 
#> [13] "soil_clay"                  "soil_silt"                 
#> [15] "soil_ph"                    "soil_soc"                  
#> [17] "soil_nitrogen"              "solar_rad_mean"            
#> [19] "solar_rad_max"              "solar_rad_min"             
#> [21] "solar_rad_range"            "growing_season_length"     
#> [23] "growing_season_temperature" "growing_season_rainfall"   
#> [25] "growing_degree_days"        "temperature_mean"          
#> [27] "temperature_max"            "temperature_min"           
#> [29] "temperature_range"          "temperature_seasonality"   
#> [31] "rainfall_mean"              "rainfall_min"              
#> [33] "rainfall_max"               "rainfall_range"            
#> [35] "evapotranspiration_mean"    "evapotranspiration_max"    
#> [37] "evapotranspiration_min"     "evapotranspiration_range"  
#> [39] "cloud_cover_mean"           "cloud_cover_max"           
#> [41] "cloud_cover_min"            "cloud_cover_range"         
#> [43] "aridity_index"              "humidity_mean"             
#> [45] "humidity_max"               "humidity_min"              
#> [47] "humidity_range"             "country_population"        
#> [49] "country_gdp"               
#> 
#> $categorical
#>  [1] "koppen_zone"        "koppen_group"       "koppen_description"
#>  [4] "soil_type"          "biogeo_ecoregion"   "biogeo_biome"      
#>  [7] "biogeo_realm"       "country_name"       "country_income"    
#> [10] "continent"          "region"             "subregion"

#multicollinearity filtering
selection <- collinear::collinear(
  df = df,
  response = c(
    "vi_numeric",    #numeric response
    "vi_categorical" #categorical response
    ),
  predictors = predictors,
  max_cor = 0.75,
  max_vif = 5,
  quiet = TRUE
)

When more than one response is provided, the output is a named list of character vectors with the selected predictor names, one per response. With a single response, the output is a character vector.

selection
#> $vi_numeric
#>  [1] "growing_season_length"  "soil_temperature_max"   "soil_temperature_range"
#>  [4] "solar_rad_max"          "rainfall_max"           "subregion"             
#>  [7] "biogeo_realm"           "swi_range"              "rainfall_min"          
#> [10] "soil_nitrogen"          "continent"              "cloud_cover_range"     
#> [13] "topo_diversity"        
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_numeric"
#> 
#> $vi_categorical
#>  [1] "rainfall_mean"        "swi_mean"             "soil_temperature_max"
#>  [4] "soil_type"            "humidity_max"         "solar_rad_max"       
#>  [7] "country_gdp"          "swi_range"            "rainfall_range"      
#> [10] "country_population"   "soil_soc"             "region"              
#> [13] "country_income"       "topo_diversity"       "topo_slope"          
#> attr(,"validated")
#> [1] TRUE
#> attr(,"response")
#> [1] "vi_categorical"

The output of collinear() can be easily converted into model formulas.

formulas <- collinear::model_formula(
  predictors = selection
)

formulas
#> $vi_numeric
#> vi_numeric ~ growing_season_length + soil_temperature_max + soil_temperature_range + 
#>     solar_rad_max + rainfall_max + subregion + biogeo_realm + 
#>     swi_range + rainfall_min + soil_nitrogen + continent + cloud_cover_range + 
#>     topo_diversity
#> <environment: 0x5afafd703c68>
#> 
#> $vi_categorical
#> vi_categorical ~ rainfall_mean + swi_mean + soil_temperature_max + 
#>     soil_type + humidity_max + solar_rad_max + country_gdp + 
#>     swi_range + rainfall_range + country_population + soil_soc + 
#>     region + country_income + topo_diversity + topo_slope
#> <environment: 0x5afafd703c68>
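
For readers curious about what such a formula looks like when built by hand, base R's stats::reformulate() produces a comparable formula object from a character vector. This is a sketch with a shortened, hand-typed predictor list, not the internals of model_formula().

```r
# Build a model formula from predictor names with base R
# (illustrative; the predictor list is shortened by hand).
selected <- c(
  "growing_season_length",
  "soil_temperature_max",
  "subregion"
)

f <- stats::reformulate(
  termlabels = selected,
  response = "vi_numeric"
)

f
```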

These formulas can be used to fit models right away.

#linear model
m_vi_numeric <- stats::glm(
  formula = formulas[["vi_numeric"]], 
  data = df,
  na.action = na.omit
  )

#random forest model
m_vi_categorical <- ranger::ranger(
  formula = formulas[["vi_categorical"]],
  data = na.omit(df)
)

Getting help

If you encounter bugs or issues with the documentation, please file an issue on GitHub.

Monthly Downloads: 348
Version: 2.0.0
License: MIT + file LICENSE
Maintainer: Blas M. Benito
Last Published: November 8th, 2024

Functions in collinear (2.0.0)

  • drop_geometry_column: Removes geometry column in sf data frames
  • f_auc: Association Between a Binomial Response and a Continuous Predictor
  • f_auto: Select Function to Compute Preference Order
  • f_r2_counts: Association Between a Count Response and a Continuous Predictor
  • f_v_rf_categorical: Association Between a Categorical Response and a Categorical or Numeric Predictor
  • encoded_predictor_name: Name of Target-Encoded Predictor
  • f_functions: Data Frame of Preference Functions
  • f_r2: Association Between a Continuous Response and a Continuous Predictor
  • f_auto_rules: Rules to Select Default f Argument to Compute Preference Order
  • identify_predictors: Identify Numeric and Categorical Predictors
  • f_v: Association Between a Categorical Response and a Categorical Predictor
  • identify_predictors_type: Identify Predictor Types
  • identify_predictors_numeric: Identify Valid Numeric Predictors
  • performance_score_r2: Pearson's R-squared of Observations vs Predictions
  • identify_predictors_categorical: Identify Valid Categorical Predictors
  • performance_score_auc: Area Under the Curve of Binomial Observations vs Probabilistic Model Predictions
  • performance_score_v: Cramer's V of Observations vs Predictions
  • preference_order: Quantitative Variable Prioritization for Multicollinearity Filtering
  • model_formula: Generate Model Formulas
  • identify_response_type: Identify Response Type
  • validate_data_cor: Validate Data for Correlation Analysis
  • identify_predictors_zero_variance: Identify Zero and Near-Zero Variance Predictors
  • preference_order_collinear: Preference Order Argument in collinear()
  • vi_predictors: All Predictor Names in Example Data Frame vi
  • target_encoding_lab: Target Encoding Lab: Transform Categorical Variables to Numeric
  • vif_df: Variance Inflation Factor
  • validate_data_vif: Validate Data for VIF Analysis
  • vi_predictors_numeric: All Numeric Predictor Names in Example Data Frame vi
  • validate_predictors: Validate Argument predictors
  • vif_select: Automated Multicollinearity Filtering with Variance Inflation Factors
  • target_encoding_mean: Target Encoding Methods
  • toy: One response and four predictors with varying levels of multicollinearity
  • validate_encoding_arguments: Validates Arguments of target_encoding_lab()
  • validate_df: Validate Argument df
  • validate_response: Validate Argument response
  • vi: Example Data With Different Response and Predictor Types
  • validate_preference_order: Validate Argument preference_order
  • vi_predictors_categorical: All Categorical and Factor Predictor Names in Example Data Frame vi
  • cor_clusters: Hierarchical Clustering from a Pairwise Correlation Matrix
  • collinear-package: collinear
  • cor_df: Pairwise Correlation Data Frame
  • cor_matrix: Pairwise Correlation Matrix
  • cor_cramer_v: Bias Corrected Cramer's V
  • add_white_noise: Add White Noise to Encoded Predictor
  • cor_select: Automated Multicollinearity Filtering with Pairwise Correlations
  • case_weights: Case Weights for Unbalanced Binomial or Categorical Responses
  • collinear: Automated multicollinearity management