corrselect

Fast and Flexible Predictor Pruning for Data Analysis and Modeling

The corrselect package provides simple, high-level functions for predictor pruning using association-based and model-based approaches. Whether you need to reduce multicollinearity before modeling or clean correlated predictors in your dataset, corrselect offers fast, deterministic solutions with minimal code.

Quick Start

library(corrselect)
data(mtcars)

# Association-based pruning (model-free)
pruned <- corrPrune(mtcars, threshold = 0.7)
names(pruned)

# Model-based pruning (VIF)
pruned <- modelPrune(mpg ~ ., data = mtcars, limit = 5)
attr(pruned, "selected_vars")

Statement of Need

Variable selection is a central task in statistics and machine learning, particularly when working with high-dimensional or collinear data. In many applications, users aim to retain sets of variables that are weakly associated with one another to avoid redundancy and reduce overfitting. Common approaches such as greedy filtering or regularized regression either discard useful features or do not guarantee bounded pairwise associations.

This package addresses the admissible set problem: selecting all maximal subsets of variables such that no pairwise association within a subset exceeds a user-defined threshold. It generalizes to mixed-type data, supports multiple association metrics, and allows constrained subset selection via force_in (e.g., to always include key predictors).

These features make the package useful in domains like:

  • ecological and bioclimatic modeling,
  • trait-based species selection,
  • interpretable machine learning pipelines.
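The admissible set problem described above can be made concrete with a small brute-force sketch in base R. This is only an illustration of the problem definition, not the package's algorithm: the helper name maximal_admissible is hypothetical, and the approach checks every subset, so it is practical only for a handful of variables.

```r
# Brute-force illustration of the admissible set problem (base R only).
# Hypothetical helper, not part of corrselect: it checks all 2^p - 1
# non-empty subsets, so keep p small.
maximal_admissible <- function(cmat, threshold) {
  p <- ncol(cmat)
  vars <- colnames(cmat)
  ok <- abs(cmat) <= threshold   # TRUE where a pair may coexist
  diag(ok) <- TRUE

  # A subset is admissible if every pair inside it is compatible
  admissible <- list()
  for (mask in seq_len(2^p - 1)) {
    idx <- which((mask %/% 2^(seq_len(p) - 1)) %% 2 == 1)
    if (all(ok[idx, idx])) admissible[[length(admissible) + 1]] <- idx
  }

  # Keep only subsets that are not strictly contained in another one
  is_maximal <- vapply(admissible, function(s) {
    !any(vapply(admissible, function(other) {
      length(other) > length(s) && all(s %in% other)
    }, logical(1)))
  }, logical(1))

  lapply(admissible[is_maximal], function(s) vars[s])
}

cmat <- cor(mtcars[, c("mpg", "cyl", "disp", "qsec", "gear")])
subsets <- maximal_admissible(cmat, threshold = 0.7)
print(subsets)
```

Each returned subset is maximal in the sense that adding any further variable would create a pair above the threshold.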

Features

High-Level Pruning Functions

  • corrPrune(): Association-based predictor pruning

    • Model-free, works on raw data
    • Automatic correlation/association measure selection
    • Exact mode for guaranteed optimal solutions (recommended for p ≤ 100)
    • Fast greedy mode for large datasets (p > 100)
    • Protect important variables with force_in
  • modelPrune(): Model-based predictor pruning

    • VIF-based iterative removal
    • Supports lm, glm, lme4, glmmTMB engines
    • Custom engine support for any modeling package (INLA, mgcv, brms, etc.)
    • Prunes fixed effects in mixed models
    • Returns fitted model with pruned predictors
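To show what VIF-based iterative removal means in practice, here is a minimal base R sketch for the lm case. It is an assumption-laden illustration rather than the modelPrune() implementation: the helper vif_prune is hypothetical, it handles numeric predictors only, and it uses the textbook definition VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors.

```r
# Sketch of VIF-based backward removal with lm() (base R only).
# Hypothetical helper; corrselect's modelPrune() may differ in detail.
vif_prune <- function(data, predictors, limit = 5) {
  # VIF of predictor v given the currently kept predictors
  vif_of <- function(v, keep) {
    others <- setdiff(keep, v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  }
  keep <- predictors
  while (length(keep) > 1) {
    vifs <- vapply(keep, vif_of, numeric(1), keep = keep)
    if (max(vifs) <= limit) break
    keep <- setdiff(keep, names(which.max(vifs)))  # drop the worst offender
  }
  keep
}

kept <- vif_prune(mtcars, c("cyl", "disp", "hp", "wt"), limit = 5)
print(kept)
```

Removing one predictor at a time and recomputing the VIFs matters because dropping a single variable can lower the VIFs of all the others.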

Advanced Subset Enumeration

  • Exhaustive exact subset search using graph algorithms:

    • Eppstein–Löffler–Strash (ELS)
    • Bron–Kerbosch (with optional pivoting)
    • Used internally by corrPrune(mode = "exact")
  • Multiple association metrics:

    • "pearson", "spearman", "kendall"
    • "bicor" (WGCNA), "distance" (energy), "maximal" (minerva)
    • "eta", "cramersv" for mixed-type data
  • force_in: protect variables from removal

  • Deterministic tie-breaking for reproducibility
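The link between subset enumeration and graph algorithms can be sketched as follows: treat each variable as a vertex, join two variables when their absolute correlation is at or below the threshold, and enumerate the maximal cliques of that graph. The base R code below is a small illustrative version of Bron–Kerbosch with pivoting. The function name bron_kerbosch is hypothetical and the package's own implementation is not shown here.

```r
# Bron–Kerbosch with pivoting on a "compatibility" graph (base R only).
# adj is a logical adjacency matrix with FALSE on the diagonal.
# Illustrative sketch, not the corrselect implementation.
bron_kerbosch <- function(adj) {
  cliques <- list()
  expand <- function(R, P, X) {
    if (length(P) == 0 && length(X) == 0) {
      cliques[[length(cliques) + 1]] <<- sort(R)  # R is a maximal clique
      return(invisible(NULL))
    }
    # Pivot: the vertex in P union X with the most neighbours in P
    PX <- c(P, X)
    pivot <- PX[which.max(vapply(PX, function(u) sum(adj[u, P]), integer(1)))]
    # Only vertices not adjacent to the pivot need to be tried
    for (v in P[!adj[pivot, P]]) {
      nb <- which(adj[v, ])
      expand(c(R, v), intersect(P, nb), intersect(X, nb))
      P <- setdiff(P, v)
      X <- c(X, v)
    }
  }
  expand(integer(0), seq_len(nrow(adj)), integer(0))
  cliques
}

cmat <- cor(mtcars[, c("mpg", "cyl", "disp", "qsec", "gear")])
adj <- abs(cmat) <= 0.7
diag(adj) <- FALSE
cliques <- lapply(bron_kerbosch(adj), function(s) colnames(cmat)[s])
print(cliques)
```

Pivoting prunes branches that cannot lead to a new maximal clique, which is why it is usually preferred over the plain recursion on dense graphs.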

Installation

# Install from CRAN
install.packages("corrselect")

# Or install development version from GitHub
# install.packages("pak")
pak::pak("gcol33/corrselect")

Usage Examples

Association-Based Pruning (corrPrune)

library(corrselect)
data(mtcars)

# Basic: Remove correlated predictors
pruned <- corrPrune(mtcars, threshold = 0.7)
names(pruned)

# Protect important variables
pruned <- corrPrune(mtcars, threshold = 0.7, force_in = "mpg")

# Use exact mode (slower, guaranteed optimal)
pruned <- corrPrune(mtcars, threshold = 0.7, mode = "exact")

# Use greedy mode (faster for large datasets)
pruned <- corrPrune(mtcars, threshold = 0.7, mode = "greedy")

# Check which variables were kept
attr(pruned, "selected_vars")
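For intuition about the trade-off between the exact and greedy modes, the sketch below implements one simple greedy heuristic in base R. It is not necessarily the rule that corrPrune(mode = "greedy") uses. The helper greedy_prune is hypothetical, works on numeric columns only, and assumes that any variables listed in force_in are mutually compatible.

```r
# One simple greedy heuristic (base R only); hypothetical helper.
# Repeatedly drop the variable involved in the most above-threshold
# pairs until no such pair remains among unprotected variables.
greedy_prune <- function(data, threshold = 0.7, force_in = character(0)) {
  cmat <- abs(cor(data))
  diag(cmat) <- 0
  keep <- colnames(cmat)
  repeat {
    viol <- cmat[keep, keep, drop = FALSE] > threshold
    counts <- rowSums(viol)
    counts[names(counts) %in% force_in] <- 0  # never drop protected variables
    if (all(counts == 0)) break
    # Ties are broken by column order, which keeps the result deterministic
    keep <- setdiff(keep, names(which.max(counts)))
  }
  data[, keep, drop = FALSE]
}

res_greedy <- greedy_prune(mtcars, threshold = 0.7, force_in = "mpg")
names(res_greedy)
```

A greedy pass like this returns a single admissible subset quickly, but unlike exact enumeration it offers no guarantee that the subset is the largest one available.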

Model-Based Pruning (modelPrune)

# Linear model with VIF threshold
pruned <- modelPrune(mpg ~ cyl + disp + hp + wt, data = mtcars, limit = 5)
attr(pruned, "removed_vars")

# GLM with binomial family
mtcars$am_binary <- as.factor(mtcars$am)
pruned <- modelPrune(am_binary ~ cyl + disp + hp,
                     data = mtcars, engine = "glm",
                     family = binomial(), limit = 5)

# Mixed model (requires lme4)
if (requireNamespace("lme4", quietly = TRUE)) {
  # Use built-in sleepstudy data with polynomial terms
  sleep <- lme4::sleepstudy
  sleep$Days2 <- sleep$Days^2
  suppressWarnings(
    pruned <- modelPrune(Reaction ~ Days + Days2 + (1|Subject),
                         data = sleep, engine = "lme4", limit = 5)
  )
  attr(pruned, "selected_vars")
}

# Custom engine (advanced: works with any modeling package)
# Example: INLA-based pruning
if (requireNamespace("INLA", quietly = TRUE)) {
  inla_engine <- list(
    name = "inla",
    fit = function(formula, data, ...) {
      INLA::inla(formula = formula, data = data,
                 family = "gaussian", ...)
    },
    diagnostics = function(model, fixed_effects) {
      # Use posterior SD as badness metric
      scores <- model$summary.fixed[, "sd"]
      names(scores) <- rownames(model$summary.fixed)
      scores[fixed_effects]
    }
  )

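  # Hypothetical data: replace with your own data frame containing y, x1, x2
  df <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50))
  pruned <- modelPrune(y ~ x1 + x2, data = df,
                       engine = inla_engine, limit = 0.5)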
}

Exact Subset Enumeration (Advanced)

# Find ALL maximal subsets
res <- corrSelect(mtcars, threshold = 0.7)
show(res)

# Extract a specific subset
subset1 <- corrSubset(res, mtcars, which = 1)

# Convert to data frame
as.data.frame(res)

Choosing Between corrPrune and modelPrune

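| Feature                       | corrPrune()                        | modelPrune()                        |
|-------------------------------|------------------------------------|-------------------------------------|
| Requires model specification? | No                                 | Yes                                 |
| Based on                      | Pairwise correlations/associations | Model diagnostics (VIF)             |
| Speed                         | Fast (greedy mode)                 | Moderate (refits models)            |
| Works without response?       | Yes                                | No                                  |
| Supports mixed models?        | No                                 | Yes (lme4, glmmTMB)                 |
| Best for                      | Exploratory analysis, large p      | Regression workflows, VIF reduction |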

Tip: Use corrPrune() first to reduce dimensionality, then modelPrune() for final cleanup within a modeling framework.
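The two-stage workflow in the tip might look like the sketch below. It uses only the calls shown elsewhere in this README. Pruning the predictors without the response column and then re-attaching mpg is one possible design choice, assumed here so that predictors are not discarded merely for correlating with the response.

```r
library(corrselect)
data(mtcars)

# Stage 1: association-based pruning on the predictors only
predictors <- mtcars[, setdiff(names(mtcars), "mpg")]
stage1 <- corrPrune(predictors, threshold = 0.7)

# Re-attach the response before fitting models
model_data <- data.frame(mpg = mtcars$mpg, stage1)

# Stage 2: VIF-based cleanup within the modeling framework
stage2 <- modelPrune(mpg ~ ., data = model_data, limit = 5)
attr(stage2, "selected_vars")
```

The first stage is model-free and cheap, so it reduces the number of model refits needed in the second stage.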

Advanced Features

Mixed-Type Data

Use assocSelect() for exact enumeration with mixed data types:

df <- data.frame(
  height = rnorm(30, 170, 10),
  weight = rnorm(30, 70, 12),
  group  = factor(sample(c("A","B"), 30, TRUE)),
  rating = ordered(sample(c("low","med","high"), 30, TRUE))
)

res <- assocSelect(df, threshold = 0.6)
show(res)
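For readers unfamiliar with the "eta" and "cramersv" measures listed under Features, the base R sketch below computes their textbook definitions. The function names eta and cramers_v are hypothetical, and the package's implementation may differ in details such as bias corrections or missing-value handling.

```r
# Textbook association measures for mixed-type data (base R only).
# Hypothetical helpers, not the corrselect internals.

# Correlation ratio (eta): how much of numeric y is explained by factor g
eta <- function(y, g) {
  g <- droplevels(as.factor(g))
  ss_between <- sum(tapply(y, g, function(v) length(v) * (mean(v) - mean(y))^2))
  ss_total <- sum((y - mean(y))^2)
  sqrt(ss_between / ss_total)
}

# Cramér's V: association between two categorical variables
cramers_v <- function(a, b) {
  tab <- table(a, b)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

eta_iris <- eta(iris$Sepal.Length, iris$Species)
v_cars <- cramers_v(mtcars$cyl, mtcars$gear)
print(c(eta = eta_iris, cramers_v = v_cars))
```

Both measures lie between 0 and 1, which lets them share a single threshold with absolute correlations.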

Precomputed Correlation Matrices

Work directly with correlation matrices:

mat <- cor(mtcars)
res <- MatSelect(mat, threshold = 0.7, method = "els")

JOSS Paper

This repository includes a short paper prepared for submission to the Journal of Open Source Software (JOSS). You can find the manuscript and references in the paper/ directory:

  • paper/paper.md – main text
  • paper/paper.bib – bibliography

License

MIT (see the LICENSE.md file)

Package Details

  • Version: 3.0.5
  • License: MIT + file LICENSE
  • Maintainer: Gilles Colling
  • Last published: December 16th, 2025
  • Monthly downloads: 504

Functions in corrselect (3.0.5)

  • corrSubset: Extract Variable Subsets from a CorrCombo Object
  • corrSelect: Select Variable Subsets with Low Correlation (Data Frame Interface)
  • as.data.frame.CorrCombo: Coerce CorrCombo to a Data Frame
  • MatSelect: Select Variable Subsets with Low Correlation or Association (Matrix Interface)
  • CorrCombo: CorrCombo S4 class
  • assocSelect: Select Variable Subsets with Low Association (Mixed-Type Data Frame Interface)
  • genes_example: Example Gene Expression Data for Bioinformatics
  • corrPrune: Association-Based Predictor Pruning
  • survey_example: Example Survey Data for Social Science Research
  • bioclim_example: Example Bioclimatic Data for Ecological Modeling
  • cor_example: Example Correlation Matrix with Block Structure
  • modelPrune: Model-Based Predictor Pruning
  • longitudinal_example: Example Longitudinal Data for Clinical Research