corrPrune() performs model-free variable subset selection by iteratively
removing predictors until all pairwise associations fall below a specified
threshold. It returns a single pruned data frame with predictors that satisfy
the association constraint.
corrPrune(
data,
threshold = 0.7,
measure = "auto",
mode = "auto",
force_in = NULL,
by = NULL,
group_q = 1,
max_exact_p = 100,
...
)
A data.frame containing the pruned subset of predictors. The result has the following attributes:
Character vector of retained variable names
Character vector of removed variable names
Character string indicating which mode was used ("exact" or "greedy")
Character string indicating which association measure was used
The threshold value used
data: A data.frame containing candidate predictors.
threshold: Numeric scalar. Maximum allowed pairwise association (default: 0.7). Must be non-negative.
measure: Character string specifying the association measure to use.
Options: "auto" (default), "pearson", "spearman", "kendall",
"cramersv", "eta", etc. When "auto", Pearson correlation is used
for all-numeric data, and appropriate measures are selected for mixed-type
data.
mode: Character string specifying the search algorithm. Options:
"auto" (default): uses exact search if number of predictors <= max_exact_p,
otherwise uses greedy search
"exact": exhaustive search for maximal subsets (may be slow for large p)
"greedy": fast approximate search using iterative removal
force_in: Character vector of variable names that must be retained in the final subset. Default: NULL.
by: Character vector naming one or more grouping variables. If provided,
associations are computed separately within each group, then aggregated
using the quantile specified by group_q. Default: NULL (no grouping).
group_q: Numeric scalar in (0, 1]. Quantile used to aggregate
associations across groups when by is provided. Default: 1 (maximum,
ensuring threshold holds in all groups). Use 0.9 for 90th percentile, etc.
max_exact_p: Integer. Maximum number of predictors for which exact
mode is used when mode = "auto". Default: 100.
...: Additional arguments (reserved for future use).
corrPrune() identifies a subset of predictors whose pairwise associations
are all below threshold. The function works in several stages:
Variable type detection: Identifies numeric vs. categorical predictors
Association measurement: Computes appropriate pairwise associations
Grouping (optional): If by is specified, computes associations within
each group and aggregates using the specified quantile
Feasibility check: Verifies that force_in variables satisfy the
threshold constraint
Subset selection: Uses either exact or greedy search to find a valid subset
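The feasibility check in the stages above can be illustrated with a small base-R sketch. The helper check_force_in() is hypothetical and not part of the package. It assumes all-numeric data, absolute Pearson correlation as the association measure, and that a value equal to the threshold is still allowed.

```r
# Hypothetical helper (not the package's code): TRUE if the forced variables
# are mutually compatible with the threshold, FALSE otherwise.
check_force_in <- function(data, force_in, threshold = 0.7) {
  # A single forced variable cannot conflict with itself
  if (length(force_in) < 2) return(TRUE)
  assoc <- abs(cor(data[, force_in, drop = FALSE]))
  # Only the off-diagonal entries are pairwise associations
  all(assoc[upper.tri(assoc)] <= threshold)
}

data(mtcars)
ok_pair  <- check_force_in(mtcars, c("mpg", "qsec"))  # weakly correlated pair
bad_pair <- check_force_in(mtcars, c("cyl", "disp"))  # strongly correlated pair
```

If a check of this kind fails, no valid subset containing every force_in variable exists at that threshold.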
Grouped Pruning: When by is provided, the function ensures the selected
predictors satisfy the threshold constraint across groups. For example, with
group_q = 1 (default), the returned predictors will have pairwise associations
below threshold in all groups. With group_q = 0.9, they will satisfy
the constraint in at least 90% of groups.
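A rough sketch of how per-group associations might be combined with a quantile follows. The function aggregate_assoc() is a hypothetical illustration, not the package's implementation. It assumes every non-grouping column is numeric and uses absolute Pearson correlation as the measure.

```r
# Hypothetical illustration of quantile aggregation across groups.
aggregate_assoc <- function(data, by, group_q = 1) {
  predictors <- setdiff(names(data), by)
  # One data frame of predictors per group
  groups <- split(data[, predictors, drop = FALSE], data[by], drop = TRUE)
  # One absolute correlation matrix per group
  mats <- lapply(groups, function(g) abs(cor(g)))
  p <- length(predictors)
  # Stack the matrices into a p x p x n_groups array
  stacked <- array(unlist(mats), dim = c(p, p, length(mats)),
                   dimnames = list(predictors, predictors, names(mats)))
  # For each predictor pair, take the requested quantile over groups
  apply(stacked, c(1, 2), quantile, probs = group_q, na.rm = TRUE, names = FALSE)
}

data(iris)
agg_max <- aggregate_assoc(iris, by = "Species", group_q = 1)    # worst case over groups
agg_med <- aggregate_assoc(iris, by = "Species", group_q = 0.5)  # median over groups
```

With group_q = 1 each entry is the largest association seen in any group, so comparing it against the threshold enforces the constraint in every group. Lower quantiles tolerate violations in some groups.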
Mode Selection: Exact mode guarantees finding all maximal subsets and returns the largest one (with deterministic tie-breaking). Greedy mode is faster but approximate, using a deterministic removal strategy based on association scores.
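To make the idea of greedy iterative removal concrete, here is a minimal base-R sketch. The function greedy_prune() and its removal rule (drop the variable involved in the most threshold violations, ties broken by column order) are assumptions chosen for illustration. The package's actual strategy may differ. The sketch assumes all-numeric data and absolute Pearson correlation.

```r
# Hypothetical sketch of greedy pruning; not the package's algorithm.
greedy_prune <- function(data, threshold = 0.7, force_in = NULL) {
  assoc <- abs(cor(data))
  diag(assoc) <- 0  # ignore self-association
  keep <- colnames(assoc)
  repeat {
    sub <- assoc[keep, keep, drop = FALSE]
    # Stop once every remaining pair satisfies the threshold
    if (length(keep) < 2 || max(sub) <= threshold) break
    # Number of threshold violations each variable takes part in
    violations <- colSums(sub > threshold)
    # Forced variables are never candidates for removal
    violations[names(violations) %in% force_in] <- -1
    if (max(violations) <= 0) {
      stop("force_in variables violate the threshold among themselves")
    }
    # which.max() returns the first maximum, so ties break by column order
    keep <- setdiff(keep, names(which.max(violations)))
  }
  data[, keep, drop = FALSE]
}

data(mtcars)
pruned_sketch <- greedy_prune(mtcars, threshold = 0.7, force_in = "mpg")
names(pruned_sketch)
```

A rule like this always ends with a valid subset when one exists, but unlike exhaustive search it does not guarantee the largest one.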
corrSelect for exhaustive subset enumeration,
assocSelect for mixed-type data subset enumeration,
modelPrune for model-based predictor pruning.
# Basic numeric data pruning
data(mtcars)
pruned <- corrPrune(mtcars, threshold = 0.7)
names(pruned)
# Force certain variables to be included
pruned <- corrPrune(mtcars, threshold = 0.7, force_in = "mpg")
# Use greedy mode for faster computation
pruned <- corrPrune(mtcars, threshold = 0.7, mode = "greedy")