Pulls the per-tree importance scores out of a fitted varpro object
and summarises them into a data structure the plot method can draw as a
boxplot. The box hinges are the 15th and 85th percentiles and the
whiskers run to the 5th and 95th -- not the usual Tukey 1.5 IQR whiskers.
For a classification forest you can also keep the class-conditional
importances.
gg_varpro(
object,
local.std = TRUE,
cutoff = 0.79,
faithful = FALSE,
conditional = FALSE,
nvar = NULL,
...
)A named list of class "gg_varpro" with elements:
$impSummary data frame: variable (factor whose
levels run least- to most-important by median per-tree z, so the
most-important variable sits at the top of the plot after
coord_flip), z (aggregate
z-score from importance()), selected (logical,
z > cutoff).
$imp.treeNULL when faithful = FALSE;
otherwise an ntree x p matrix of per-tree importance values.
$statsPer-variable summary: variable,
median, q05, q15, q85, q95
(on z-scale when local.std = TRUE, raw when FALSE),
plus mean (raw importance mean, always stored).
$conditionalNULL when conditional = FALSE;
otherwise a data frame with columns variable, class,
z (one row per variable x class combination).
A "provenance" attribute carries family, local.std,
cutoff, faithful, conditional, xvar.names,
and n.
A fitted varpro object (required).
Logical; default TRUE. When TRUE the
per-tree importances are put on the z-scale before the box statistics
are computed. Set it FALSE to keep the raw importance scale,
which is what type = "raw" in plot.gg_varpro needs.
Numeric; the z-score above which a variable is treated as
selected. Default 0.79. A variable with aggregate z above the
cutoff is flagged selected = TRUE in $imp.
Logical; default FALSE. When TRUE,
$imp.tree is kept so plot.gg_varpro can scatter the
per-tree points over the box.
Logical; default FALSE. When TRUE, and
only for a classification forest, the $conditional.z matrix is
extracted and stored as $conditional.
Integer; keep only the top nvar variables, ranked by
median per-tree z, after the cutoff filter has been applied.
NULL keeps all of them.
Additional arguments passed to varPro::importance().
Permutation importance asks "what happens to OOB accuracy when I scramble this variable?" That works, but it leans on artificial data (the permuted column) and the answer can be unstable when variables are correlated. The varpro framework (Lu and Ishwaran, 2024) replaces permutation with release rules. The forest is grown with guided splitting; from a subset of trees varpro samples a collection of decision-rule branches; for each variable it then compares the response inside the rule's region to the response after the rule's constraint on that variable is "released". The size of that change, aggregated over many rules and trees, is the variable's importance. No synthetic covariates, no permutation: the contrast is between two real subsets of the data.
Because varpro builds importance from rules sampled over trees, every
tree contributes its own importance value for each variable. Those are
the per-tree scores we summarise here. With local.std = TRUE
(the default) the per-tree values are standardised by their column
standard deviation so the column mean equals the aggregate z-score
returned by varPro::importance(); that z-score is the canonical
"is this variable in or out?" statistic, and cutoff = 0.79 is
varpro's default selection threshold.
For a classification forest, varpro also returns a class-conditional
z table: the same importance computed restricting attention to rules
relevant to each class. conditional = TRUE keeps that table so
the plot method can show which variables matter for which class
rather than only in aggregate.
$imp is the one-row-per-variable summary: aggregate z from
varPro::importance(), plus a selected flag for
z > cutoff. $stats holds the box quantiles
(5/15/50/85/95 percentiles, plus the raw mean) computed from the
per-tree matrix; these are what the boxplot draws. $imp.tree
is the per-tree matrix itself, kept only when faithful = TRUE
so the plot method can scatter individual tree values over the box.
$conditional is the tidy class x variable z table, present
only when conditional = TRUE and the family is
classification.
rank candidate variables by importance and pick a working set above varpro's z cutoff;
see, via the boxplot's spread and the per-tree points
(faithful = TRUE), how stable each variable's importance
is across trees: a high median with a wide box is a different
story from a high median with a tight box;
for a classification forest, ask which variables drive which
class (conditional = TRUE) rather than just which
variables drive the model overall.
The z-score is a standardised ranking statistic, not a p-value or a
probability. Two variables with the same z are "similarly important
by this method", not "equally likely to be true signal". For a
data-driven cutoff rather than the 0.79 default, see
varPro::cv.varpro.
Lu, M. and Ishwaran, H. (2024). Model-independent variable selection via the rule-based variable priority framework. arXiv preprint arXiv:2409.09003.
plot.gg_varpro, gg_vimp
# \donttest{
set.seed(42)
vp <- varPro::varpro(mpg ~ ., data = mtcars, ntree = 50)
gg <- gg_varpro(vp)
print(gg)
plot(gg)
# }
Run the code above in your browser using DataLab