Tools to Support Relative Importance Analysis
Overview
The {domir}
package supports the determination of the relative
importance of the inputs (i.e., independent variables, predictors, or
features) in a user’s statistical or machine learning model. The
methodology used by {domir}
is called Dominance Analysis which is
based on a series of pairwise comparisons between the model fit values
ascribed to elements in the model including comparing Shapley
values.
The intention of this package is to provide a flexible user interface to
dominance analysis—a relatively assumption-free methodology for
comparing the value of model inputs to prediction. The user interface is
structured such that {domir}
automates the decomposition of the
returned value and comparisons between model inputs and the user
provides the model inputs, the predictive model into which they are
entered, and returned value from the model to decompose.
Installation
To install the most recent version of {domir}
from CRAN use:
install.packages("domir")
{domir}
is also used as the computational engine underlying the
dominance_analysis()
function for the {parameters}
package from the {easystats}
collection.
What {domir}
Does
The primary dominance analysis function domir
implements the most
computationally intensive and programming heavy parts of dominance
analysis for the user and has relatively few requirements on the
predictive modeling functions with which it can work.
The flexibility of domir
comes at the cost of more complexity for the
user in terms of setting up a function that accepts the type of input
domir
will provide (currently only a ‘formula’) and and expects to
receive (currently only a numeric scalar).
Below these ideas are outlined in greater detail in the context of a few
examples. The next section begins the discussion with a more extensive
comparison of domir
with packages that implement similar methods.
Comparison with Existing Relative Importance Packages
The domir
function implements the same method as the “lmg” type for
the calc.relimpo
function in the {relaimpo}
package. domir
can
replicate the results produced by both the above package but, as will be
seen, requires more user input.
To illustrate these points, consider the following example linear regression on which all three of the dominance analysis results to come are based:
lm(mpg ~ am + vs + cyl, data = mtcars)
Classic dominance analysis uses the variance explained $R^2$ as fit
statistic (i.e., as implemented by lm
’s summary
method) and so will
this example.
{domir}
’s domir
Implementing a ‘classic’ dominance analysis on this linear regression in
domir
can be inputted as:
lm_wrapper <-
function(formula, data) {
lm(formula, data = data) |>
summary() |>
_[["r.squared"]]
}
domir(mpg ~ am + vs + cyl, lm_wrapper, data = mtcars)
## Overall Value: 0.7619773
##
## General Dominance Values:
## General Dominance Standardized Ranks
## am 0.1774892 0.2329324 3
## vs 0.2027032 0.2660226 2
## cyl 0.3817849 0.5010450 1
##
## Conditional Dominance Values:
## Subset Size: 1 Subset Size: 2 Subset Size: 3
## am 0.3597989 0.1389842 0.033684441
## vs 0.4409477 0.1641982 0.002963748
## cyl 0.7261800 0.3432799 0.075894823
##
## Complete Dominance Designations:
## Dmnated?am Dmnated?vs Dmnated?cyl
## Dmnates?am NA NA FALSE
## Dmnates?vs NA NA FALSE
## Dmnates?cyl TRUE TRUE NA
In domir
, the lm
model is not submitted directly. Rather, it is
wrapped into a function (i.e., lm_wrapper
) that, in this case, accepts
two arguments; formula or an R formula and data a data frame in
which the independent variables in the formula are present. The result
of the lm
submitted into the summary
function and the result is then
filtered to just the r.squared element and returned.
What domir
does automate taking subsets of the formula and submit
them, repeatedly until all possible subsets have been submitted, to
lm_wrapper
(see this
vignette
for a conceptual discussion of dominance analysis). In this way, domir
is a Map
- or lapply
-like function as it receives an object on which
to operate (i.e., the formula) and a function to which to apply to it.
domir
expects a numeric scalar to be returned from the function.
Like lapply
, other arguments (data = mtcars
) can also be passed to
each call of the function and must be explicitly built into the wrapper
function.
What is important to note about domir
that differs from other
dominance analysis-oriented functions discussed below is that domir
expects that the user will supply the analysis pipeline linking the
formula it passes to the numeric scalar value that it expects. This
‘supply the pipeline’ approach makes domir
far more flexible than
other implementations but does require the user to think more carefully
about how to structure the pipeline.
Note that the focus of domir
’s print
-ed results focuses on the
numerical results from “General Dominance Values” and “Conditional
Dominance Values” and, a logical matrix of “Complete Dominance
Designations”.
See also the (now superseded) domir::domin
function for another
approach to structuring the input pipeline for dominance analysis.
{relaimpo}
’s calc.relimp
with type = "lmg"
{relaimpo}
is not a dominance analysis software but does produce
general dominance value decomposition for linear regression using the
explained variance $R^2$ in the calc.relimp
function with the argument
type = "lmg"
.
relaimpo::calc.relimp(mpg ~ am + vs + cyl, data = mtcars, type = "lmg")
## Response variable: mpg
## Total response variance: 36.3241
## Analysis based on 32 observations
##
## 3 Regressors:
## am vs cyl
## Proportion of variance explained by model: 76.2%
## Metrics are not normalized (rela=FALSE).
##
## Relative importance metrics:
##
## lmg
## am 0.1774892
## vs 0.2027032
## cyl 0.3817849
##
## Average coefficients for different model sizes:
##
## 1X 2Xs 3Xs
## am 7.244939 4.316851 3.026480
## vs 7.940476 2.995142 1.294614
## cyl -2.875790 -2.795816 -2.137632
calc.relimp
has a similar to structure to that of domir
but does not
require a pipeline function. This is because {relaimpo}
is specialized
and works only with lm
models and the variance explained $R^2$ as a
fit statistic. calc.relimp
also allows for multiple methods of
submitting (i.e., correlation matrices, fitted lm
object, a
data.frame
) given that it always implements the same model.
calc.relimp
’s printed results provide relative importance metric
values that match those obtained from domir
(i.e., the general
dominance values). In addition, calc.relimp
reports the average lm
coefficients across numbers of independent variables/$X$s in a way
similar to the conditional dominance values reported by domir
—an
additional and useful result to show the impact of inclusion of
different numbers of independent variables on obtained
coefficients/predicted values.
Again, note that {relaimpo}
is not dominance analysis-oriented and
does not report on dominance designations or dominance values other than
the general dominance values.
Further Examples
Further examples of domir
s functionality will be populated on the
{domir}
wiki.