# generate the data
# generate X
p <- 2
n <- 100
x <- data.frame(replicate(p, stats::runif(n, -1, 1)))
# apply the function to the x's
f <- function(x) 0.5 + 0.3*x[1] + 0.2*x[2]
smooth <- apply(x, 1, function(z) f(z))
# generate Y ~ Bernoulli (smooth)
y <- matrix(stats::rbinom(n, size = 1, prob = smooth))
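# optional sanity check (base R only, not part of the original example): the
# conditional mean f(x) lies in [0, 1], so it is a valid Bernoulli success
# probability, and the outcome is not too unbalanced
range(smooth)
mean(y)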
# load vimp (which provides vim()) and set up a library for SuperLearner;
# note: a simple library is used for speed
library("vimp")
library("SuperLearner")
learners <- c("SL.glm")
# estimate the importance of X2 (indx = 2) using Y and X directly;
# stratified = TRUE gives class-balanced sample-splitting folds
est_1 <- vim(y, x, indx = 2, type = "accuracy",
             alpha = 0.05, run_regression = TRUE,
             SL.library = learners, cvControl = list(V = 2),
             stratified = TRUE)
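# optional: inspect the first estimate; printing the returned object shows a
# summary (in recent vimp releases this includes the point estimate,
# confidence interval, and p-value)
print(est_1)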
# alternatively, use pre-computed fitted values (run_regression = FALSE);
# split the data according to est_1's sample-splitting folds so that the full
# and reduced regressions are fit on separate halves
set.seed(4747)
V <- 2
y_1 <- y[est_1$sample_splitting_folds == 1]
y_2 <- y[est_1$sample_splitting_folds == 2]
x_1 <- subset(x, est_1$sample_splitting_folds == 1)
x_2 <- subset(x, est_1$sample_splitting_folds == 2)
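# fit the full regression (using both X1 and X2) on the first half of the
# data; its fitted values are passed to vim() below as f1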
full_fit <- SuperLearner::SuperLearner(Y = y_1, X = x_1,
                                       SL.library = learners,
                                       cvControl = list(V = V))
full_fitted <- SuperLearner::predict.SuperLearner(full_fit)$pred
# for the reduced regression, first refit the full regression on the second
# half of the data, then regress its fitted values on X1 only (dropping X2)
full_fit_2 <- SuperLearner::SuperLearner(Y = y_2, X = x_2,
                                         SL.library = learners,
                                         cvControl = list(V = V))
full_fitted_2 <- SuperLearner::predict.SuperLearner(full_fit_2)$pred
reduced_fit <- SuperLearner::SuperLearner(Y = full_fitted_2,
                                          X = x_2[, -2, drop = FALSE],
                                          SL.library = learners,
                                          cvControl = list(V = V))
reduced_fitted <- SuperLearner::predict.SuperLearner(reduced_fit)$pred
est_2 <- vim(Y = y, f1 = full_fitted, f2 = reduced_fitted,
             indx = 2, run_regression = FALSE, alpha = 0.05,
             stratified = TRUE, type = "accuracy",
             sample_splitting_folds = est_1$sample_splitting_folds)
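# optional: inspect the second estimate and compare it with the first; the two
# should be similar since est_2 reuses est_1's sample-splitting folds
print(est_2)
# point estimates from the two approaches (the $est element is assumed here,
# as listed in vim()'s return value; print() alone also works)
est_1$est
est_2$est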