Compute estimates of and confidence intervals for nonparametric ANOVA-based intrinsic variable importance. This is a wrapper function for cv_vim, with type = "anova".
This function is deprecated in vimp version 2.0.0.
vimp_regression(
Y = NULL,
X = NULL,
cross_fitted_f1 = NULL,
cross_fitted_f2 = NULL,
indx = 1,
V = 10,
run_regression = TRUE,
SL.library = c("SL.glmnet", "SL.xgboost", "SL.mean"),
alpha = 0.05,
delta = 0,
na.rm = FALSE,
cross_fitting_folds = NULL,
stratified = FALSE,
C = rep(1, length(Y)),
Z = NULL,
ipc_weights = rep(1, length(Y)),
scale = "identity",
ipc_est_type = "aipw",
scale_est = TRUE,
cross_fitted_se = TRUE,
...
)
An object of classes vim and vim_regression. See Details for more information.
Y: the outcome.
X: the covariates.
cross_fitted_f1: the predicted values on validation data from a flexible estimation technique regressing Y on X in the training data; a list of length V, where each object is a set of predictions on the validation data. If sample-splitting is requested, then these must be estimated specially; see Details.
cross_fitted_f2: the predicted values on validation data from a flexible estimation technique regressing either (a) the fitted values in cross_fitted_f1, or (b) Y, on X withholding the columns in indx; a list of length V, where each object is a set of predictions on the validation data. If sample-splitting is requested, then these must be estimated specially; see Details.
indx: the indices of the covariate(s) to calculate variable importance for; defaults to 1.
V: the number of folds for cross-fitting; defaults to 10, as shown in the usage above. If sample_splitting = TRUE, then a special type of V-fold cross-fitting is done. See Details for a more detailed explanation.
run_regression: if outcome Y and covariates X are passed to cv_vim, and run_regression is TRUE, then Super Learner will be used; otherwise, variable importance will be computed using the supplied fitted values.
SL.library: a character vector of learners to pass to SuperLearner, if f1 and f2 are Y and X, respectively. Defaults to SL.glmnet, SL.xgboost, and SL.mean.
alpha: the level to compute the confidence interval at. Defaults to 0.05, corresponding to a 95% confidence interval.
delta: the value of the \(\delta\)-null (i.e., testing if importance < \(\delta\)); defaults to 0.
na.rm: should we remove NAs in the outcome and fitted values in computation? (defaults to FALSE)
cross_fitting_folds: the folds for cross-fitting. Only used if run_regression = FALSE.
stratified: if run_regression = TRUE, should the generated folds be stratified based on the outcome? (Stratifying helps to ensure class balance across cross-fitting folds.)
C: the indicator of coarsening (1 denotes observed, 0 denotes unobserved).
Z: either (i) NULL (the default, in which case the argument C above must be all ones), or (ii) a character vector specifying the variable(s) among Y and X that are thought to play a role in the coarsening mechanism.
ipc_weights: weights for the computed influence curve (i.e., inverse probability weights for coarsened-at-random settings). Assumed to be already inverted (i.e., ipc_weights = 1 / [estimated probability weights]).
scale: should CIs be computed on the original ("identity", default) or logit ("logit") scale?
ipc_est_type: the type of procedure used for coarsened-at-random settings; options are "ipw" (for inverse probability weighting) or "aipw" (for augmented inverse probability weighting). Only used if C is not all equal to 1.
scale_est: should the point estimate be scaled to be greater than 0? Defaults to TRUE.
cross_fitted_se: should we use cross-fitting to estimate the standard errors (TRUE, the default) or not (FALSE)?
...: other arguments to the estimation tool, see "See also".
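To make the roles of alpha and scale concrete, here is a minimal sketch of a Wald-type interval on the identity scale and on the logit scale (via the delta method). The point estimate and standard error are hypothetical numbers chosen for illustration, and the calculation is a generic textbook construction; it is not taken from the vimp source and the package's internal computation may differ.

```r
# Hypothetical inputs (not produced by vimp): a point estimate and its standard error
est <- 0.30
se <- 0.05
alpha <- 0.05

# Normal quantile for a two-sided (1 - alpha) interval
z <- stats::qnorm(1 - alpha / 2)

# Identity scale: est +/- z * se
ci_identity <- est + c(-1, 1) * z * se

# Logit scale: delta-method standard error, then back-transform to (0, 1)
se_logit <- se / (est * (1 - est))
ci_logit <- stats::plogis(stats::qlogis(est) + c(-1, 1) * z * se_logit)

print(rbind(identity = ci_identity, logit = ci_logit))
```

An interval built on the logit scale always stays inside (0, 1), which can matter when the estimate is close to a boundary.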
We define the population ANOVA parameter for the group of features (or single feature) \(s\) by $$\psi_{0,s} := E_0\{f_0(X) - f_{0,s}(X)\}^2/var_0(Y),$$ where \(f_0\) is the population conditional mean using all features, \(f_{0,s}\) is the population conditional mean using the features with index not in \(s\), and \(E_0\) and \(var_0\) denote expectation and variance under the true data-generating distribution, respectively.
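As a rough illustration of this definition, the sketch below approximates \(\psi_{0,s}\) by Monte Carlo in a toy model where both conditional means are known in closed form. The data-generating model (independent standard normal covariates, a linear mean, unit-variance noise) is an assumption made purely for this example; none of it comes from the vimp package.

```r
# Toy model (assumed for illustration): Y = X1 + 2 * X2 + noise,
# with X1, X2, and the noise independent standard normals.
set.seed(1)
n <- 1e5
x1 <- stats::rnorm(n)
x2 <- stats::rnorm(n)
y <- x1 + 2 * x2 + stats::rnorm(n)

# Population conditional means for s = {2}:
# f_0(X) uses both features; f_{0,s}(X) = E(Y | X1) = X1 because X2 is independent of X1
f0 <- x1 + 2 * x2
f0s <- x1

# Monte Carlo approximation of E{f_0(X) - f_{0,s}(X)}^2 / var(Y)
psi_mc <- mean((f0 - f0s)^2) / stats::var(y)

# Closed form in this toy model: E(2 * X2)^2 / var(Y) = 4 / 6
psi_true <- 4 / 6
print(c(monte_carlo = psi_mc, closed_form = psi_true))
```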
Cross-fitted ANOVA estimates are computed by first splitting the data into \(K\) folds; then using each fold in turn as a hold-out set, constructing estimators \(f_{n,k}\) and \(f_{n,k,s}\) of \(f_0\) and \(f_{0,s}\), respectively, on the training data and estimator \(E_{n,k}\) of \(E_0\) using the test data; and finally, computing $$\psi_{n,s} := K^{-1}\sum_{k=1}^K E_{n,k}\{f_{n,k}(X) - f_{n,k,s}(X)\}^2/var_n(Y),$$ where \(var_n\) is the empirical variance. See the paper by Williamson, Gilbert, Simon, and Carone for more details on the mathematics behind this function.
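The cross-fitting recipe above can be sketched in base R as follows. This is only an illustration of the displayed formula under stated assumptions: stats::lm stands in for the flexible learner (the package uses Super Learner), the simulated data and the choice K = 5 are made up for the example, stats::var (with its n - 1 denominator) is used for the empirical variance, and no standard errors or confidence intervals are computed. It should not be read as the package's implementation.

```r
# Simulated data (assumed for illustration): Y = X1 + 2 * X2 + noise
set.seed(2)
n <- 2000
K <- 5
dat <- data.frame(x1 = stats::rnorm(n), x2 = stats::rnorm(n))
dat$y <- dat$x1 + 2 * dat$x2 + stats::rnorm(n)

# Randomly assign each observation to one of K folds
folds <- sample(rep(seq_len(K), length.out = n))

# For each fold k: fit full and reduced regressions on the training data,
# then average the squared difference in predictions on the held-out fold
fold_means <- vapply(seq_len(K), function(k) {
  train <- dat[folds != k, ]
  test <- dat[folds == k, ]
  full_fit <- stats::lm(y ~ x1 + x2, data = train)  # plays the role of f_{n,k}
  reduced_fit <- stats::lm(y ~ x1, data = train)    # plays the role of f_{n,k,s}, s = {2}
  diff <- stats::predict(full_fit, newdata = test) -
    stats::predict(reduced_fit, newdata = test)
  mean(diff^2)
}, numeric(1))

# Average over folds and scale by the empirical variance of the outcome
psi_n <- mean(fold_means) / stats::var(dat$y)
print(psi_n)
```

In this toy model the population value for the second feature is 4/6, so the printed estimate should land near 0.67.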
SuperLearner for specific usage of the SuperLearner function and package.
# generate the data
# generate X
p <- 2
n <- 100
x <- data.frame(replicate(p, stats::runif(n, -5, 5)))
# apply the function to the x's
smooth <- (x[,1]/5)^2*(x[,1]+7)/5 + (x[,2]/3)^2
# generate Y ~ Normal (smooth, 1)
y <- smooth + stats::rnorm(n, 0, 1)
# set up a library for SuperLearner; note simple library for speed
library("SuperLearner")
# load vimp, which provides vimp_regression
library("vimp")
learners <- c("SL.glm", "SL.mean")
# estimate (with a small number of folds, for illustration only)
est <- vimp_regression(y, x, indx = 2,
                       alpha = 0.05, run_regression = TRUE,
                       SL.library = learners, V = 2, cvControl = list(V = 2))