emvd: Exclusion-respecting Marginal explained-Variance Decomposition indices for linear and logistic models

Description

emvd computes the E-MVD indices, derived from Feldman (2005), using the explained variance ($R^2$) as the performance metric. They are also known as PMVD (Proportional Marginal Variance Decomposition) indices. They provide relative importance indices by decomposing the $R^2$ of linear and logistic regression models. Each input is allocated a share of $R^2$ under a proportional attribution scheme, so that covariates with null regression coefficients receive an index equal to 0, even when they are dependent on other covariates (exclusion principle).

Usage

emvd(X, y, logistic = FALSE, tol = NULL, rank = FALSE, nboot = 0, 
    conf = 0.95, max.iter = 1000, parl = NULL)
# S3 method for emvd
print(x, ...)
# S3 method for emvd
plot(x, ylim = c(0,1), ...)

Value

emvd returns a list of class "emvd", containing the following components:

call: the matched call.
emvd: the estimations of the E-MVD indices.
R2s: the estimations of the $R^2$ for all possible sub-models.
indices: list of all subsets corresponding to the structure of R2s.
P: the values of $P(.)$ of all subsets for recursive computing. Equal to NULL if bootstrap estimates are made.
conf_int: a matrix containing the estimates, bias and confidence intervals obtained by bootstrap (if nboot>0).
X: the observed covariates.
y: the observed outcomes.
logistic: logical. TRUE if the analysis has been made by logistic regression.
boot: logical. TRUE if bootstrap estimates have been produced.
nboot: number of bootstrap replicates.
rank: logical. TRUE if a rank analysis has been made.
parl: number of chosen cores for the computation.
conf: level for the confidence intervals by bootstrap.

Arguments

X: a matrix or data frame containing the observed covariates (i.e., features, input variables...).
y: a numeric vector containing the observed outcomes (i.e., dependent variable). If logistic=TRUE, can be a numeric vector of zeros and ones, or a logical vector, or a factor.
logistic: logical. If TRUE, the analysis is done via a logistic regression (binomial GLM).
tol: covariates with absolute marginal contributions less than or equal to tol are omitted. By default, if tol=NULL, only covariates with no marginal contribution are omitted.
rank: logical. If TRUE, the analysis is done on the ranks.
nboot: the number of bootstrap replicates for the computation of confidence intervals.
conf: the confidence level of the bootstrap confidence intervals.
max.iter: if logistic=TRUE, the maximum number of iterative optimization steps allowed for the logistic regression. Default is 1000.
parl: number of cores on which to parallelize the computation. If NULL, then no parallelization is done.
x: the object returned by emvd.
ylim: the y-coordinate limits of the plot.
...: arguments to be passed to methods, such as graphical parameters (see par).

Author

Marouane Il Idrissi

Details

The computation of the E-MVD is done using the recursive method defined in Feldman (2005), but using the subset procedure defined in Broto, Bachoc and Depecker (2020), that is computing all the $R^2$ for all possible sub-models first, and then computing $P(.)$ recursively for all subsets of covariates.
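
The two stages described above can be illustrated with a minimal base-R sketch. It is not the package's implementation. It assumes that the recursion is run on the marginal contributions w(S) = R2(D) - R2(D \ S), where D is the full set of covariates, with P(empty set) = 1, P(S) = w(S) / sum over i in S of 1 / P(S \ {i}), and that the index of covariate i is P(D) / P(D \ {i}). The names key, subsets, r2, w, P and idx are illustrative only.

```r
# Sketch only, under the assumptions stated above; not the code used by emvd.
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 3), n, 3)
X[, 3] <- 0.8 * X[, 2] + 0.6 * X[, 3]          # make inputs 2 and 3 correlated
y <- as.numeric(X %*% c(1, -2, 3) + rnorm(n))
d <- ncol(X)
D <- seq_len(d)

# Label a subset of covariate positions with a character key.
key <- function(S) if (length(S) == 0) "empty" else paste(sort(S), collapse = ",")

# All non-empty subsets, ordered by increasing size.
subsets <- unlist(lapply(D, function(k) combn(d, k, simplify = FALSE)),
                  recursive = FALSE)

# Stage 1: R^2 of every sub-model (the empty model has R^2 = 0).
r2 <- c(0, sapply(subsets, function(S) {
  summary(lm(y ~ X[, S, drop = FALSE]))$r.squared
}))
names(r2) <- c("empty", sapply(subsets, key))

# Assumed marginal contribution of S: loss of R^2 when S is removed from D.
w <- function(S) r2[[key(D)]] - r2[[key(setdiff(D, S))]]

# Stage 2: recursive computation of P(.) over subsets of increasing size.
P <- c(empty = 1)
for (S in subsets) {
  denom <- sum(sapply(S, function(i) 1 / P[[key(setdiff(S, i))]]))
  P[key(S)] <- w(S) / denom
}

# Index of each covariate; by construction the indices sum to the full R^2.
idx <- sapply(D, function(i) P[[key(D)]] / P[[key(setdiff(D, i))]])
print(idx)
print(c(sum_of_indices = sum(idx), full_R2 = r2[[key(D)]]))
```

Computing every $R^2$ once and storing it by subset avoids refitting the same sub-model at each step of the recursion, at the cost of fitting 2^d - 1 models up front.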

For logistic regression (logistic=TRUE), the $R^2$ value is equal to: $$R^2 = 1-\frac{\textrm{model deviance}}{\textrm{null deviance}}$$
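
As an illustration, this deviance-based $R^2$ can be obtained from any binomial GLM fitted with base R. The sketch below uses simulated data and hypothetical variable names (x1, x2, r2_dev). The maxit setting plays the role of max.iter here; whether emvd passes it to the fitting routine in this way is an assumption.

```r
# Sketch only: deviance-based R^2 for a logistic regression fitted with glm().
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- as.numeric(x1 - 2 * x2 + rnorm(n) > 0)

fit <- glm(y ~ x1 + x2, family = binomial,
           control = glm.control(maxit = 1000))

# R^2 = 1 - (model deviance) / (null deviance)
r2_dev <- 1 - fit$deviance / fit$null.deviance
print(r2_dev)
```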

The rank-based indices cannot be computed if a logistic regression model is used (logistic = TRUE) or if any column of X is categorical (i.e., of class factor). In both cases, rank = FALSE is forced, with a warning.

If the parl argument exceeds the number of cores available on the machine, the number of cores used is set to the number of available cores minus one.

Spurious covariates are defined by the tol argument. If tol = NULL, covariates with: $$w(\{i\}) = 0$$ are omitted, and their emvd index is set to zero. Otherwise, the spurious covariates are those satisfying: $$|w(\{i\})| \leq \textrm{tol}$$
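
The detection rule can be sketched in base R as follows. This is an illustration rather than the package's code: it assumes that w({i}) is the loss of $R^2$ when covariate i is removed from the full model, and the helper r2, the vector w_single and the threshold value are hypothetical.

```r
# Sketch only: flag covariates whose assumed marginal contribution is within tol.
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 3), n, 3)
# The third covariate has a null regression coefficient.
y <- as.numeric(X[, 1:2] %*% c(1, -2) + rnorm(n))

# R^2 of the linear sub-model built on the given columns (0 for the empty model).
r2 <- function(cols) {
  if (length(cols) == 0) return(0)
  summary(lm(y ~ X[, cols, drop = FALSE]))$r.squared
}

D <- seq_len(ncol(X))

# Assumed definition: w({i}) = R2(D) - R2(D \ {i}).
w_single <- sapply(D, function(i) r2(D) - r2(setdiff(D, i)))

tol <- 0.01                      # hypothetical threshold, chosen for illustration
spurious <- abs(w_single) <= tol
print(data.frame(covariate = D, w = w_single, spurious = spurious))
```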

References

Broto B., Bachoc F. and Depecker M. (2020) Variance Reduction for Estimation of Shapley Effects and Adaptation to Unknown Input Distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2).

D.V. Budescu (1993). Dominance analysis: A new approach to the problem of relative importance of predictors in multiple regression. Psychological Bulletin, 114:542-551.

Feldman, B. (2005) Relative Importance and Value. SSRN Electronic Journal.

U. Gromping (2006). Relative importance for linear regression in R: the Package relaimpo. Journal of Statistical Software, 17:1-27.

M. Il Idrissi, V. Chabridon and B. Iooss (2021). Mesures d'importance relative par décompositions de la performance de modèles de régression, Actes des 52èmes Journées de Statistiques de la Société Française de Statistique (SFdS), pp 497-502, Nice, France, Juin 2021.

Examples


library(sensitivity)  # provides emvd()
library(parallel)
library(gtools)
library(boot)

library(mvtnorm)

set.seed(1234)
n <- 100
beta<-c(1,-2,3)
sigma<-matrix(c(1,0,0,
                0,1,-0.8,
                0,-0.8,1),
              nrow=3,
              ncol=3)

############################
# Gaussian correlated inputs

X <-rmvnorm(n, rep(0,3), sigma)

#############################
# Linear Model

y <- X%*%beta + rnorm(n)

# Without Bootstrap confidence intervals
x<-emvd(X, y)
print(x)
plot(x)

# With Bootstrap confidence intervals
x<-emvd(X, y, nboot=100, conf=0.95)
print(x)
plot(x)

# Rank-based analysis
x<-emvd(X, y, rank=TRUE, nboot=100, conf=0.95)
print(x)
plot(x)

############################
# Logistic Regression
y<-as.numeric(X%*%beta + rnorm(n)>0)
x<-emvd(X,y, logistic = TRUE)
plot(x)
print(x)

# Parallel computing
#x<-emvd(X,y, logistic = TRUE, parl=2)
#plot(x)
#print(x)
