shapleySubsetMc: Estimation of Shapley effects from data using nearest neighbors method

Description

shapleySubsetMc estimates the Shapley effects from data, using a nearest-neighbour method to generate samples according to the conditional distributions of the inputs. It can be used with categorical inputs.

Usage

shapleySubsetMc(X,Y, Ntot=NULL, Ni=3, cat=NULL, weight=NULL, discrete=NULL, noise=FALSE)
# S3 method for shapleySubsetMc
plot(x, ylim = c(0, 1), ...)

Value

shapleySubsetMc returns a list of class "shapleySubsetMc", containing:

shapley: the Shapley effects estimates.
cost: the real total cost of these estimates: the total number of points for which the nearest neighbours were computed.
names: the labels of the input variables.
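
As a rough illustration of the returned list, the sketch below fits a small toy problem and prints each component. The toy data, the seed, and the values of n and Ntot are assumptions chosen only to keep the run short; the code is skipped if the sensitivity package is not installed.

```r
# Sketch: inspect the components of the list returned by shapleySubsetMc.
# The data below are illustrative only and are not part of the package.
if (requireNamespace("sensitivity", quietly = TRUE)) {
  set.seed(1)
  n <- 200
  X <- matrix(rnorm(3 * n), nrow = n, ncol = 3)  # three independent Gaussian inputs
  Y <- rowSums(X)                                # output is a function of X
  res <- sensitivity::shapleySubsetMc(X = X, Y = Y, Ntot = 500)
  print(res$shapley)  # Shapley effects estimates
  print(res$cost)     # number of points for which nearest neighbours were computed
  print(res$names)    # labels of the input variables
}
```

Since cost reports the real total cost, it may differ from the approximate cost requested through Ntot.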

Arguments

X: a matrix or a dataframe of the input sample
Y: a vector of the output sample
Ntot: an integer giving the approximate desired cost
Ni: the number of nearest neighbours taken for each point
cat: a vector giving the indices of the input categorical variables
weight: a vector with the same length as cat giving the weight of each categorical variable in the product distance
discrete: a vector giving the indices of the input variables that are real-valued (not categorical) but can take the same value several times
noise: logical. If FALSE (the default), Y is assumed to be a function of X
x: a list of class "shapleySubsetMc" storing the state of the sensitivity study (Shapley effects, cost, names of inputs)
ylim: y-coordinate plotting limits
...: any other arguments for plotting

Author

Baptiste Broto

Details

If weight = NULL, all the categorical variables will have the same weight 1.

If Ntot = NULL, the nearest neighbours will be computed for all the \(n (2^p-2)\) points, where n is the sample size and p is the number of input variables. The estimation can be very slow with this setting.
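
To get a feel for how quickly this cost grows with the dimension, the formula can be evaluated directly. The helper name full_cost is hypothetical and is not part of the package; the values of n and p are the ones used in the examples on this page.

```r
# Number of points for which nearest neighbours are computed when Ntot = NULL,
# following the n * (2^p - 2) formula given in the Details.
# full_cost is a hypothetical helper, not a function of the sensitivity package.
full_cost <- function(n, p) n * (2^p - 2)

full_cost(500, 4)  # 7000
full_cost(500, 8)  # 127000
```

Going from p = 4 to p = 8 at the same sample size multiplies the cost by about 18, which is why the examples pass an explicit Ntot.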

References

B. Broto, F. Bachoc, M. Depecker, 2020, Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution, SIAM/ASA Journal on Uncertainty Quantification, 8:693-716.

Examples

# \donttest{

# First example: the linear Gaussian framework

library(sensitivity) # provides shapleySubsetMc and sobol.fun

# we generate a covariance matrix Sigma
p <- 4 #dimension
A <- matrix(rnorm(p^2),nrow=p,ncol=p)
Sigma <- t(A)%*%A # symmetric positive semi-definite by construction
C <- chol(Sigma)
n <- 500 #sample size (put n=2000 for more consistency)

Z=matrix(rnorm(p*n),nrow=n,ncol=p)
X=Z%*%C # X is a gaussian vector with zero mean and covariance Sigma
Y=rowSums(X) 
Shap=shapleySubsetMc(X=X,Y=Y,Ntot=5000)
plot(Shap)


#Second example: The Sobol model with heterogeneous inputs

p=8 #dimension
A=matrix(rnorm(p^2),nrow=p,ncol=p)
Sigma=t(A)%*%A
C=chol(Sigma)
n=500 #sample size (put n=5000 for more consistency)

Z=matrix(rnorm(p*n),nrow=n,ncol=p)
X=Z

#we create discrete and categorical variables
X[,1]=round(X[,1]/2) 
X[,2]=X[,2]>2
X[,4]=-2*round(X[,4])+4
X[(X[,6]>0 &X[,6]<1),6]=1

cat=c(1,2)  # we choose to take X1 and X2 as categorical variables (with the discrete distance)
discrete=c(4,6) # we indicate that X4 and X6 can take the same value several times

Y=sobol.fun(X)
Ntot <- 2000 # put Ntot=20000 for more consistency
Shap=shapleySubsetMc(X=X,Y=Y, cat=cat, discrete=discrete, Ntot=Ntot, Ni=10)

plot(Shap)
# }
