Explain: Approximate Shapley Values

Description

Compute fast (approximate) Shapley values for a set of features using the Monte Carlo algorithm described in Strumbelj and Kononenko (2014). An efficient algorithm for tree-based models, commonly referred to as Tree SHAP, is also supported for lightgbm (https://cran.r-project.org/package=lightgbm) and xgboost (https://cran.r-project.org/package=xgboost) models; see Lundberg et al. (2020) for details.
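
As a rough illustration of the Monte Carlo idea, the sketch below estimates one feature's Shapley value for a single observation. The helper mc_shapley() and its arguments are hypothetical and written only for this illustration; it is not the package's internal implementation, and it assumes X is a data frame and pred_wrapper returns a single number for a one-row input.

```r
# Hypothetical helper (not part of the package): Monte Carlo estimate of the
# Shapley value of `feature` for the one-row data frame `x`.
mc_shapley <- function(object, X, x, feature, pred_wrapper, nsim = 100) {
  p <- ncol(X)
  j <- match(feature, names(X))
  diffs <- numeric(nsim)
  for (m in seq_len(nsim)) {
    w <- X[sample(nrow(X), 1), , drop = FALSE]  # random background row
    perm <- sample(p)                           # random feature ordering
    pos <- which(perm == j)
    after <- perm[seq_len(p) > pos]             # features ordered after j
    b1 <- x                                     # keeps x's value for feature j
    if (length(after) > 0) b1[after] <- w[after]
    b2 <- b1
    b2[j] <- w[j]                               # feature j taken from w instead
    diffs[m] <- pred_wrapper(object, b1) - pred_wrapper(object, b2)
  }
  mean(diffs)  # average marginal contribution of feature j
}

# Usage on a small linear model
fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- mtcars[, c("wt", "hp")]
pfun <- function(object, newdata) predict(object, newdata = newdata)
set.seed(101)  # for reproducibility
phi_wt <- mc_shapley(fit, X, x = X[1, , drop = FALSE], feature = "wt",
                     pred_wrapper = pfun, nsim = 500)
```

Each repetition contributes one noisy difference in predictions, so the estimate becomes more stable as nsim grows.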

Usage

Explain(object, ...)
# S3 method for default
Explain(
  object,
  feature_names = NULL,
  X = NULL,
  nsim = 1,
  pred_wrapper = NULL,
  newdata = NULL,
  parallel = FALSE,
  ...
)
# S3 method for lm
Explain(
  object,
  feature_names = NULL,
  X,
  nsim = 1,
  pred_wrapper,
  newdata = NULL,
  exact = FALSE,
  parallel = FALSE,
  ...
)
# S3 method for xgb.Booster
Explain(
  object,
  feature_names = NULL,
  X = NULL,
  nsim = 1,
  pred_wrapper,
  newdata = NULL,
  exact = FALSE,
  parallel = FALSE,
  ...
)
# S3 method for lgb.Booster
Explain(
  object,
  feature_names = NULL,
  X = NULL,
  nsim = 1,
  pred_wrapper,
  newdata = NULL,
  exact = FALSE,
  parallel = FALSE,
  ...
)

Value

An object of class Explain with the following components:

newdata: The data set, formatted as a data frame, that was used to estimate the Shapley values. Categorical variables, if any, are one-hot encoded.
phis: A list containing the Shapley values for the individual variables.
fnull: The expected value of the model's predictions.
fx: The prediction value for each observation.
factor_names: The name(s) of the categorical variable(s). If the data contain only continuous or dummy variables, this is set to NULL.

Arguments

object

A fitted model object (e.g., a ranger::ranger() or xgboost::xgboost() object, to name a few).

...

Additional arguments to be passed

feature_names

Character string giving the names of the predictor variables (i.e., features) of interest. If NULL (default), they will be taken from the column names of X.

X

A matrix-like R object (e.g., a data frame or matrix) containing ONLY the feature columns from the training data (or a suitable background data set). If the input includes categorical variables that need to be one-hot encoded, supply data that has already been processed using data.table::one_hot(). For XGBoost models, supply the raw data set containing only the explanatory variables, not an object created with xgb.DMatrix. **NOTE:** This argument is required whenever exact = FALSE.

nsim

The number of Monte Carlo repetitions to use for estimating each Shapley value (only used when exact = FALSE). Default is 1. **NOTE:** To obtain the most accurate results, nsim should be set as large as is feasible.

pred_wrapper

Prediction function that requires two arguments, object and newdata. **NOTE:** This argument is required whenever exact = FALSE. The required output of this function depends on the type of model:

Regression

A numeric vector of predicted outcomes.

Binary classification

A vector of predicted class probabilities for the reference class.

Multiclass classification

A vector of predicted class probabilities for the reference class.
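
The two wrappers below are a minimal sketch of what such prediction functions might look like for base R models. The names pfun_reg and pfun_bin are illustrative only, and treating am = 1 as the class of interest is an assumption made for this example.

```r
# Regression: return a plain numeric vector of predicted outcomes
fit_reg <- lm(mpg ~ wt + hp, data = mtcars)
pfun_reg <- function(object, newdata) {
  as.numeric(predict(object, newdata = newdata))
}

# Binary classification: return the predicted probability of one class
# (here the probability that am == 1, as modeled by the logistic regression)
fit_bin <- glm(am ~ wt + hp, data = mtcars, family = binomial())
pfun_bin <- function(object, newdata) {
  as.numeric(predict(object, newdata = newdata, type = "response"))
}

# Both wrappers return one value per row of newdata
p_reg <- pfun_reg(fit_reg, mtcars[1:5, ])
p_bin <- pfun_bin(fit_bin, mtcars[1:5, ])
```

Wrapping the result in as.numeric() drops names and guards against predict methods that return a one-column matrix.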

newdata

A matrix-like R object (e.g., a data frame or matrix) containing ONLY the feature columns for the observation(s) of interest; that is, the observation(s) you want to compute explanations for. Default is NULL, which will produce approximate Shapley values for all the rows in X (i.e., the training data). If the input includes categorical variables that need to be one-hot encoded, supply data that has already been processed using data.table::one_hot().

parallel

Logical indicating whether or not to compute the approximate Shapley values in parallel across features; default is FALSE. **NOTE:** setting parallel = TRUE requires setting up an appropriate (i.e., system-specific) *parallel backend* as described in the foreach package (https://cran.r-project.org/package=foreach); for details, see vignette("foreach", package = "foreach") in R.
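
One common way to register a backend is through the doParallel package, sketched below. The choice of doParallel and of two workers is an assumption made for illustration; any foreach-compatible backend should serve the same purpose.

```r
# Sketch: register a two-worker foreach backend with doParallel (if installed)
if (requireNamespace("doParallel", quietly = TRUE)) {
  cl <- parallel::makeCluster(2)           # start two local worker processes
  doParallel::registerDoParallel(cl)       # make them the foreach backend
  n_workers <- foreach::getDoParWorkers()  # number of registered workers

  # ... call Explain(..., parallel = TRUE) here ...

  parallel::stopCluster(cl)                # shut the workers down
  foreach::registerDoSEQ()                 # return to sequential execution
}
```

Stopping the cluster when finished avoids leaving idle worker processes behind.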

exact

Logical indicating whether to compute exact Shapley values. Currently only available for stats::lm() (https://CRAN.R-project.org/package=STAT), xgboost::xgboost() (https://CRAN.R-project.org/package=xgboost), and lightgbm::lightgbm() (https://CRAN.R-project.org/package=lightgbm) objects. Default is FALSE. Note that setting exact = TRUE will return explanations for each of the stats::terms() in a stats::lm() object.
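
To give a sense of what per-term explanations for a linear model look like, the sketch below uses predict() with type = "terms" from base R. Whether Explain() computes its exact values this way internally is an assumption; the example only shows that these centered term contributions, added to a baseline, reproduce the model's predictions.

```r
# Per-term contributions for a linear model using base R only
fit <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

# One column per model term; each column is centered on the training data
contrib <- predict(fit, newdata = mtcars, type = "terms")

# Baseline: the average fitted value over the training data
baseline <- attr(contrib, "constant")

# Baseline plus the sum of term contributions recovers each prediction
recon <- unname(baseline + rowSums(contrib))
preds <- unname(predict(fit, newdata = mtcars))
max_gap <- max(abs(recon - preds))
```

Note that the factor(cyl) term receives a single contribution column rather than one column per dummy variable, which matches the term-level behavior described above.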

References

Strumbelj, E., and Kononenko, I. (2014). Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3), 647–665.

Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., and Lee, S.-I. (2020). From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1), 56–67.

Examples


# \donttest{
#
# A projection pursuit regression (PPR) example
#

# Load the sample data; see datasets::mtcars for details
data(mtcars)

# Fit a projection pursuit regression model
fit <- ppr(mpg ~ ., data = mtcars, nterms = 5)

# Prediction wrapper
pfun <- function(object, newdata) {  # needs to return a numeric vector
  predict(object, newdata = newdata)  
}

# Compute approximate Shapley values using 10 Monte Carlo simulations
set.seed(101)  # for reproducibility
shap <- Explain(fit, X = subset(mtcars, select = -mpg), nsim = 10, 
                pred_wrapper = pfun)
# }
