Compute fast (approximate) Shapley values for a set of features using the Monte Carlo algorithm described in Štrumbelj and Kononenko (2014). An efficient algorithm for tree-based models, commonly referred to as Tree SHAP, is also supported for lightgbm (https://cran.r-project.org/package=lightgbm) and xgboost (https://cran.r-project.org/package=xgboost) models; see Lundberg et al. (2020) for details.
Explain(object, ...)

# S3 method for default
Explain(
  object,
  feature_names = NULL,
  X = NULL,
  nsim = 1,
  pred_wrapper = NULL,
  newdata = NULL,
  parallel = FALSE,
  ...
)

# S3 method for lm
Explain(
  object,
  feature_names = NULL,
  X,
  nsim = 1,
  pred_wrapper,
  newdata = NULL,
  exact = FALSE,
  parallel = FALSE,
  ...
)

# S3 method for xgb.Booster
Explain(
  object,
  feature_names = NULL,
  X = NULL,
  nsim = 1,
  pred_wrapper,
  newdata = NULL,
  exact = FALSE,
  parallel = FALSE,
  ...
)

# S3 method for lgb.Booster
Explain(
  object,
  feature_names = NULL,
  X = NULL,
  nsim = 1,
  pred_wrapper,
  newdata = NULL,
  exact = FALSE,
  parallel = FALSE,
  ...
)
An object of class "Explain" with the following components:

- The data-frame-formatted data set used to estimate the Shapley values. Categorical variables, if present, are one-hot encoded.
- A list containing the Shapley values for each variable.
- The expected value of the model's predictions.
- The predicted value for each observation.
- The name(s) of the categorical variable(s); NULL if the data contain only continuous or dummy variables.
object
A fitted model object (e.g., a ranger::ranger() or xgboost::xgboost() object, to name a few).

...
Additional arguments to be passed on to other methods.
feature_names
Character string giving the names of the predictor variables (i.e., features) of interest. If NULL (default), they will be taken from the column names of X.
X
A matrix-like R object (e.g., a data frame or matrix) containing ONLY the feature columns from the training data (or a suitable background data set). If the input includes categorical variables that need to be one-hot encoded, supply data that has been processed with mltools::one_hot(). For XGBoost, the input should be the raw data set containing only the explanatory variables, not an object created with xgb.DMatrix(). **NOTE:** This argument is required whenever exact = FALSE.
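As a minimal sketch of the encoding step (assuming the data.table and mltools packages are installed; the column names and toy data below are purely illustrative):

```r
# Sketch: one-hot encode factor columns before passing the data as X.
# Assumes the data.table and mltools packages are installed.
library(data.table)
library(mltools)

df <- data.frame(x1 = c(1.2, 3.4, 5.6),
                 x2 = factor(c("a", "b", "a")))

# one_hot() operates on data.tables; the factor column x2 is replaced
# by one dummy column per level
X_encoded <- one_hot(as.data.table(df))
```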
nsim
The number of Monte Carlo repetitions to use for estimating each Shapley value (only used when exact = FALSE). Default is 1. **NOTE:** To obtain the most accurate results, nsim should be set as large as is feasibly possible.
pred_wrapper
Prediction function that requires two arguments, object and newdata. **NOTE:** This argument is required whenever exact = FALSE. The output of this function should be determined according to the task:

- Regression: a numeric vector of predicted outcomes.
- Binary classification: a vector of predicted class probabilities for the reference class.
- Multiclass classification: a vector of predicted class probabilities for the reference class.
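For instance, a minimal regression wrapper, together with a hypothetical probability wrapper for a ranger probability forest (which column holds the reference class is an assumption and depends on your data), might look like:

```r
# Regression: return a numeric vector of predictions
pfun_reg <- function(object, newdata) {
  predict(object, newdata = newdata)
}

# Binary classification with a ranger probability forest (hypothetical):
# return the predicted probability of the reference class (here, column 1)
pfun_prob <- function(object, newdata) {
  predict(object, data = newdata)$predictions[, 1]
}
```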
newdata
A matrix-like R object (e.g., a data frame or matrix) containing ONLY the feature columns for the observation(s) of interest; that is, the observation(s) you want to compute explanations for. Default is NULL, which will produce approximate Shapley values for all the rows in X (i.e., the training data). If the input includes categorical variables that need to be one-hot encoded, supply data that has been processed with mltools::one_hot().
parallel
Logical indicating whether or not to compute the approximate Shapley values in parallel across features; default is FALSE. **NOTE:** Setting parallel = TRUE requires setting up an appropriate (i.e., system-specific) *parallel backend*, as described in the foreach package (https://cran.r-project.org/package=foreach); for details, see vignette("foreach", package = "foreach") in R.
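As a sketch, registering a backend before the call might look like the following (this assumes the doParallel package; any foreach-compatible backend works):

```r
# Register a parallel backend before calling Explain(..., parallel = TRUE).
# Assumes the doParallel package is installed.
library(doParallel)
cl <- makeCluster(2)      # start a two-worker cluster
registerDoParallel(cl)    # register it with foreach

# ... call Explain(..., parallel = TRUE) here ...

stopCluster(cl)           # shut the workers down when finished
```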
exact
Logical indicating whether to compute exact Shapley values. Currently only available for stats::lm(), xgboost::xgboost() (https://CRAN.R-project.org/package=xgboost), and lightgbm::lightgbm() (https://CRAN.R-project.org/package=lightgbm) objects. Default is FALSE. Note that setting exact = TRUE will return explanations for each of the stats::terms() in an stats::lm() object.
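For example, a sketch of exact explanations for a linear model (per the notes above, X and pred_wrapper are only required when exact = FALSE; verify the exact calling convention against the installed version):

```r
# Exact Shapley explanations for a linear model (sketch)
fit <- lm(mpg ~ cyl + wt + hp, data = mtcars)
ex <- Explain(fit, exact = TRUE)  # one explanation per model term
```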
Štrumbelj, E., and Kononenko, I. (2014). Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3), 647-665.
Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., and Lee, S.-I. (2020). From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1), 56-67.
# \donttest{
#
# A projection pursuit regression (PPR) example
#
# Load the sample data; see ?datasets::mtcars for details
data(mtcars)

# Fit a projection pursuit regression model
fit <- ppr(mpg ~ ., data = mtcars, nterms = 5)

# Prediction wrapper
pfun <- function(object, newdata) {  # needs to return a numeric vector
  predict(object, newdata = newdata)
}

# Compute approximate Shapley values using 10 Monte Carlo simulations
set.seed(101)  # for reproducibility
shap <- Explain(fit, X = subset(mtcars, select = -mpg), nsim = 10,
                pred_wrapper = pfun)
# }
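An analogous Tree SHAP sketch for an xgboost model (assumes the xgboost package is installed; whether X or newdata is needed alongside exact = TRUE may depend on the installed version of this package):

```r
# Tree SHAP for an xgboost model (sketch; assumes xgboost is installed)
library(xgboost)
X <- data.matrix(subset(mtcars, select = -mpg))
bst <- xgboost(data = X, label = mtcars$mpg, nrounds = 50, verbose = 0)
shap_xgb <- Explain(bst, X = as.data.frame(X), exact = TRUE)
```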