shapr

Brief NEWS

This is shapr version 1.0.0 (released on GitHub in November 2024), which provides a full restructuring of the code base and a full suite of new functionality, including:

  • A long list of approaches for estimating the contribution/value function $v(S)$, including Variational Autoencoders, and regression-based methods
  • Iterative Shapley value estimation with convergence detection
  • Parallelized computations with progress updates
  • Reweighted Kernel SHAP for faster convergence
  • New function explain_forecast() for explaining forecasts
  • Several other methodological, computational and user-experience improvements
  • Python wrapper making the core functionality of shapr available in Python
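
As a rough illustration of the explain_forecast() function listed above, the sketch below explains a three-step-ahead forecast from an autoregressive model fitted to the Temp series in airquality. The argument names reflect our reading of the help page for version 1.0.0, and the indices, lags and phi0 values are illustrative choices only; please check ?shapr::explain_forecast before relying on them.

```r
library(shapr)

data("airquality")
data_ts <- data.table::as.data.table(airquality)

# Fit an autoregressive model of order up to 2 (stats::ar selects the order by AIC)
model_ar <- ar(data_ts$Temp, order.max = 2)

# One phi0 value per forecast horizon; here simply the series mean (an assumption)
phi0_ar <- rep(mean(data_ts$Temp), 3)

# Explain the forecasts starting at time points 152 and 153,
# using the two most recent lags of the series as features
explanation_forecast <- explain_forecast(
  model = model_ar,
  y = data_ts[, "Temp"],
  train_idx = 2:151,
  explain_idx = 152:153,
  explain_y_lags = 2,
  horizon = 3,
  approach = "empirical",
  phi0 = phi0_ar,
  group_lags = FALSE
)

print(explanation_forecast$shapley_values_est)
```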

Below we provide a brief overview of the breaking changes. See the NEWS for the full list of details.

Breaking changes

The new syntax for explaining models essentially amounts to using a single function (explain()) instead of two functions (shapr() and explain()). In addition, custom models are now explained by passing the prediction function directly to explain(), some input arguments have new names, and a few functions for edge cases were removed to simplify the code base.
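
To make the custom-model part concrete, here is a minimal sketch under a few assumptions: the class my_linear_model and the helper predict_my_linear_model() are hypothetical names invented for this illustration, and we assume the prediction function is passed through the predict_model argument of explain() with the signature function(x, newdata), as described in ?shapr::explain.

```r
library(shapr)

# Same data as in the example further down, with all columns converted to numeric
aq <- airquality[complete.cases(airquality), ]
aq[] <- lapply(aq, as.numeric)
data <- data.table::as.data.table(aq)

x_var <- c("Solar.R", "Wind", "Temp", "Month")
ind_x_explain <- 1:6
x_train <- data[-ind_x_explain, ..x_var]
y_train <- data$Ozone[-ind_x_explain]
x_explain <- data[ind_x_explain, ..x_var]

# A hypothetical "custom" model: a bare list of linear coefficients with its own class,
# i.e. a model class that shapr does not support natively
fit <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = data[-ind_x_explain, ])
custom_model <- structure(
  list(coefficients = coef(fit), feature_names = x_var),
  class = "my_linear_model"
)

# Prediction function: takes the model and new data, returns a numeric vector
predict_my_linear_model <- function(x, newdata) {
  X <- as.matrix(as.data.frame(newdata)[, x$feature_names, drop = FALSE])
  as.vector(cbind(1, X) %*% x$coefficients)
}

# A single call to explain(), with the prediction function passed directly
explanation_custom <- explain(
  model = custom_model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "empirical",
  phi0 = mean(y_train),
  predict_model = predict_my_linear_model
)

print(explanation_custom$shapley_values_est)
```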

Note that the CRAN version of shapr (v0.2.2) still uses the old syntax. The examples below use the new syntax. Here is a version of this README with the syntax of the CRAN version (v0.2.2).

Python wrapper

We now also provide a Python wrapper (shaprpy) which allows explaining python models with the methodology implemented in shapr, directly from Python. The wrapper is available here.

The package

The shapr R package implements an enhanced version of the Kernel SHAP method for approximating Shapley values, with a strong focus on conditional Shapley values. The core idea is to remain completely model-agnostic while offering a variety of methods for estimating contribution functions, enabling accurate computation of conditional Shapley values across different feature types, dependencies, and distributions. The package also includes evaluation metrics to compare various approaches. With features like parallelized computations, convergence detection, progress updates, and extensive plotting options, shapr is a highly efficient and user-friendly tool, delivering precise estimates of conditional Shapley values, which are critical for understanding how features truly contribute to predictions.
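
For readers new to Shapley values, the following base R sketch computes them exactly from the textbook definition for a tiny game with three players. It is not how shapr works internally (the package estimates the contribution function v(S) from data and approximates the values with Kernel SHAP); the function shapley_exact() and the toy value function toy_v() are hypothetical names used only for this illustration.

```r
# Exact Shapley values for a cooperative game with m players (m >= 2).
# v: function mapping an integer vector of player indices (a coalition S) to a number.
shapley_exact <- function(v, m) {
  phi <- numeric(m)
  # Each row of `masks` marks one subset of the m - 1 other players
  masks <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), m - 1)))
  for (j in seq_len(m)) {
    others <- setdiff(seq_len(m), j)
    for (r in seq_len(nrow(masks))) {
      S <- others[masks[r, ]]
      # Shapley weight |S|! (m - |S| - 1)! / m!
      weight <- factorial(length(S)) * factorial(m - length(S) - 1) / factorial(m)
      # Weighted marginal contribution of player j when joining coalition S
      phi[j] <- phi[j] + weight * (v(c(S, j)) - v(S))
    }
  }
  phi
}

# Toy value function: additive main effects plus a bonus when players 1 and 2 cooperate
toy_v <- function(S) {
  main_effects <- c(1, 2, 3)
  sum(main_effects[S]) + 2 * all(c(1, 2) %in% S)
}

phi <- shapley_exact(toy_v, m = 3)
print(phi)

# Efficiency property: the values sum to v(all players) - v(empty coalition)
print(all.equal(sum(phi), toy_v(1:3) - toy_v(integer(0))))
```

In this toy game the interaction bonus is shared equally between players 1 and 2, which is the kind of attribution behaviour that carries over to features when v(S) is a conditional expectation of the model prediction.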

A basic example is provided below. For details and further examples, see the pkgdown website and the vignettes there.

Installation

We highly recommend installing the development version of shapr (with the new explanation syntax and all functionality):

remotes::install_github("NorskRegnesentral/shapr")

To also install all dependencies, use

remotes::install_github("NorskRegnesentral/shapr", dependencies = TRUE)

The CRAN version of shapr (NOT RECOMMENDED) can be installed with

install.packages("shapr")

Example

shapr supports computation of Shapley values with any predictive model which takes a set of numeric features and produces a numeric outcome.

The following example shows how a simple xgboost model is trained using the airquality dataset, and how shapr explains the individual predictions.

We first enable parallel computation and progress updates with the following code chunk. These are optional, but recommended for improved performance and user friendliness, particularly for problems with many features.

# Enable parallel computation
# Requires the future and future.apply packages
future::plan("multisession", workers = 2) # Increase the number of workers for increased performance with many features

# Enable progress updates of the v(S)-computations
# Requires the progressr package
progressr::handlers(global = TRUE)
progressr::handlers("cli") # Using the cli package as backend (recommended for the estimates of the remaining time)

Here is the actual example:

library(xgboost)
library(shapr)

data("airquality")
data <- data.table::as.data.table(airquality)
data <- data[complete.cases(data), ]

x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"

ind_x_explain <- 1:6
x_train <- data[-ind_x_explain, ..x_var]
y_train <- data[-ind_x_explain, get(y_var)]
x_explain <- data[ind_x_explain, ..x_var]

# Looking at the dependence between the features
cor(x_train)
#>            Solar.R       Wind       Temp      Month
#> Solar.R  1.0000000 -0.1243826  0.3333554 -0.0710397
#> Wind    -0.1243826  1.0000000 -0.5152133 -0.2013740
#> Temp     0.3333554 -0.5152133  1.0000000  0.3400084
#> Month   -0.0710397 -0.2013740  0.3400084  1.0000000

# Fitting a basic xgboost model to the training data
model <- xgboost(
  data = as.matrix(x_train),
  label = y_train,
  nrounds = 20,
  verbose = FALSE
)

# Specifying the phi_0, i.e. the expected prediction without any features
p0 <- mean(y_train)

# Computing the actual Shapley values with kernelSHAP accounting for feature dependence using
# the empirical (conditional) distribution approach with bandwidth parameter sigma = 0.1 (default)
explanation <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "empirical",
  phi0 = p0
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16, 
#> and is therefore set to 2^n_features = 16.
#> 
#> ── Starting `shapr::explain()` at 2024-11-20 12:23:18 ──────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: empirical
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 6
#> • Computations (temporary) saved at:
#> '/tmp/Rtmp4yBCHY/shapr_obj_17459f7fdc4b8f.rds'
#> 
#> ── Main computation started ──
#> 
#> ℹ Using 16 of 16 coalitions.

# Printing the Shapley values for the test data.
# For more information about the interpretation of the values in the table, see ?shapr::explain.
print(explanation$shapley_values_est)
#>    explain_id     none    Solar.R      Wind      Temp      Month
#>         <int>    <num>      <num>     <num>     <num>      <num>
#> 1:          1 43.08571 13.2117337  4.785645 -25.57222  -5.599230
#> 2:          2 43.08571 -9.9727747  5.830694 -11.03873  -7.829954
#> 3:          3 43.08571 -2.2916185 -7.053393 -10.15035  -4.452481
#> 4:          4 43.08571  3.3254595 -3.240879 -10.22492  -6.663488
#> 5:          5 43.08571  4.3039571 -2.627764 -14.15166 -12.266855
#> 6:          6 43.08571  0.4786417 -5.248686 -12.55344  -6.645738

# Finally we plot the resulting explanations
plot(explanation)
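
A useful sanity check when reading the table above is the efficiency property of Shapley values: for each observation, the none column (phi0) plus the feature-wise values adds up to the model's prediction. The base R sketch below re-enters the printed values by hand (so it runs without shapr) and reconstructs the six predictions; the object names are our own.

```r
# Shapley values copied from the printed output above
shapley_values <- data.frame(
  none    = rep(43.08571, 6),
  Solar.R = c(13.2117337, -9.9727747, -2.2916185, 3.3254595, 4.3039571, 0.4786417),
  Wind    = c(4.785645, 5.830694, -7.053393, -3.240879, -2.627764, -5.248686),
  Temp    = c(-25.57222, -11.03873, -10.15035, -10.22492, -14.15166, -12.55344),
  Month   = c(-5.599230, -7.829954, -4.452481, -6.663488, -12.266855, -6.645738)
)

# phi0 plus the sum of the feature contributions gives the prediction for each row
reconstructed_pred <- rowSums(shapley_values)
print(reconstructed_pred)

# Feature with the largest absolute contribution for each explained observation
feature_cols <- setdiff(names(shapley_values), "none")
top_feature <- feature_cols[apply(abs(shapley_values[, feature_cols]), 1, which.max)]
print(top_feature)
```

In a live session the same check can be done against the model directly, for instance by comparing the row sums of explanation$shapley_values_est (excluding explain_id) with predict(model, as.matrix(x_explain)); small differences are expected because the printed values are rounded.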

See the vignette for further basic usage examples.
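
The package description mentions evaluation metrics and several plotting options. The sketch below shows one way these could be combined, under stated assumptions: it uses a linear model (lm) instead of xgboost to keep dependencies light, compares the "empirical" and "gaussian" approaches with plot_MSEv_eval_crit(), and draws a waterfall plot for two observations. The plot_type and index_x_explain arguments are taken from our reading of ?shapr::plot.shapr, and the plots require the ggplot2 package.

```r
library(shapr)

# Data preparation as in the example above, with all columns converted to numeric
aq <- airquality[complete.cases(airquality), ]
aq[] <- lapply(aq, as.numeric)
data <- data.table::as.data.table(aq)

x_var <- c("Solar.R", "Wind", "Temp", "Month")
ind_x_explain <- 1:6
x_train <- data[-ind_x_explain, ..x_var]
y_train <- data$Ozone[-ind_x_explain]
x_explain <- data[ind_x_explain, ..x_var]

# A simple linear model (illustrative choice, not the model used in the example above)
model_lm <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = data[-ind_x_explain, ])
p0 <- mean(y_train)

# Explain the same predictions with two different approaches for estimating v(S)
expl_empirical <- explain(
  model = model_lm, x_explain = x_explain, x_train = x_train,
  approach = "empirical", phi0 = p0
)
expl_gaussian <- explain(
  model = model_lm, x_explain = x_explain, x_train = x_train,
  approach = "gaussian", phi0 = p0
)

# Compare the approaches using the MSEv evaluation criterion (lower is better)
plot_MSEv_eval_crit(list(empirical = expl_empirical, gaussian = expl_gaussian))

# Waterfall plot for the first two explained observations
plot(expl_gaussian, plot_type = "waterfall", index_x_explain = 1:2)
```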

Contribution

All feedback and suggestions are very welcome. Details on how to contribute can be found here. If you have any questions or comments, feel free to open an issue here.

Please note that the ‘shapr’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Package details

  • Version: 1.0.1
  • License: MIT + file LICENSE
  • Maintainer: Martin Jullum
  • Last published: January 16th, 2025
  • Monthly downloads: 1,892

Functions in shapr (1.0.1)

  • compute_MSEv_eval_crit: Mean Squared Error of the Contribution Function v(S)
  • cli_iter: Printing messages in iterative procedure with cli
  • compute_time: Gathers and computes the timing of the different parts of the explain function.
  • compute_shapley: Compute Shapley values
  • correction_matrix_cpp: Correction term with trace_input in AICc formula
  • cli_startup: Printing startup messages with cli
  • create_coalition_table: Define coalitions, and fetch additional information about each unique coalition
  • create_marginal_data_training: Function that samples data from the empirical marginal training distribution
  • create_marginal_data_cat: Create marginal categorical data for causal Shapley values
  • create_marginal_data_gaussian: Generate marginal Gaussian data using Cholesky decomposition
  • default_doc_internal: Unexported documentation helper function.
  • default_doc_export: Exported documentation helper function.
  • exact_coalition_table: Get table with all (exact) coalitions
  • create_ctree: Build all the conditional inference trees
  • explain: Explain the output of machine learning models with dependence-aware (conditional/observational) Shapley values
  • explain_forecast: Explain a forecast from time series models with dependence-aware (conditional/observational) Shapley values
  • finalize_explanation: Gathers the final output to create the explanation object
  • gauss_cat_loss: A torch::nn_module() Representing a gauss_cat_loss
  • gaussian_transform: Transforms a sample to standardized normal distribution
  • get_output_args_default: Gets the default values for the output arguments
  • get_extra_comp_args_default: Gets the default values for the extra estimation arguments
  • get_predict_model: Get predict_model function
  • get_data_specs: Fetches feature information from a given data set
  • gauss_cat_sampler_random: A torch::nn_module() Representing a gauss_cat_sampler_random
  • get_S_causal_steps: Get the steps for generating MC samples for coalitions following a causal ordering
  • get_iterative_args_default: Function to specify arguments of the iterative estimation procedure
  • mcar_mask_generator: Missing Completely at Random (MCAR) Mask Generator
  • get_cov_mat: get_cov_mat
  • gauss_cat_sampler_most_likely: A torch::nn_module() Representing a gauss_cat_sampler_most_likely
  • mahalanobis_distance_cpp: (Generalized) Mahalanobis distance
  • get_mu_vec: get_mu_vec
  • get_data_forecast: Set up data for explain_forecast
  • gaussian_transform_separate: Transforms new data to standardized normal (dimension 1) based on other data transformations
  • gauss_cat_parameters: A torch::nn_module() Representing a gauss_cat_parameters
  • get_supported_approaches: Gets the implemented approaches
  • get_supported_models: Provides a data.table with the supported models
  • get_feature_specs: Gets the feature specifications from the model
  • get_model_specs: Fetches feature information from natively supported models
  • get_extra_parameters: This includes both extra parameters and other objects
  • get_max_n_coalitions_causal: Get the number of coalitions that respect the causal ordering
  • lag_data: Lag a matrix of variables a specific number of lags for each variable.
  • inv_gaussian_transform_cpp: Transforms new data to a standardized normal distribution
  • get_valid_causal_coalitions: Get all coalitions satisfying the causal ordering
  • plot_vaeac_eval_crit: Plot the training VLB and validation IWAE for vaeac models
  • hat_matrix_cpp: Computing single H matrix in AICc-function using the Mahalanobis distance
  • paired_sampler: Sampling Paired Observations
  • model_checker: Check that the type of model is supported by the native implementation of the model class
  • memory_layer: A torch::nn_module() Representing a Memory Layer
  • plot.shapr: Plot of the Shapley value explanations
  • plot_SV_several_approaches: Shapley value bar plots for several explanation objects
  • plot_MSEv_eval_crit: Plots of the MSEv Evaluation Criterion
  • observation_impute: Generate permutations of training data using test observations
  • prepare_data_gaussian_cpp_caus: Generate Gaussian MC samples for the causal setup with a single MC sample for each explicand
  • prepare_data_gaussian_cpp: Generate Gaussian MC samples
  • plot_vaeac_imputed_ggpairs: Plot Pairwise Plots for Imputed and True Data
  • prepare_data_copula_cpp_caus: Generate (Gaussian) Copula MC samples for the causal setup with a single MC sample for each explicand
  • prepare_data_single_coalition: Compute the conditional probabilities for a single coalition for the categorical approach
  • print.shapr: Print method for shapr objects
  • predict_model: Generate predictions for input data with specified model
  • prepare_data: Generate data used for predictions and Monte Carlo integration
  • prepare_data_causal: Generate data used for predictions and Monte Carlo integration for causal Shapley values
  • prepare_next_iteration: Prepares the next iteration of the iterative sampling algorithm
  • observation_impute_cpp: Get imputed data
  • sample_coalitions_cpp_str_paired: We here return a vector of strings/characters, i.e., a CharacterVector, where each string is a space-separated list of integers.
  • regression.check_namespaces: Check that needed libraries are installed
  • prepare_data_copula_cpp: Generate (Gaussian) Copula MC samples
  • regression.get_tune: Get if model is to be tuned
  • regression.get_string_to_R: Convert the string into an R object
  • regression.check_parameters: Check regression parameters
  • regression.check_recipe_func: Check regression.recipe_func
  • sample_combinations: Helper function to sample a combination of training and testing rows, which does not risk getting the same observation twice. Need to improve this help file.
  • save_results: Saves the intermediate results to disk
  • quantile_type7_cpp: Compute the quantiles using quantile type seven
  • sample_ctree: Sample ctree variables from a given conditional inference tree
  • specified_prob_mask_generator: A torch::nn_module() Representing a specified_prob_mask_generator
  • vaeac_check_activation_func: Function that checks the provided activation function
  • rss_cpp: Function for computing sigma_hat_sq
  • setup: check_setup
  • regression.get_y_hat: Get the predicted responses
  • specified_masks_mask_generator: A torch::nn_module() Representing a specified_masks_mask_generator
  • vaeac_check_epoch_values: Function that checks provided epoch arguments
  • vaeac_check_extra_named_list: Check vaeac.extra_parameters list
  • regression.surrogate_aug_data: Augment the training data and the explicands
  • reg_forecast_setup: Set up exogenous regressors for explanation in a forecast model.
  • regression.check_sur_n_comb: Check the regression.surrogate_n_comb parameter
  • skip_connection: A torch::nn_module() Representing a skip connection
  • sample_coalition_table: Get table with sampled coalitions
  • vaeac_check_cuda: Function that checks for access to CUDA
  • vaeac_categorical_parse_params: Creates Categorical Distributions
  • shapr-package: shapr: Prediction Explanation with Dependence-Aware Shapley Values
  • vaeac: Initializing a vaeac model
  • test_predict_model: Model testing function
  • vaeac_check_masking_ratio: Function that checks that the masking ratio argument is valid
  • setup_approach: Set up the framework for the chosen approach
  • vaeac_check_x_colnames: Function that checks the feature names of data and vaeac model
  • vaeac_check_probabilities: Function that checks probabilities
  • vaeac_check_logicals: Function that checks logicals
  • print_iter: Prints iterative information
  • regression.check_vfold_cv_para
  • vaeac_check_mask_gen: Function that checks the specified masking scheme
  • regression.cv_message: Produce message about which batch prepare_data is working on
  • process_factor_data: Treat factors as numeric values
  • testing_cleanup: Cleans out certain output arguments to allow perfect reproducibility of the output
  • vaeac_get_extra_para_default: Function to specify the extra parameters in the vaeac model
  • regression.train_model: Train a tidymodels model via workflows
  • vaeac_get_val_iwae: Compute the Importance Sampling Estimator (Validation Error)
  • vaeac_get_evaluation_criteria: Extract the Training VLB and Validation IWAE from a list of explanations objects using the vaeac approach
  • release_questions: Auxiliary function for the vignettes
  • vaeac_check_parameters: Function that calls all vaeac parameters check functions
  • vaeac_dataset: Dataset used by the vaeac model
  • shapley_weights: Calculate Shapley weight
  • vaeac_get_current_save_state: Function that extracts additional objects from the environment to the state list
  • vaeac_check_save_names: Function that checks that the save folder exists and for a valid file name
  • vaeac_check_positive_integers: Function that checks positive integers
  • vaeac_check_positive_numerics: Function that checks positive numerics
  • shapley_setup: Set up the kernelSHAP framework
  • vaeac_extend_batch: Extends Incomplete Batches by Sampling Extra Data from Dataloader
  • vaeac_compute_normalization: Compute Featurewise Means and Standard Deviations
  • vaeac_get_data_objects: Function to set up data loaders and save file names
  • vaeac_check_save_parameters: Function that gives a warning about disk usage
  • vaeac_get_save_file_names: Function that creates the save file names for the vaeac model
  • vaeac_check_which_vaeac_model: Function that checks for valid vaeac model name
  • vaeac_print_train_summary: Function to printout a training summary for the vaeac model
  • vaeac_get_optimizer: Function to create the optimizer used to train vaeac
  • vaeac_train_model_continue: Continue to Train the vaeac Model
  • vaeac_get_mask_generator_name: Function that determines which mask generator to use
  • vaeac_get_full_state_list: Function that extracts the state list objects from the environment
  • vaeac_update_para_locations: Move vaeac parameters to correct location
  • vaeac_get_model_from_checkp: Function to load a vaeac model and set it in the right state and mode
  • vaeac_get_n_decimals: Function to get string of values with specific number of decimals
  • vaeac_impute_missing_entries: Impute Missing Values Using Vaeac
  • vaeac_normal_parse_params: Creates Normal Distributions
  • vaeac_normalize_data: Normalize mixed data for vaeac
  • vaeac_save_state: Function that saves the state list and the current save state of the vaeac model
  • vaeac_get_x_explain_extended: Function to extend the explicands and apply all relevant masks/coalitions
  • vaeac_kl_normal_normal: Compute the KL Divergence Between Two Gaussian Distributions.
  • vaeac_train_model_auxiliary: Function used to train a vaeac model
  • vaeac_postprocess_data: Postprocess Data Generated by a vaeac Model
  • vaeac_train_model: Train the Vaeac Model
  • vaeac_update_pretrained_model: Function that checks and adds a pre-trained vaeac model
  • vaeac_preprocess_data: Preprocess Data for the vaeac approach
  • weight_matrix: Calculate weighted matrix
  • weight_matrix_cpp: Calculate weight matrix
  • additional_regression_setup: Additional setup for regression-based methods
  • check_verbose: Function that checks the verbose parameter
  • aicc_full_single_cpp: Temp-function for computing the full AICc with several X's etc
  • append_vS_list: Appends the new vS_list to the prev vS_list
  • check_groups: Check that the group parameter has the right form and content
  • cli_compute_vS: Printing messages in compute_vS with cli
  • categorical_to_one_hot_layer: A torch::nn_module() Representing a categorical_to_one_hot_layer
  • check_convergence: Checks the convergence according to the convergence threshold
  • aicc_full_cpp: AICc formula for several sets, alternative definition
  • compute_vS: Computes v(S) for all feature subsets S.
  • coalition_matrix_cpp: Get coalition matrix
  • check_categorical_valid_MCsamp: Check that all explicands have at least one valid MC sample in causal Shapley values
  • compute_estimates: Computes the Shapley values and their standard deviation given the v(S)
  • convert_feature_name_to_idx: Convert feature names into feature indices