
HMDA (version 0.2.0)

hmda.compare.shap.plot: Compare SHAP plots across selected models

Description

Produces side-by-side comparison plots of SHAP contributions for multiple models. Models can be provided explicitly via model_id, or selected automatically from an hmda.grid.analysis data frame using hmda.best.models() for each metric in metrics. Two plot styles are supported:

"shap"

H2O SHAP summary plot (beeswarm-style) for each model.

"bar"

Bar plot based on a single-model shapley::shapley() run.
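A minimal sketch of the two calling patterns, assuming the objects grid_performance, best_models, and test already exist (they are created in the Examples below):

# One best model per metric, selected automatically from the grid analysis
hmda.compare.shap.plot(grid_performance, newdata = test,
                       metrics = c("aucpr", "mcc"), plot = "shap")

# Explicit model IDs; hmda.grid.analysis and metrics are then ignored
hmda.compare.shap.plot(grid_performance, newdata = test,
                       model_id = best_models$model_ids[1:2], plot = "bar")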

Usage

hmda.compare.shap.plot(
  hmda.grid.analysis,
  newdata = NULL,
  model_id = NULL,
  metrics = c("aucpr", "mcc", "f2"),
  plot = "shap",
  top_n_features = 4,
  ylimits = c(-1, 1)
)

Value

A gtable (grob) object returned by gridExtra::grid.arrange(), combining the individual plots. The combined plot is also drawn to the active graphics device.
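A minimal sketch of reusing the returned object, assuming p holds the result of an earlier hmda.compare.shap.plot() call; because the gtable is an ordinary grob, it can be re-drawn or written to a graphics device:

# Re-draw the combined plot on a new page
grid::grid.newpage()
grid::grid.draw(p)

# Write it to a PNG file
png("shap_comparison.png", width = 1200, height = 600)
grid::grid.draw(p)
dev.off()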

Arguments

hmda.grid.analysis

A data frame of class "hmda.grid.analysis" containing model evaluation results and a model_ids column. Used only when model_id is NULL.

newdata

An H2OFrame used for SHAP computation. Required for both plot types.

model_id

Optional character vector of H2O model IDs. If provided, the function compares these models directly and ignores hmda.grid.analysis and metrics.

metrics

Character vector of metric names used to select the best model per metric from hmda.grid.analysis via hmda.best.models(..., n_models = 1).

plot

Character. Plot type: "shap" (default) or "bar".

top_n_features

Integer. Number of top features shown in each plot.

ylimits

Numeric vector of length 2 giving the y-axis limits when plot = "shap". The default, c(-1, 1), is chosen purely for aesthetic reasons so that the plots are directly comparable; consider widening these limits to suit your data.
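A hedged sketch of choosing data-driven limits instead of the narrow default: it inspects the SHAP contributions of one model via h2o.predict_contributions() (which returns per-row contributions plus a BiasTerm column) and assumes grid_performance, best_models, and test from the Examples below:

# Range of SHAP contributions for one model (excluding the bias term)
m <- h2o.getModel(best_models$model_ids[1])
contrib <- as.data.frame(h2o.predict_contributions(m, test))
shap_range <- range(contrib[, setdiff(names(contrib), "BiasTerm")])

hmda.compare.shap.plot(hmda.grid.analysis = grid_performance,
                       newdata = test,
                       plot = "shap",
                       ylimits = shap_range)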

Author

E. F. Haghish

Details

When model_id is NULL, the function selects one model per metric from metrics using hmda.best.models(). When model_id is provided, models are labeled as "Model 1", "Model 2", etc.

Examples

if (FALSE) {
  library(HMDA)
  library(h2o)
  hmda.init()

  # Import a sample binary outcome dataset into H2O
  train <- h2o.importFile(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
  test <- h2o.importFile(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")

  # Identify predictors and response
  y <- "response"
  x <- setdiff(names(train), y)

  # For binary classification, response should be a factor
  train[, y] <- as.factor(train[, y])
  test[, y] <- as.factor(test[, y])

  params <- list(learn_rate = c(0.01, 0.1),
                 max_depth = c(3, 5, 9),
                 sample_rate = c(0.8, 1.0)
  )

  # Train and validate a cartesian grid of GBMs
  hmda_grid1 <- hmda.grid(algorithm = "gbm", x = x, y = y,
                          grid_id = "hmda_grid1",
                          training_frame = train,
                          nfolds = 10,
                          ntrees = 100,
                          seed = 1,
                          hyper_params = params)

  # Assess the performances of the models
  grid_performance <- hmda.grid.analysis(hmda_grid1)

  # Compare the best models according to each performance metric
  hmda.compare.shap.plot(hmda.grid.analysis = grid_performance,
                         newdata = test,
                         metrics = c("aucpr", "mcc", "f2"),
                         plot = "bar",
                         top_n_features = 5)

  # Return the best 2 models according to each metric
  best_models <- hmda.best.models(grid_performance, n_models = 2)

  # Compare the specified models based on their model_ids
  hmda.compare.shap.plot(hmda.grid.analysis = grid_performance,
                         model_id = best_models$model_ids[1:3],
                         newdata = test,
                         metrics = c("aucpr", "mcc", "f2"),
                         plot = "bar",
                         top_n_features = 5)
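
  # A hedged sketch (not part of the original example set): the default
  # beeswarm-style "shap" plots, with wider y-axis limits than the
  # default c(-1, 1); adjust ylimits to match your data.
  hmda.compare.shap.plot(hmda.grid.analysis = grid_performance,
                         newdata = test,
                         metrics = c("aucpr", "mcc", "f2"),
                         plot = "shap",
                         top_n_features = 5,
                         ylimits = c(-3, 3))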

}
