
HMDA (version 0.2.0)

hmda.compare.shap.plot: Compare SHAP plots across selected models

Description

Produces side-by-side comparison plots of SHAP contributions for multiple models. Models can be provided explicitly via model_id, or selected automatically from an hmda.grid.analysis data frame using hmda.best.models() for each metric in metrics. Two plot styles are supported:

"shap"

H2O SHAP summary plot (beeswarm-style) for each model.

"bar"

Bar plot based on a single-model shapley::shapley() run.
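A minimal sketch of the two calling patterns, assuming the objects grid_performance, best_models, and test already exist (they are created in the Examples below):

# One best model per metric, selected automatically from the grid analysis
hmda.compare.shap.plot(grid_performance, newdata = test,
                       metrics = c("aucpr", "mcc"), plot = "shap")

# Explicit model IDs; hmda.grid.analysis and metrics are then ignored
hmda.compare.shap.plot(grid_performance, newdata = test,
                       model_id = best_models$model_ids[1:2], plot = "bar")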

Usage

hmda.compare.shap.plot(
  hmda.grid.analysis,
  newdata = NULL,
  model_id = NULL,
  metrics = c("aucpr", "mcc", "f2"),
  plot = "shap",
  top_n_features = 4,
  ylimits = c(-1, 1)
)

Value

A gtable (grob) object returned by gridExtra::grid.arrange(), combining the individual plots. The combined plot is also drawn to the active graphics device.
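A minimal sketch of reusing the returned object, assuming p holds the result of an earlier hmda.compare.shap.plot() call; because the gtable is an ordinary grob, it can be re-drawn or written to a graphics device:

# Re-draw the combined plot on a new page
grid::grid.newpage()
grid::grid.draw(p)

# Write it to a PNG file
png("shap_comparison.png", width = 1200, height = 600)
grid::grid.draw(p)
dev.off()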

Arguments

hmda.grid.analysis

A data frame of class "hmda.grid.analysis" containing model evaluation results and a model_ids column. Used only when model_id is NULL.

newdata

An H2OFrame used for SHAP computation. Required for both plot types.

model_id

Optional character vector of H2O model IDs. If provided, the function compares these models directly and ignores hmda.grid.analysis and metrics.

metrics

Character vector of metric names used to select the best model per metric from hmda.grid.analysis via hmda.best.models(..., n_models = 1).

plot

Character. Plot type: "shap" (default) or "bar".

top_n_features

Integer. Number of top features shown in each plot.

ylimits

Numeric vector of length 2 giving the y-axis limits when plot = "shap". The default, c(-1, 1), is chosen purely for aesthetic reasons so that the plots are directly comparable; consider widening these limits to suit your data.
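A hedged sketch of choosing data-driven limits instead of the narrow default: it inspects the SHAP contributions of one model via h2o.predict_contributions() (which returns per-row contributions plus a BiasTerm column) and assumes grid_performance, best_models, and test from the Examples below:

# Range of SHAP contributions for one model (excluding the bias term)
m <- h2o.getModel(best_models$model_ids[1])
contrib <- as.data.frame(h2o.predict_contributions(m, test))
shap_range <- range(contrib[, setdiff(names(contrib), "BiasTerm")])

hmda.compare.shap.plot(hmda.grid.analysis = grid_performance,
                       newdata = test,
                       plot = "shap",
                       ylimits = shap_range)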

Author

E. F. Haghish

Details

When model_id is NULL, the function selects one model per metric from metrics using hmda.best.models(). When model_id is provided, models are labeled as "Model 1", "Model 2", etc.

Examples

if (FALSE) {
  library(HMDA)
  library(h2o)
  hmda.init()

  # Import a sample binary outcome dataset into H2O
  train <- h2o.importFile(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
  test <- h2o.importFile(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")

  # Identify predictors and response
  y <- "response"
  x <- setdiff(names(train), y)

  # For binary classification, response should be a factor
  train[, y] <- as.factor(train[, y])
  test[, y] <- as.factor(test[, y])

  params <- list(learn_rate = c(0.01, 0.1),
                 max_depth = c(3, 5, 9),
                 sample_rate = c(0.8, 1.0)
  )

  # Train and validate a cartesian grid of GBMs
  hmda_grid1 <- hmda.grid(algorithm = "gbm", x = x, y = y,
                          grid_id = "hmda_grid1",
                          training_frame = train,
                          nfolds = 10,
                          ntrees = 100,
                          seed = 1,
                          hyper_params = params)

  # Assess the performances of the models
  grid_performance <- hmda.grid.analysis(hmda_grid1)

  # Compare the best models according to each performance metric
  hmda.compare.shap.plot(hmda.grid.analysis = grid_performance,
                         newdata = test,
                         metrics = c("aucpr", "mcc", "f2"),
                         plot = "bar",
                         top_n_features = 5)

  # Return the best 2 models according to each metric
  best_models <- hmda.best.models(grid_performance, n_models = 2)

  # Compare the specified models based on their model_ids
  hmda.compare.shap.plot(hmda.grid.analysis = grid_performance,
                         model_id = best_models$model_ids[1:3],
                         newdata = test,
                         metrics = c("aucpr", "mcc", "f2"),
                         plot = "bar",
                         top_n_features = 5)
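
  # A hedged sketch (not part of the original example set): the default
  # beeswarm-style "shap" plots, with wider y-axis limits than the
  # default c(-1, 1); adjust ylimits to match your data.
  hmda.compare.shap.plot(hmda.grid.analysis = grid_performance,
                         newdata = test,
                         metrics = c("aucpr", "mcc", "f2"),
                         plot = "shap",
                         top_n_features = 5,
                         ylimits = c(-3, 3))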

}
