hmda.feature.selection: Feature Selection Based on Weighted SHAP Values

Description

This function selects "important", "inessential", and "irrelevant" features based on a summary of weighted mean SHAP values obtained from a prior analysis. It uses the SHAP summary table (found in the wmshap object) to identify features that are deemed important according to a specified method and cutoff. Features with a lower confidence interval (lowerCI) below zero are labeled as "irrelevant", while the remaining features are classified as "inessential" if they do not meet the importance criteria.

Usage

hmda.feature.selection(
  wmshap,
  method = c("mean"),
  cutoff = 0.01,
  top_n_features = NULL
)

Value

A list with three elements:

important: A character vector of features deemed important.
inessential: A character vector of features considered inessential (present in the data but not meeting the importance criteria).
irrelevant: A character vector of features deemed irrelevant, defined as those with a lower confidence interval (lowerCI) below zero.

Arguments

wmshap: A list object (typically returned by a weighted SHAP analysis) that must contain a data frame summaryShaps with at least the columns "feature", "mean", and "lowerCI". It may also contain additional columns for alternative selection methods.
method: Character. Specify the method for selecting important features based on their weighted mean SHAP ratios. The default is "mean", which selects features whose weighted mean shap ratio (WMSHAP) exceeds the cutoff. The alternative is "lowerCI", which selects features whose lower bound of confidence interval exceeds the cutoff.
cutoff: Numeric. The threshold cutoff for the selection method. Features with a weighted SHAP value (or ratio) greater than or equal to this value are considered important. Default is 0.01.
top_n_features: Integer. If specified, the function selects the top top_n_features features (based on the sorted SHAP mean values), overriding the cutoff and method arguments. If NULL, all features that meet the cutoff criteria are used. Default is NULL.

Author

E. F. Haghish

Details

The function performs the following steps:

Retrieves the SHAP summary table from the wmshap object.
Sorts the summary table in descending order based on the mean SHAP value.
Identifies all features available in the summary.
Classifies features as irrelevant if their lowerCI value is below zero.
If top_n_features is not specified, selects important features as those whose value for the specified method column meets or exceeds the cutoff; the remaining features (excluding those marked as irrelevant) are classified as inessential.
If top_n_features is provided, the function selects the top n features (based on the sorted order) as important, with the rest (excluding irrelevant ones) being inessential.

Examples

Run this code

if (FALSE) {
library(HMDA)
library(h2o)
hmda.init()
h2o.removeAll()

# Import a sample binary outcome dataset into H2O
train <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
test <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")

# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)

# For binary classification, response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])

params <- list(learn_rate = c(0.01, 0.1),
               max_depth = c(3, 5, 9),
               sample_rate = c(0.8, 1.0)
)

# Train and validate a cartesian grid of GBMs
hmda_grid1 <- hmda.grid(algorithm = "gbm", x = x, y = y,
                        grid_id = "hmda_grid1",
                        training_frame = train,
                        nfolds = 10,
                        ntrees = 100,
                        seed = 1,
                        hyper_params = gbm_params1)

# Assess the performances of the models
grid_performance <- hmda.grid.analysis(hmda_grid1)

# Return the best 2 models according to each metric
hmda.best.models(grid_performance, n_models = 2)

# build an autoEnsemble model & test it with the testing dataset
meta <- hmda.autoEnsemble(models = hmda_grid1, training_frame = train)
print(h2o.performance(model = meta$model, newdata = test))

# compute weighted mean shap values
wmshap <- hmda.wmshap(models = hmda_grid1,
                      newdata = test,
                      performance_metric = "aucpr",
                      standardize_performance_metric = FALSE,
                      performance_type = "xval",
                      minimum_performance = 0,
                      method = "mean",
                      cutoff = 0.01,
                      plot = TRUE)

# identify the important features
selected <- hmda.feature.selection(wmshap,
                                   method = c("mean"),
                                   cutoff = 0.01)
print(selected)
}

Run the code above in your browser using DataLab