hmda.best.models: Select Best Models Across All Models in HMDA Grid

Description

Scans an HMDA grid analysis data frame for performance metric columns and, for each metric, selects the best-performing models according to the correct optimization direction (lower is better for some metrics; higher is better for others). The function returns a subset of the input data frame containing the union of selected model IDs.

Usage

hmda.best.models(
  df,
  n_models = NULL,
  distance_percentage = NULL,
  metrics = c("logloss", "mae", "mse", "rmse", "rmsle", "mean_per_class_error", "auc",
    "aucpr", "r2", "accuracy", "f1", "mcc", "f2"),
  hyperparam = FALSE
)

Value

A data frame containing the union of selected models across all considered metrics. If hyperparam = FALSE, the output includes model_ids and the metric columns found in df. If hyperparam = TRUE, the output includes all columns from df for the selected models.

Arguments

df: A data frame of class "hmda.grid.analysis" containing model performance results. It must include a column named model_ids.
n_models: Integer. The number of top models to select per metric. If both n_models and distance_percentage are NULL, defaults to 1.
distance_percentage: Numeric in (0, 1). Alternative to n_models. Selects all models within a given percentage distance of the best value for each metric. Specify either n_models or distance_percentage, not both. The selection is direction-aware. For example, for a metric where higher values are better, such as AUC, distance_percentage = 0.01 selects the models whose AUC is at least 99% of the best AUC. For a metric where lower values mean better performance, such as logloss, distance_percentage = 0.01 selects the models whose logloss is within 1% of the lowest logloss.
metrics: Character vector of performance metric column names to consider. Supported metrics are "logloss", "mae", "mse", "rmse", "rmsle", "mean_per_class_error", "auc", "aucpr", "r2", "accuracy", "f1", "mcc", "f2".
hyperparam: Logical. If TRUE, returns all columns for the selected models (including hyperparameters). If FALSE, returns only model_ids plus the selected metric columns.
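
As an illustration of the direction-aware distance rule, here is a minimal base R sketch. The helper within_distance() and the toy data are hypothetical and not part of HMDA, and the thresholds used (best * (1 - d) for higher-is-better metrics, best * (1 + d) for lower-is-better metrics) are an assumed reading of "percentage distance", not necessarily the package's exact computation.

```r
# Hypothetical helper (not part of HMDA): flags values that lie within a
# relative distance d of the best value, respecting the metric's direction.
within_distance <- function(values, d, higher_is_better) {
  if (higher_is_better) {
    best <- max(values, na.rm = TRUE)
    keep <- values >= best * (1 - d)   # assumed threshold for e.g. AUC
  } else {
    best <- min(values, na.rm = TRUE)
    keep <- values <= best * (1 + d)   # assumed threshold for e.g. logloss
  }
  !is.na(keep) & keep                  # treat missing values as not selected
}

# Toy performance table; the numbers are made up for illustration
toy <- data.frame(
  model_ids = c("m1", "m2", "m3", "m4"),
  auc       = c(0.80, 0.795, 0.78, 0.70),
  logloss   = c(0.50, 0.504, 0.52, 0.60),
  stringsAsFactors = FALSE
)

# Models within 1% of the best AUC (higher is better)
auc_keep <- toy$model_ids[within_distance(toy$auc, 0.01, TRUE)]

# Models within 1% of the best logloss (lower is better)
logloss_keep <- toy$model_ids[within_distance(toy$logloss, 0.01, FALSE)]
```

Under these assumptions, both calls keep the two models closest to the best value and drop the rest.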

Author

E. F. Haghish

Details

The function uses a predefined set of H2O performance metrics along with their desired optimization directions:

logloss, mae, mse, rmse, rmsle, mean_per_class_error: Lower values are better.
auc, aucpr, r2, accuracy, f1, mcc, f2: Higher values are better.
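
The direction table above, combined with the "union of selected models" behavior, can be sketched in base R as follows. This is only an approximation for illustration: top_n_union() is a hypothetical helper, the toy data are made up, and the package's own handling of ties, missing values, and output columns may differ.

```r
# Direction lookup copied from the two lists above
lower_is_better  <- c("logloss", "mae", "mse", "rmse", "rmsle",
                      "mean_per_class_error")
higher_is_better <- c("auc", "aucpr", "r2", "accuracy", "f1", "mcc", "f2")

# Hypothetical helper (not part of HMDA): pick the top n models per metric,
# then return the union of those models with the metric columns found in df.
top_n_union <- function(df, n = 1,
                        metrics = c(lower_is_better, higher_is_better)) {
  found <- intersect(metrics, names(df))   # only metrics present in df
  ids <- character(0)
  for (m in found) {
    # Sort descending for higher-is-better metrics, ascending otherwise;
    # na.last = NA drops models with a missing value for this metric.
    ord <- order(df[[m]], decreasing = m %in% higher_is_better, na.last = NA)
    ids <- union(ids, df$model_ids[head(ord, n)])
  }
  df[df$model_ids %in% ids, c("model_ids", found), drop = FALSE]
}

# Toy performance table; the numbers are made up for illustration
toy <- data.frame(
  model_ids = c("m1", "m2", "m3"),
  auc       = c(0.80, 0.85, 0.75),
  logloss   = c(0.40, 0.50, 0.45),
  stringsAsFactors = FALSE
)

best <- top_n_union(toy, n = 1)
```

Here m1 has the lowest logloss and m2 the highest AUC, so the union contains both even though neither is best on every metric.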

Examples

if (FALSE) {
  library(HMDA)
  library(h2o)
  hmda.init()

  # Import a sample binary outcome dataset into H2O
  train <- h2o.importFile(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
  test <- h2o.importFile(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")

  # Identify predictors and response
  y <- "response"
  x <- setdiff(names(train), y)

  # For binary classification, response should be a factor
  train[, y] <- as.factor(train[, y])
  test[, y] <- as.factor(test[, y])

  params <- list(learn_rate = c(0.01, 0.1),
                 max_depth = c(3, 5, 9),
                 sample_rate = c(0.8, 1.0)
  )

  # Train and validate a cartesian grid of GBMs
  hmda_grid1 <- hmda.grid(algorithm = "gbm", x = x, y = y,
                          grid_id = "hmda_grid1",
                          training_frame = train,
                          nfolds = 10,
                          ntrees = 100,
                          seed = 1,
                          hyper_params = params)

  # Assess the performances of the models
  grid_performance <- hmda.grid.analysis(hmda_grid1)

  # Return the best 2 models according to each metric
  hmda.best.models(grid_performance, n_models = 2)

  # Return all models within 2% of the best model for each metric.
  # For metrics where higher is better (e.g., AUC), this means a value of
  # at least 98% of the best model's value.
  hmda.best.models(grid_performance, distance_percentage = 0.02)
}
