HMDA (version 0.2.0)

hmda.autoEnsemble: Build Stacked Ensemble Model Using autoEnsemble R package

Description

Wrapper function in the HMDA package that builds a stacked ensemble model by combining multiple base models. It leverages the autoEnsemble package to stack a set of trained models (e.g., from an HMDA grid) into a stronger meta-learner. For more details on autoEnsemble, see the GitHub repository at https://github.com/haghish/autoEnsemble and the autoEnsemble package on CRAN.
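A minimal call looks like the following; this is a sketch assuming hmda_grid1 and train were created as in the Examples section below.

  ens <- hmda.autoEnsemble(models = hmda_grid1, training_frame = train)
  ens$model  # the stacked ensemble meta-learner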

Usage

hmda.autoEnsemble(
  models,
  training_frame,
  newdata = NULL,
  family = "binary",
  strategy = c("search"),
  model_selection_criteria = c("auc", "aucpr", "mcc", "f2"),
  min_improvement = 1e-05,
  max = NULL,
  top_rank = seq(0.01, 0.99, 0.01),
  stop_rounds = 3,
  reset_stop_rounds = TRUE,
  stop_metric = "auc",
  seed = -1,
  verbatim = FALSE
)

Value

A list containing:

model

The ensemble model built by autoEnsemble.

top_models

A data frame of the top-ranked base models that were used in building the ensemble.
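For instance, a minimal sketch of working with the returned list, assuming hmda_grid1, train, and test exist as in the Examples section:

  ens <- hmda.autoEnsemble(models = hmda_grid1, training_frame = train)

  ens$model       # the stacked ensemble built by autoEnsemble
  ens$top_models  # data frame of the top-ranked base models

  # Evaluate the ensemble on a holdout set
  print(h2o.performance(model = ens$model, newdata = test))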

Arguments

models

A grid object (e.g., an H2O grid / HMDA grid) or a character vector of model IDs. Both forms are shown in the sketch after this list.

training_frame

An H2OFrame (or data frame already uploaded to the H2O server) that contains the training data used to build the base models.

newdata

Optional H2OFrame used for evaluating the ensemble during stacking.

family

Character string specifying the modeling family (e.g., "binary"). See the autoEnsemble package for full documentation.

strategy

A character vector specifying the ensemble strategy. Currently, the only available strategy is "search" (the default), which searches for the best combination of top-performing, diverse models.

model_selection_criteria

A character vector specifying the performance metrics to consider for model selection. The default is c("auc", "aucpr", "mcc", "f2"). Other possible criteria include "f1point5", "f3", "f4", "f5", "kappa", "mean_per_class_error", "gini", and "accuracy". A custom selection is shown in the sketch after this list.

min_improvement

Numeric. The minimum improvement in the evaluation metric required to continue the ensemble search.

max

Integer. The maximum number of models for each selection criterion. If NULL, a default value based on the top rank percentage is used.

top_rank

Numeric vector. Specifies the percentage (or percentages) of the top models that should be considered for ensemble selection. If the strategy is "search", the function searches for the best combination of models from the top to the bottom ranked; if the strategy is "top", only the first value is used. Default is seq(0.01, 0.99, 0.01).

stop_rounds

Integer. The number of consecutive rounds with no improvement in the performance metric before stopping the search.

reset_stop_rounds

Logical. If TRUE, the stopping rounds counter is reset each time an improvement is observed.

stop_metric

Character. The metric used for early stopping; the default is "auc". Other options include "aucpr" and "mcc".

seed

Integer. A random seed for reproducibility. Default is -1.

verbatim

Logical. If TRUE, the function prints additional progress information for debugging purposes.
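To show how these arguments fit together, here is a hedged sketch of a call that supplies the base models both ways and tunes the search and stopping controls. All values are illustrative, and hmda_grid1 and train are assumed to exist as in the Examples section; this is a sketch, not a prescribed configuration.

  # Base models can be supplied as the grid object itself ...
  ens <- hmda.autoEnsemble(models = hmda_grid1, training_frame = train)

  # ... or as a character vector of model IDs taken from the H2O grid
  ids <- as.character(h2o.getGrid("hmda_grid1")@model_ids)
  ens <- hmda.autoEnsemble(
    models = ids,
    training_frame = train,
    family = "binary",
    strategy = "search",
    model_selection_criteria = c("auc", "kappa"),  # rank models by two criteria only
    top_rank = seq(0.01, 0.50, 0.01),  # consider only the top half of ranked models
    min_improvement = 1e-05,           # continue while the metric improves at least this much
    stop_rounds = 5,                   # stop after 5 rounds without improvement
    reset_stop_rounds = TRUE,          # reset the counter whenever an improvement occurs
    stop_metric = "auc",
    seed = 2025,
    verbatim = TRUE                    # print progress information
  )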

Author

E. F. Haghish

Details

This wrapper function integrates with the HMDA package workflow to build a stacked ensemble model from a set of base models. It calls the ensemble() function from the autoEnsemble package to construct the ensemble. The function is designed to work within HMDA's framework, where base models are generated via grid search or AutoML. For more details on the autoEnsemble approach, see the GitHub repository at https://github.com/haghish/autoEnsemble.

The ensemble strategy "search" (default) searches for the best combination of top-performing and diverse models to improve overall performance. The wrapper returns both the final ensemble model and the list of top-ranked models used in the ensemble.
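For instance, a holdout frame can be passed via newdata so the ensemble is also evaluated during stacking; this is a sketch assuming train and test are prepared as in the Examples below.

  ens <- hmda.autoEnsemble(
    models = hmda_grid1,
    training_frame = train,
    newdata = test    # optional evaluation frame used during stacking
  )
  ens$top_models      # inspect which base models entered the ensemble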

Examples

if (FALSE) {
  library(HMDA)
  library(h2o)
  hmda.init()

  # Import a sample binary outcome dataset into H2O
  train <- h2o.importFile(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
  test <- h2o.importFile(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")

  # Identify predictors and response
  y <- "response"
  x <- setdiff(names(train), y)

  # For binary classification, response should be a factor
  train[, y] <- as.factor(train[, y])
  test[, y] <- as.factor(test[, y])

  # Define a hyperparameter grid for the GBM models
  params <- list(learn_rate = c(0.01, 0.1),
                 max_depth = c(3, 5, 9),
                 sample_rate = c(0.8, 1.0))

  # Train and validate a cartesian grid of GBMs
  hmda_grid1 <- hmda.grid(algorithm = "gbm", x = x, y = y,
                          grid_id = "hmda_grid1",
                          training_frame = train,
                          nfolds = 10,
                          ntrees = 100,
                          seed = 1,
                          hyper_params = params)

  # Assess the performances of the models
  grid_performance <- hmda.grid.analysis(hmda_grid1)

  # Return the best 2 models according to each metric
  hmda.best.models(grid_performance, n_models = 2)

  # Build an autoEnsemble model and test it on the testing dataset
  meta <- hmda.autoEnsemble(models = hmda_grid1, training_frame = train)
  print(h2o.performance(model = meta$model, newdata = test))
}
