HMDA (version 0.2.0)

hmda.autoEnsemble: Build Stacked Ensemble Model Using autoEnsemble R package

Description

Wrapper function in the HMDA package that builds a stacked ensemble model by combining multiple base models. It leverages the autoEnsemble package to stack a set of trained models (e.g., from an HMDA grid) into a stronger meta-learner. For more details on autoEnsemble, see the GitHub repository at https://github.com/haghish/autoEnsemble and the autoEnsemble package on CRAN.
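A minimal call looks like the following; this is a sketch assuming hmda_grid1 and train were created as in the Examples section below.

  ens <- hmda.autoEnsemble(models = hmda_grid1, training_frame = train)
  ens$model  # the stacked ensemble meta-learner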

Usage

hmda.autoEnsemble(
  models,
  training_frame,
  newdata = NULL,
  family = "binary",
  strategy = c("search"),
  model_selection_criteria = c("auc", "aucpr", "mcc", "f2"),
  min_improvement = 1e-05,
  max = NULL,
  top_rank = seq(0.01, 0.99, 0.01),
  stop_rounds = 3,
  reset_stop_rounds = TRUE,
  stop_metric = "auc",
  seed = -1,
  verbatim = FALSE
)

Value

A list containing:

model

The ensemble model built by autoEnsemble.

top_models

A data frame of the top-ranked base models that were used in building the ensemble.
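For instance, a minimal sketch of working with the returned list, assuming hmda_grid1, train, and test exist as in the Examples section:

  ens <- hmda.autoEnsemble(models = hmda_grid1, training_frame = train)

  ens$model       # the stacked ensemble built by autoEnsemble
  ens$top_models  # data frame of the top-ranked base models

  # Evaluate the ensemble on a holdout set
  print(h2o.performance(model = ens$model, newdata = test))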

Arguments

models

A grid object (e.g., an H2O grid / HMDA grid) or a character vector of model IDs. Both forms are shown in the sketch after this list.

training_frame

An H2OFrame (or data frame already uploaded to the H2O server) that contains the training data used to build the base models.

newdata

Optional H2OFrame used for evaluating the ensemble during stacking.

family

Character string specifying the modeling family (e.g., "binary"). See the autoEnsemble package for full documentation.

strategy

A character vector specifying the ensemble strategy. Currently, the only available strategy is "search" (the default), which searches for the best combination of top-performing, diverse models.

model_selection_criteria

A character vector specifying the performance metrics to consider for model selection. The default is c("auc", "aucpr", "mcc", "f2"). Other possible criteria include "f1point5", "f3", "f4", "f5", "kappa", "mean_per_class_error", "gini", and "accuracy". A custom selection is shown in the sketch after this list.

min_improvement

Numeric. The minimum improvement in the evaluation metric required to continue the ensemble search.

max

Integer. The maximum number of models for each selection criterion. If NULL, a default value based on the top rank percentage is used.

top_rank

Numeric vector. Specifies the percentage (or percentages) of the top models that should be considered for ensemble selection. If the strategy is "search", the function searches for the best combination of models from the top to the bottom ranked; if the strategy is "top", only the first value is used. Default is seq(0.01, 0.99, 0.01).

stop_rounds

Integer. The number of consecutive rounds with no improvement in the performance metric before stopping the search.

reset_stop_rounds

Logical. If TRUE, the stopping rounds counter is reset each time an improvement is observed.

stop_metric

Character. The metric used for early stopping; the default is "auc". Other options include "aucpr" and "mcc".

seed

Integer. A random seed for reproducibility. Default is -1.

verbatim

Logical. If TRUE, the function prints additional progress information for debugging purposes.
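To show how these arguments fit together, here is a hedged sketch of a call that supplies the base models both ways and tunes the search and stopping controls. All values are illustrative, and hmda_grid1 and train are assumed to exist as in the Examples section; this is a sketch, not a prescribed configuration.

  # Base models can be supplied as the grid object itself ...
  ens <- hmda.autoEnsemble(models = hmda_grid1, training_frame = train)

  # ... or as a character vector of model IDs taken from the H2O grid
  ids <- as.character(h2o.getGrid("hmda_grid1")@model_ids)
  ens <- hmda.autoEnsemble(
    models = ids,
    training_frame = train,
    family = "binary",
    strategy = "search",
    model_selection_criteria = c("auc", "kappa"),  # rank models by two criteria only
    top_rank = seq(0.01, 0.50, 0.01),  # consider only the top half of ranked models
    min_improvement = 1e-05,           # continue while the metric improves at least this much
    stop_rounds = 5,                   # stop after 5 rounds without improvement
    reset_stop_rounds = TRUE,          # reset the counter whenever an improvement occurs
    stop_metric = "auc",
    seed = 2025,
    verbatim = TRUE                    # print progress information
  )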

Author

E. F. Haghish

Details

This wrapper function integrates with the HMDA package workflow to build a stacked ensemble model from a set of base models. It calls the ensemble() function from the autoEnsemble package to construct the ensemble. The function is designed to work within HMDA's framework, where base models are generated via grid search or AutoML. For more details on the autoEnsemble approach, see the GitHub repository at https://github.com/haghish/autoEnsemble.

The ensemble strategy "search" (default) searches for the best combination of top-performing and diverse models to improve overall performance. The wrapper returns both the final ensemble model and the list of top-ranked models used in the ensemble.
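For instance, a holdout frame can be passed via newdata so the ensemble is also evaluated during stacking; this is a sketch assuming train and test are prepared as in the Examples below.

  ens <- hmda.autoEnsemble(
    models = hmda_grid1,
    training_frame = train,
    newdata = test    # optional evaluation frame used during stacking
  )
  ens$top_models      # inspect which base models entered the ensemble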

Examples

if (FALSE) {
  library(HMDA)
  library(h2o)
  hmda.init()

  # Import a sample binary outcome dataset into H2O
  train <- h2o.importFile(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
  test <- h2o.importFile(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")

  # Identify predictors and response
  y <- "response"
  x <- setdiff(names(train), y)

  # For binary classification, response should be a factor
  train[, y] <- as.factor(train[, y])
  test[, y] <- as.factor(test[, y])

  # Define a hyperparameter grid for the GBM models
  params <- list(learn_rate = c(0.01, 0.1),
                 max_depth = c(3, 5, 9),
                 sample_rate = c(0.8, 1.0))

  # Train and validate a cartesian grid of GBMs
  hmda_grid1 <- hmda.grid(algorithm = "gbm", x = x, y = y,
                          grid_id = "hmda_grid1",
                          training_frame = train,
                          nfolds = 10,
                          ntrees = 100,
                          seed = 1,
                          hyper_params = params)

  # Assess the performances of the models
  grid_performance <- hmda.grid.analysis(hmda_grid1)

  # Return the best 2 models according to each metric
  hmda.best.models(grid_performance, n_models = 2)

  # Build an autoEnsemble model and test it on the testing dataset
  meta <- hmda.autoEnsemble(models = hmda_grid1, training_frame = train)
  print(h2o.performance(model = meta$model, newdata = test))
}
