ensemble: Builds Stacked Ensemble Model from H2O Models

Description

Stacks multiple trained H2O models to create an ensemble.

Usage

ensemble(
  models,
  training_frame,
  newdata = NULL,
  family = "binary",
  strategy = c("search"),
  model_selection_criteria = c("auc", "aucpr", "mcc", "f2"),
  min_improvement = 1e-05,
  max = NULL,
  top_rank = seq(0.01, 0.99, 0.01),
  stop_rounds = 3,
  reset_stop_rounds = TRUE,
  stop_metric = "auc",
  seed = -1,
  verbatim = FALSE
)
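
A call with this signature might look like the sketch below. It assumes a local H2O server can be started, that "ensemble" is provided by the "autoEnsemble" package (an assumption, since this page does not name the package), and that the "prostate.csv" example data shipped with the "h2o" package is available. The algorithm list, runtime limit, and seed values are illustrative choices only.

```r
library(h2o)
library(autoEnsemble)  # assumption: the package that exports ensemble()

# start (or connect to) a local H2O server
h2o.init()

# upload the example data shipped with the h2o package
prostate_path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate <- h2o.importFile(path = prostate_path, header = TRUE)

# the outcome must be a factor for binary classification
y <- "CAPSULE"
prostate[, y] <- as.factor(prostate[, y])

# train candidate base learners with AutoML; cross-validation predictions
# are kept because H2O stacking relies on them
aml <- h2o.automl(y = y, training_frame = prostate,
                  include_algos = c("GLM", "GBM", "DRF"),
                  max_runtime_secs = 60, nfolds = 10,
                  keep_cross_validation_predictions = TRUE,
                  seed = 2023)

# stack the top-performing models with the default "search" strategy
ens <- ensemble(models = aml, training_frame = prostate,
                strategy = "search", stop_metric = "auc", seed = 2023)
```

Because "newdata" is left as NULL here, model performance would be reported on the training frame.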

Value

a list including the ensemble model and the top-ranked models that were used to build it

Arguments

models: H2O search grid or AutoML grid or a character vector of H2O model IDs. the "h2o.get_ids" function from "h2otools" can retrieve the IDs from grids.
training_frame: h2o training frame (data.frame) for model training
newdata: h2o frame (data.frame). the data.frame must already be uploaded to the h2o server (cloud). when specified, this dataset will be used for evaluating the models. if not specified, model performance on the training dataset will be reported.
family: model family. currently only "binary" classification models are supported.
strategy: character. the currently available strategies are "search" (default) and "top". the "search" strategy searches for the best combination of top-performing diverse models, whereas the "top" strategy is simpler and just combines the specified top-performing diverse models without examining whether adding a larger number of models could further improve the ensemble. generally, the "search" strategy is preferable, unless the computation runtime is too large and optimization is not possible.
model_selection_criteria: character, specifying the performance metrics that should be taken into consideration for model selection. the default is "c('auc', 'aucpr', 'mcc', 'f2')". other possible criteria are "'f1point5', 'f3', 'f4', 'f5', 'kappa', 'mean_per_class_error', 'gini', 'accuracy'", which are also provided by the "evaluate" function.
min_improvement: numeric. specifies the minimum improvement in the model evaluation metric required to continue the optimization search.
max: integer. specifies the maximum number of models to be extracted for each criterion. the default value is the "top_rank" percentage for each model selection criterion.
top_rank: numeric vector. specifies the percentage of the top models that should be selected. if the strategy is "search", the algorithm searches for the best combination of models, moving from the top-ranked models to the bottom. however, if the strategy is "top", only the first value of the vector is used (default value is top 1%).
stop_rounds: integer. number of stopping rounds, in case the model stops improving.
reset_stop_rounds: logical. if TRUE, the stopping rounds penalty is reset to 0 every time the model improves.
stop_metric: character. model stopping metric. the default is "auc", but "aucpr" and "mcc" are also available.
seed: random seed (recommended)
verbatim: logical. if TRUE, it reports additional information about the progress of the model training, which is particularly useful for debugging.
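
The interaction between "min_improvement", "stop_rounds", and "reset_stop_rounds" can be illustrated with a small base R sketch. This is one plausible reading of the argument descriptions above, not the package's actual implementation: "search_stop" is a hypothetical helper, and the scores are made-up "stop_metric" values for successive search steps.

```r
# Hypothetical helper (not part of the package): given the stop_metric value
# obtained at each successive search step, find the step that would be kept
# and the step at which the search would stop.
search_stop <- function(scores, min_improvement = 1e-05, stop_rounds = 3,
                        reset_stop_rounds = TRUE) {
  best <- scores[1]   # best metric value seen so far
  best_step <- 1      # step at which the best value was observed
  steps_run <- 1      # number of steps evaluated
  penalty <- 0        # steps without a sufficient improvement
  for (i in seq_along(scores)[-1]) {
    steps_run <- i
    if (scores[i] - best > min_improvement) {
      # sufficient improvement: keep this step as the new best
      best <- scores[i]
      best_step <- i
      if (reset_stop_rounds) penalty <- 0
    } else {
      # no sufficient improvement: accumulate the stopping penalty
      penalty <- penalty + 1
    }
    if (penalty >= stop_rounds) break
  }
  list(best_step = best_step, best_score = best, steps_run = steps_run)
}

# made-up metric values for eight successive search steps
scores <- c(0.80, 0.82, 0.82, 0.83, 0.83, 0.83, 0.83, 0.90)

with_reset    <- search_stop(scores, reset_stop_rounds = TRUE)
without_reset <- search_stop(scores, reset_stop_rounds = FALSE)
```

With these made-up scores, both settings keep step 4 as the best step. The search stops after 7 steps when the penalty is reset on each improvement and after 6 steps when it is not, because in the latter case earlier non-improving rounds keep counting toward "stop_rounds".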

Author

E. F. Haghish