Multiple trained H2O models are stacked to create an ensemble.
ensemble(
models,
training_frame,
newdata = NULL,
family = "binary",
strategy = c("search"),
model_selection_criteria = c("auc", "aucpr", "mcc", "f2"),
min_improvement = 1e-05,
max = NULL,
top_rank = seq(0.01, 0.99, 0.01),
stop_rounds = 3,
reset_stop_rounds = TRUE,
stop_metric = "auc",
seed = -1,
verbatim = FALSE
)
A list including the ensemble model and the top-ranked models that were used to build it.
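A typical call might look like the following sketch. It assumes a running H2O cluster, that the h2o and h2otools packages are installed, and that ensemble() is already available in the session (the package that provides it is not named here). The prostate data shipped with h2o and the short AutoML run are illustrative choices only.

```r
library(h2o)
library(h2otools)  # provides h2o.get_ids()
# Assumption: ensemble() has already been attached from the package that provides it.

h2o.init()

# Illustrative data: the prostate example file shipped with the h2o package
prostate <- h2o.importFile(system.file("extdata", "prostate.csv", package = "h2o"))
y <- "CAPSULE"
prostate[, y] <- as.factor(prostate[, y])  # binary outcome, as required by family = "binary"

# Train candidate models. Cross-validation predictions are kept because
# H2O stacking builds on them.
aml <- h2o.automl(
  y = y,
  training_frame = prostate,
  max_runtime_secs = 30,
  nfolds = 10,
  keep_cross_validation_predictions = TRUE,
  seed = 2023
)

# Collect the model IDs from the AutoML run
ids <- h2o.get_ids(aml)

# Stack the models with the default "search" strategy
stack <- ensemble(
  models = ids,
  training_frame = prostate,
  strategy = "search",
  seed = 2023
)

# Inspect the returned list without assuming its element names
str(stack, max.level = 1)
```

Because no newdata is supplied here, performance is reported on the training frame.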
models: An H2O search grid, an AutoML grid, or a character vector of H2O model IDs.
The "h2o.get_ids" function from "h2otools" can
retrieve the IDs from grids.
training_frame: H2O training frame (data.frame) used for model training.
newdata: H2O frame (data.frame). The data.frame must already be uploaded to the H2O server (cloud). When specified, this dataset is used for evaluating the models; otherwise, model performance on the training dataset is reported.
family: Model family. Currently, only "binary" classification models
are supported.
strategy: Character. The currently available strategies are "search"
(default) and "top". The "search" strategy searches
for the best combination of top-performing diverse models,
whereas the "top" strategy is simpler: it
combines the specified percentage of top-performing diverse models without
checking whether including a
larger number of models would improve the ensemble further. Generally,
the "search" strategy is preferable, unless the computation
runtime is too long for the search to be feasible.
model_selection_criteria: Character vector specifying the performance metrics that
should be taken into consideration for model selection. The default is
"c('auc', 'aucpr', 'mcc', 'f2')". Other possible criteria are
"'f1point5', 'f3', 'f4', 'f5', 'kappa', 'mean_per_class_error', 'gini', 'accuracy'",
which are also provided by the "evaluate" function.
min_improvement: Numeric. Specifies the minimum improvement in the model evaluation metric required for the optimization search to continue.
max: Integer. Specifies the maximum number of models to be extracted for each criterion. By
default, the number is determined by the "top_rank" percentage for each model selection
criterion.
top_rank: Numeric vector. Specifies the percentage of the top models that
should be selected. If the strategy is "search", the
algorithm searches for the best combination of models,
moving from the top-ranked models to the bottom. However, if the strategy
is "top", only the first value of the vector is used
(default value is top 1%).
stop_rounds: Integer. Number of stopping rounds, applied when the model stops improving.
reset_stop_rounds: Logical. If TRUE, the stopping rounds penalty is reset to 0 every time the model improves.
stop_metric: Character. Model stopping metric. The default is "auc",
but "aucpr" and "mcc" are also available.
seed: Random seed (recommended).
verbatim: Logical. If TRUE, additional information about the progress of model training is reported, which is particularly useful for debugging.
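One possible reading of how the selection and stopping arguments interact can be sketched in base R, without H2O. The helpers select_top() and search_stop(), the max_n argument, and the toy scores below are all hypothetical; they illustrate the descriptions above and are not the function's actual implementation. In particular, rounding the top_rank share up and keeping at least one model per criterion is an assumption.

```r
# Hypothetical helper: pick the top models per criterion and pool them.
# Assumption: the top_rank share is rounded up and at least one model is kept.
select_top <- function(scores, criteria, top_rank = 0.01, max_n = NULL) {
  n <- nrow(scores)
  k <- if (is.null(max_n)) max(1, ceiling(n * top_rank)) else min(max_n, n)
  picked <- lapply(criteria, function(cr) {
    ord <- order(scores[[cr]], decreasing = TRUE)  # best model first
    scores$id[ord[seq_len(k)]]
  })
  unique(unlist(picked))  # a model ranked highly on several criteria counts once
}

# Hypothetical helper: walk through one metric value per search round and stop
# after `stop_rounds` rounds that fail to beat the best value so far by more
# than `min_improvement`.
search_stop <- function(metric, min_improvement = 1e-05, stop_rounds = 3,
                        reset_stop_rounds = TRUE) {
  best <- -Inf
  best_round <- 0
  penalty <- 0
  rounds_run <- 0
  for (i in seq_along(metric)) {
    rounds_run <- i
    if (metric[i] - best > min_improvement) {
      best <- metric[i]
      best_round <- i
      if (reset_stop_rounds) penalty <- 0  # improvement clears the penalty
    } else {
      penalty <- penalty + 1
      if (penalty >= stop_rounds) break
    }
  }
  list(best_round = best_round, rounds_run = rounds_run)
}

# Toy leaderboard (made-up values, for illustration only)
scores <- data.frame(
  id  = paste0("m", 1:10),
  auc = c(0.90, 0.85, 0.88, 0.70, 0.75, 0.60, 0.65, 0.80, 0.72, 0.68),
  mcc = c(0.50, 0.62, 0.40, 0.30, 0.35, 0.20, 0.66, 0.45, 0.33, 0.25),
  stringsAsFactors = FALSE
)

top_ids <- select_top(scores, c("auc", "mcc"), top_rank = 0.2)  # top 20% per criterion
top_one <- select_top(scores, c("auc", "mcc"), max_n = 1)       # cap at one model each

# Made-up stop_metric values, one per search round
auc_by_round <- c(0.80, 0.82, 0.81, 0.83, 0.829, 0.828, 0.827, 0.90)
with_reset <- search_stop(auc_by_round, reset_stop_rounds = TRUE)
no_reset   <- search_stop(auc_by_round, reset_stop_rounds = FALSE)
```

With these toy values, both runs keep round 4 as the best round, but the run with reset_stop_rounds = TRUE continues to round 7, while the run without resetting stops at round 6 because the earlier failed round still counts toward the penalty. Neither run reaches the higher value in round 8, which shows the trade-off a small stop_rounds implies.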
E. F. Haghish