h2o.automl: Automatic Machine Learning

Description

The Automatic Machine Learning (AutoML) function automates the supervised machine learning model training process. The current version of AutoML trains and cross-validates a Random Forest, an Extremely-Randomized Forest, a random grid of Gradient Boosting Machines (GBMs), a random grid of Deep Neural Nets, and then trains a Stacked Ensemble using all of the models.

Usage

h2o.automl(x, y, training_frame, validation_frame = NULL,
  leaderboard_frame = NULL, nfolds = 5, fold_column = NULL,
  weights_column = NULL, max_runtime_secs = 3600, max_models = NULL,
  stopping_metric = c("AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE",
  "RMSLE", "AUC", "lift_top_group", "misclassification",
  "mean_per_class_error"), stopping_tolerance = NULL, stopping_rounds = 3,
  seed = NULL, project_name = NULL)

Arguments

x

A vector containing the names or indices of the predictor variables to use in building the model. If x is missing, then all columns except y are used.

y

The name or index of the response variable in the model. For classification, the y column must be a factor; otherwise, regression will be performed. Indexes are 1-based in R.

training_frame

Training data frame (or ID).

validation_frame

Validation data frame (or ID). Optional.

leaderboard_frame

Leaderboard data frame (or ID). The Leaderboard will be scored using this data set. Optional.

nfolds

Number of folds for k-fold cross-validation. Defaults to 5. Use 0 to disable cross-validation; this will also disable Stacked Ensemble (thus decreasing the overall model performance).

fold_column

Column with cross-validation fold index assignment per observation; used to override the default, randomized, 5-fold cross-validation scheme for individual models in the AutoML run.

weights_column

Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed.
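The weight semantics described above can be illustrated outside of H2O. The following is a minimal sketch using base R's lm() rather than any H2O model, with a small made-up data frame (df, w, and the other names here are illustrative assumptions, not part of the h2o package). It checks that fitting with integer weights gives the same coefficients as fitting on a frame where each row is repeated according to its weight, with zero-weight rows dropped.

```r
# Toy data, invented for illustration only
df <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))
w  <- c(2, 1, 0, 1)  # row 1 counted twice, row 3 excluded

# Fit using observation weights
fit_weighted <- lm(y ~ x, data = df, weights = w)

# Fit on an expanded frame: each row repeated w[i] times (0 drops the row)
df_expanded  <- df[rep(seq_len(nrow(df)), times = w), ]
fit_expanded <- lm(y ~ x, data = df_expanded)

# The two sets of coefficients agree up to numerical tolerance
same_fit <- isTRUE(all.equal(coef(fit_weighted), coef(fit_expanded)))
```

This only demonstrates the general equivalence between weights and row repetition for a least-squares fit; how each H2O algorithm applies weights_column internally is not shown here.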

max_runtime_secs

Maximum allowed runtime in seconds for the entire model training process. Use 0 to disable. Defaults to 3600 secs (1 hour).

max_models

Maximum number of models to build in the AutoML process (does not include Stacked Ensembles). Defaults to NULL.

stopping_metric

Metric to use for early stopping (AUTO is logloss for classification, deviance for regression). Must be one of "AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "lift_top_group", "misclassification", "mean_per_class_error". Defaults to AUTO.

stopping_tolerance

Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much). This value defaults to 0.001 if the dataset is at least 1 million rows; otherwise it defaults to a bigger value determined by the size of the dataset and the non-NA-rate. In that case, the value is computed as 1/sqrt(nrows * non-NA-rate).
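As a rough sketch of the default described above, the rule can be written as a small base R function. The function name default_stopping_tolerance and its arguments are hypothetical and exist only for this illustration; the actual value is computed inside H2O.

```r
# Hypothetical helper mirroring the documented default for stopping_tolerance
default_stopping_tolerance <- function(nrows, non_na_rate = 1) {
  # Datasets with at least 1 million rows use a fixed tolerance
  if (nrows >= 1e6) {
    return(0.001)
  }
  # Smaller datasets use a larger, size-dependent tolerance
  1 / sqrt(nrows * non_na_rate)
}

tol_small <- default_stopping_tolerance(10000)       # 1 / sqrt(10000)
tol_na    <- default_stopping_tolerance(10000, 0.25) # 1 / sqrt(2500)
tol_large <- default_stopping_tolerance(2e6)         # fixed at 0.001
```

Note that the two branches meet at the boundary: with 1 million rows and no missing values, 1/sqrt(nrows) is also 0.001.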

stopping_rounds

Integer. Early stopping based on convergence of stopping_metric. Stop if the simple moving average of length k of the stopping_metric does not improve for k (stopping_rounds) scoring events. Defaults to 3 and must be a non-negative integer. Use 0 to disable early stopping.
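The moving-average rule can be sketched in base R as follows. This is a simplified illustration, not H2O's actual implementation: the function should_stop, its arguments, and the exact comparison are assumptions made for this example, and it assumes a positive metric where lower values are better (such as logloss).

```r
# Simplified sketch of moving-average early stopping (lower metric is better)
should_stop <- function(metric, k, tolerance = 0.001) {
  if (k == 0) return(FALSE)        # 0 disables early stopping
  n <- length(metric)
  if (n < 2 * k) return(FALSE)     # not enough scoring events yet

  # Simple moving averages of length k over the scoring history
  ma <- sapply(seq_len(n - k + 1), function(i) mean(metric[i:(i + k - 1)]))
  m  <- length(ma)

  reference <- ma[m - k]           # moving average k scoring events ago
  recent    <- ma[(m - k + 1):m]   # the k most recent moving averages

  # Stop when the best recent average has not improved on the reference
  # by at least the relative tolerance
  rel_improvement <- (reference - min(recent)) / reference
  rel_improvement < tolerance
}

plateau   <- c(1, 0.8, 0.6, rep(0.5, 7))    # metric stops improving
improving <- c(1, 0.9, 0.8, 0.7, 0.6, 0.5)  # metric keeps improving

stop_plateau   <- should_stop(plateau, k = 3)
stop_improving <- should_stop(improving, k = 3)
stop_disabled  <- should_stop(plateau, k = 0)
```

Averaging over k scoring events smooths out noise in individual scores, so a single bad scoring event does not end training on its own.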

seed

Integer. Set a seed for reproducibility. AutoML can only guarantee reproducibility if max_models or early stopping is used, because max_runtime_secs is resource limited: if the available resources differ between runs, AutoML may be able to train more models in one run than in another.

project_name

Character string to identify an AutoML project. Defaults to NULL, which means a project name will be auto-generated based on the training frame ID.

Value

An H2OAutoML object.

Details

AutoML finds the best model, given a training frame and response, and returns an H2OAutoML object, which contains a leaderboard of all the models that were trained in the process, ranked by a default model performance metric. Note that a Stacked Ensemble will be trained for regression and binary classification problems only, since multiclass stacking is not yet supported.

Examples

# NOT RUN {
library(h2o)
h2o.init()
votes_path <- system.file("extdata", "housevotes.csv", package="h2o")
votes_hf <- h2o.uploadFile(path = votes_path, header = TRUE)
aml <- h2o.automl(y = "Class", training_frame = votes_hf, max_runtime_secs = 30)
# }
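Building on the example above, the following sketch bounds the run with max_models and a seed instead of max_runtime_secs (the setting described under seed as needed for reproducibility) and then inspects the result. The wrapper function run_votes_automl is a hypothetical helper introduced only so that the code can be defined without starting a cluster. It assumes the h2o package is installed and that the returned object exposes the leaderboard and leader slots.

```r
# Hypothetical wrapper (not part of h2o): run a small, seeded AutoML job
run_votes_automl <- function(max_models = 5, seed = 1) {
  h2o::h2o.init()

  votes_path <- system.file("extdata", "housevotes.csv", package = "h2o")
  votes_hf   <- h2o::h2o.uploadFile(path = votes_path, header = TRUE)

  # Limit the run by model count so that repeated runs can be compared
  aml <- h2o::h2o.automl(y = "Class", training_frame = votes_hf,
                         max_models = max_models, seed = seed)

  # Leaderboard of all trained models, ranked by the default metric
  print(aml@leaderboard)

  # Predictions from the top-ranked model
  print(h2o::h2o.predict(aml@leader, votes_hf))

  aml
}
```

Calling run_votes_automl() starts (or connects to) a local H2O cluster and may take a while, depending on the machine.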