The Automatic Machine Learning (AutoML) function automates the supervised machine learning model training process. The current version of AutoML trains and cross-validates a Random Forest, an Extremely-Randomized Forest, a random grid of Gradient Boosting Machines (GBMs), a random grid of Deep Neural Nets, and then trains a Stacked Ensemble using all of the models.
h2o.automl(x, y, training_frame, validation_frame = NULL,
leaderboard_frame = NULL, nfolds = 5, fold_column = NULL,
weights_column = NULL, max_runtime_secs = 3600, max_models = NULL,
stopping_metric = c("AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE",
"RMSLE", "AUC", "lift_top_group", "misclassification",
"mean_per_class_error"), stopping_tolerance = NULL, stopping_rounds = 3,
seed = NULL, project_name = NULL)A vector containing the names or indices of the predictor variables to use in building the model. If x is missing, then all columns except y are used.
The name or index of the response variable in the model. For classification, the y column must be a factor, otherwise regression will be performed. Indexes are 1-based in R.
Training data frame (or ID).
Validation data frame (or ID); Optional.
Leaderboard data frame (or ID). The Leaderboard will be scored using this data set. Optional.
Number of folds for k-fold cross-validation. Defaults to 5. Use 0 to disable cross-validation; this will also disable Stacked Ensemble (thus decreasing the overall model performance).
Column with cross-validation fold index assignment per observation; used to override the default, randomized, 5-fold cross-validation scheme for individual models in the AutoML run.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed.
Maximum allowed runtime in seconds for the entire model training process. Use 0 to disable. Defaults to 3600 secs (1 hour).
Maximum number of models to build in the AutoML process (does not include Stacked Ensembles). Defaults to NULL.
Metric to use for early stopping (AUTO is logloss for classification, deviance for regression). Must be one of "AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "lift_top_group", "misclassification", "mean_per_class_error". Defaults to AUTO.
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much). This value defaults to 0.001 if the dataset is at least 1 million rows; otherwise it defaults to a bigger value determined by the size of the dataset and the non-NA-rate. In that case, the value is computed as 1/sqrt(nrows * non-NA-rate).
Integer. Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k (stopping_rounds) scoring events. Defaults to 3 and must be an non-zero integer. Use 0 to disable early stopping.
Integer. Set a seed for reproducibility. AutoML can only guarantee reproducibility if max_models or early stopping is used because max_runtime_secs is resource limited, meaning that if the resources are not the same between runs, AutoML may be able to train more models on one run vs another.
Character string to identify an AutoML project. Defaults to NULL, which means a project name will be auto-generated based on the training frame ID.
AutoML finds the best model, given a training frame and response, and returns an H2OAutoML object, which contains a leaderboard of all the models that were trained in the process, ranked by a default model performance metric. Note that a Stacked Ensemble will be trained for regression and binary classification problems only since multiclass stacking is not yet supported.
# NOT RUN {
library(h2o)
h2o.init()
votes_path <- system.file("extdata", "housevotes.csv", package="h2o")
votes_hf <- h2o.uploadFile(path = votes_path, header = TRUE)
aml <- h2o.automl(y = "Class", training_frame = votes_hf, max_runtime_secs = 30)
# }
Run the code above in your browser using DataLab