lares (version 4.7)

h2o_automl: Automated H2O's AutoML

Description

This function lets the user create a robust and fast model using H2O's AutoML function. The result is a list with the best model, its parameters, datasets, performance metrics, variable importances, and other useful metrics.

Usage

h2o_automl(df, y = "tag", ignore = c(), train_test = NA,
  split = 0.7, weight = NULL, balance = FALSE, impute = FALSE,
  center = FALSE, scale = FALSE, seed = 0, nfolds = 5,
  thresh = 5, max_time = 5 * 60, max_models = 10,
  start_clean = TRUE, exclude_algos = c("StackedEnsemble",
  "DeepLearning"), plots = TRUE, alarm = TRUE, quiet = FALSE,
  save = FALSE, subdir = NA, project = "ML Project")

Arguments

df

Dataframe. Dataframe containing all your data, including the dependent (target) variable labeled as 'tag'. If you want to use a different column as the target, set the y parameter.

y

Character. Name of the dependent (target) variable

ignore

Character vector. Columns that the model must ignore

train_test

Character. If needed, name of the column in df that holds 'test' and 'train' values and defines the split

split

Numeric. Value between 0 and 1 used to split the data into train and test datasets. The value is the proportion assigned to the training set.
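
The two ways of defining the train/test partition described above can be sketched in base R. This is only an illustration of the idea, not the code used inside h2o_automl; the toy data and the column name part are assumptions made for the example.

```r
# Toy data; `part` is a hypothetical column name used only for illustration.
df <- data.frame(x = 1:10, tag = rep(c(0, 1), 5))

# Case 1: random split, with 70% of the rows going to the training set
# (the role of split = 0.7).
set.seed(0)
split <- 0.7
train_idx <- sample(nrow(df), size = round(split * nrow(df)))
train <- df[train_idx, ]
test <- df[-train_idx, ]

# Case 2: a predefined split stored in a column holding 'train' and 'test'
# values (the role of train_test).
df$part <- ifelse(df$x <= 8, "train", "test")
train_fixed <- df[df$part == "train", ]
test_fixed <- df[df$part == "test", ]
```

A predefined column is convenient when the same partition must be reused across several runs or compared against results produced elsewhere.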

weight

Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed.
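
The equivalences stated above can be checked on a small example. The sketch below uses base R's lm rather than H2O, purely to illustrate how observation weights behave; the data and column names are made up for the example.

```r
# Toy data for illustration only; w2 and w0 are hypothetical weight columns.
d <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.5),
  w2 = c(1, 1, 1, 1, 2),  # row 5 counts twice
  w0 = c(0, 1, 1, 1, 1)   # row 1 is effectively excluded
)

# A weight of 2 on row 5 versus physically repeating row 5.
fit_weighted <- lm(y ~ x, data = d, weights = w2)
fit_repeated <- lm(y ~ x, data = d[c(1:5, 5), ])

# A weight of 0 on row 1 versus dropping row 1.
fit_zero <- lm(y ~ x, data = d, weights = w0)
fit_dropped <- lm(y ~ x, data = d[-1, ])

# Compare the estimated coefficients of each pair.
same_when_doubled <- isTRUE(all.equal(coef(fit_weighted), coef(fit_repeated)))
same_when_zeroed <- isTRUE(all.equal(coef(fit_zero), coef(fit_dropped)))
```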

balance

Boolean. Auto-balance train dataset with under-sampling?
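
Under-sampling generally means reducing the larger classes to the size of the smallest one. The base R sketch below shows that idea; it is an assumption about the general technique, not the exact procedure used by h2o_automl.

```r
# Toy, imbalanced training set: 80 rows of class "a" and 20 of class "b".
set.seed(0)
train <- data.frame(x = rnorm(100), tag = rep(c("a", "b"), times = c(80, 20)))

# Size of the smallest class.
n_min <- min(table(train$tag))

# Row indices grouped by class, then a random sample of n_min rows per class.
rows_by_class <- split(seq_len(nrow(train)), train$tag)
keep <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), n_min)]
}))

balanced <- train[keep, ]
class_counts <- table(balanced$tag)
```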

impute

Boolean. Fill NA values with MICE?

center, scale

Boolean. Using the base function scale, do you wish to center and/or scale all numerical values?
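
As a rough illustration, centering and scaling only the numeric columns of a data frame with the base function scale could look like the sketch below. The data frame and its column names are invented for the example, and how h2o_automl selects the columns internally is not shown here.

```r
# Toy data: two numeric columns and one character column.
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(10, 20, 30, 40),
  g = c("u", "v", "u", "v")
)

# Identify the numeric columns; non-numeric columns are left untouched.
num_cols <- vapply(df, is.numeric, logical(1))

# scale() returns a matrix, so convert it back before assigning.
scaled <- df
scaled[num_cols] <- as.data.frame(scale(df[num_cols], center = TRUE, scale = TRUE))
```

After this step each numeric column has mean 0 and standard deviation 1, while g keeps its original values.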

seed

Integer. Set a seed for reproducibility. AutoML can only guarantee reproducibility when the run is limited by max_models, because a max_time limit depends on the available computing resources.

nfolds

Integer. Number of folds for k-fold cross-validation of the models. If set to 0, the test data will be used for validation, and cross-validation and Stacked Ensembles will be disabled.
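
For readers unfamiliar with k-fold cross-validation, the sketch below shows one simple way to assign rows to folds in base R. It only illustrates the concept behind nfolds; H2O performs its own fold assignment, which may differ.

```r
set.seed(0)
n <- 20       # number of training rows in this toy example
nfolds <- 5

# Give every row a fold label from 1 to nfolds, in random order.
fold_id <- sample(rep(seq_len(nfolds), length.out = n))

# For fold k, rows with fold_id == k are held out for validation and the
# remaining rows are used for training.
sizes <- vapply(seq_len(nfolds), function(k) sum(fold_id == k), integer(1))
```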

thresh

Integer. Threshold for choosing between classification and regression models, based on the number of unique values in 'tag': more unique values than the threshold selects regression; fewer selects classification.
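
The rule above can be sketched as a small helper. The function name guess_model_type is hypothetical, and the behaviour when the number of unique values is exactly equal to thresh is an assumption (treated as classification here), since the description does not specify it.

```r
# Hypothetical helper: decide the model type from the number of unique
# values in the target. The tie case (n_unique == thresh) is an assumption.
guess_model_type <- function(y, thresh = 5) {
  n_unique <- length(unique(y))
  if (n_unique > thresh) "regression" else "classification"
}

type_binary <- guess_model_type(c(0, 1, 1, 0, 1))
type_numeric <- guess_model_type(c(1.2, 3.4, 5.6, 7.8, 9.1, 2.2, 4.4))
```

With the default thresh = 5, a 0/1 target has two unique values and is treated as classification, while a target with seven distinct numeric values is treated as regression.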

max_time

Numeric. Maximum time, in seconds, that the function is allowed to run

max_models

Numeric. Maximum number of models for the function to create

start_clean

Boolean. Erase everything in the current h2o instance before we start to train models?

exclude_algos

Vector of character strings. Algorithms to skip during the model-building phase. Set to NULL to use all algorithms

plots

Boolean. Create plot objects?

alarm

Boolean. Ping an alarm when ready!

quiet

Boolean. Quiet messages, warnings, recommendations?

save

Boolean. Do you wish to save/export results into your working directory?

subdir

Character. In which directory do you wish to save the results? Working directory as default.

project

Character. Your project's name

Details

Full list of algorithms: "DRF" (Distributed Random Forest, including Random Forest (RF) and Extremely-Randomized Trees (XRT)), "GLM" (Generalized Linear Model), "XGBoost" (eXtreme Gradient Boosting), "GBM" (Gradient Boosting Machine), "DeepLearning" (Fully-connected multi-layer artificial neural network) and "StackedEnsemble". Read more: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
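
To see which algorithms remain under the default exclude_algos value shown in Usage, the list above can be compared against it in base R. This is only a bookkeeping sketch; the variable names are chosen for the example.

```r
# Algorithm names as listed in Details.
all_algos <- c("DRF", "GLM", "XGBoost", "GBM", "DeepLearning", "StackedEnsemble")

# Default value of exclude_algos from the Usage section.
exclude_algos <- c("StackedEnsemble", "DeepLearning")

# Algorithms left available with the default exclusions.
used <- setdiff(all_algos, exclude_algos)

# With exclude_algos = NULL nothing is removed.
used_all <- setdiff(all_algos, NULL)
```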

See Also

Other Machine Learning: ROC, clusterKmeans, conf_mat, export_results, gain_lift, h2o_predict_API, h2o_predict_MOJO, h2o_predict_binary, h2o_predict_model, h2o_selectmodel, impute, iter_seeds, model_metrics, mplot_conf, mplot_cuts_error, mplot_cuts, mplot_density, mplot_full, mplot_gain, mplot_importance, mplot_lineal, mplot_metrics, mplot_response, mplot_roc, mplot_splits, msplit