h2o_automl: Automated H2O's AutoML

Description

This function lets the user create a robust and fast model using H2O's AutoML function. The result is a list with the best model, its parameters, datasets, performance metrics, variable importances, and plots. If the dependent variable is categorical, classification models will be trained; if it is continuous, regression models will be trained.

Usage

h2o_automl(
  df,
  y = "tag",
  ignore = c(),
  train_test = NA,
  split = 0.7,
  weight = NULL,
  target = "auto",
  balance = FALSE,
  impute = FALSE,
  center = FALSE,
  scale = FALSE,
  seed = 0,
  nfolds = 5,
  thresh = 5,
  max_models = 3,
  max_time = 10 * 60,
  start_clean = TRUE,
  exclude_algos = c("StackedEnsemble", "DeepLearning"),
  plots = TRUE,
  alarm = TRUE,
  quiet = FALSE,
  save = FALSE,
  subdir = NA,
  project = "ML Project"
)

Arguments

df

Dataframe. Dataframe containing all your data, including the dependent variable, labeled 'tag' by default. To use a different column, set the y parameter.

y

Variable or Character. Name of the dependent variable.

ignore

Character vector. Columns the model should ignore.

train_test

Character. If needed, name of the column in df containing 'train' and 'test' values used to split the data.

split

Numeric. Value between 0 and 1 used to split the data into train/test datasets; the value is the share used for training. Set it to 1 to train with all available data and test with the same data (cross-validation will still be used when training).

weight

Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed.

target

Value. Which is your target positive value? If set to 'auto', the target with the largest mean(score) will be selected. Change the value to override. Only used for binary classification models.

balance

Boolean. Auto-balance train dataset with under-sampling?

impute

Boolean. Fill NA values with MICE?

center, scale

Boolean. Using the base function scale, do you wish to center and/or scale all numerical values?

seed

Integer. Set a seed for reproducibility. AutoML can only guarantee reproducibility if max_models is used because max_time is resource limited.

nfolds

Integer. Number of folds for k-fold cross-validation of the models. If set to 0, the test data will be used for validation, and cross-validation and Stacked Ensembles will be disabled.

thresh

Integer. Threshold for choosing between classification and regression models: the number of unique values in 'tag' is compared against it (more than: regression; less than: classification).

max_models, max_time

Numeric. Maximum number of models and maximum number of seconds the function may run. Note that max_models guarantees reproducibility and max_time does not (because it depends entirely on your machine's computational characteristics).

start_clean

Boolean. Erase everything in the current h2o instance before we start to train models?

exclude_algos

Vector of character strings. Algorithms to skip during the model-building phase. Set to NULL to use all.

plots

Boolean. Create plot objects?

alarm

Boolean. Play an alarm sound when ready? Requires the beepr package to be installed.

quiet

Boolean. Quiet messages, warnings, recommendations?

save

Boolean. Do you wish to save/export results into your working directory?

subdir

Character. In which directory do you wish to save the results? Working directory as default.

project

Character. Your project's name
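
Two of the argument descriptions above can be illustrated in base R without H2O. The sketch below is not the package's internal code: guess_model_type() is a hypothetical helper that mimics the thresh rule, and treating a count equal to thresh as classification is an assumption, since the description only covers "more than" and "less than". The second part checks the stated weight equivalence with weighted.mean().

```r
# Hypothetical helper, not part of the package: mimics the `thresh` rule.
# Assumption: a count of unique values equal to `thresh` counts as classification.
guess_model_type <- function(y, thresh = 5) {
  n_unique <- length(unique(y))
  if (n_unique > thresh) "Regression" else "Classification"
}

survived <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)      # 2 unique values
fare <- c(7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86)    # 7 unique values
type_survived <- guess_model_type(survived)
type_fare <- guess_model_type(fare)

# Weights: a weight of 2 acts like a duplicated row, a weight of 0 like a dropped row.
x <- c(10, 20, 30)
w <- c(2, 1, 0)
weighted_avg <- weighted.mean(x, w)       # uses the weights directly
expanded_avg <- mean(rep(x, times = w))   # repeats each value `w` times
```

With the default thresh = 5, a two-level column such as Survived falls on the classification side and a many-valued column such as Fare on the regression side, which matches the examples further down.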

Plot Results into Dashboard

Use the mplot_full() function to generate a dashboard with your model's results and metrics, or find them in the `plots` element of your `h2o_automl` object (be sure to set `plots` to `TRUE`).

List of algorithms

"DRF" (Distributed Random Forest, including Random Forest (RF) and Extremely-Randomized Trees (XRT)), "GLM" (Generalized Linear Model), "XGBoost" (eXtreme Gradient Boosting), "GBM" (Gradient Boosting Machine), "DeepLearning" (Fully-connected multi-layer artificial neural network) and "StackedEnsemble". Read more in the H2O AutoML documentation.
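
As a small illustration of how exclude_algos narrows this list, the base-R sketch below computes which algorithm families remain under the default setting. The vector all_algos is copied from the list above; remaining_algos() is a hypothetical helper name, and the real filtering is done by H2O, not by this code.

```r
# Algorithm identifiers as listed in this section
all_algos <- c("DRF", "GLM", "XGBoost", "GBM", "DeepLearning", "StackedEnsemble")

# Hypothetical helper: which algorithms are left after applying exclude_algos
remaining_algos <- function(exclude_algos = c("StackedEnsemble", "DeepLearning")) {
  # NULL excludes nothing, in line with "Set to NULL to use all"
  setdiff(all_algos, exclude_algos)
}

default_set <- remaining_algos()     # default exclusions from the Usage section
full_set <- remaining_algos(NULL)    # nothing excluded
```

Under the defaults shown in Usage, only the tree-based and linear families are left, which is why the regression example below passes exclude_algos = NULL to bring the other two back.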

Examples

# NOT RUN {
data(dft) # Titanic dataset
dft <- subset(dft, select = -c(Ticket, PassengerId, Cabin))

# Classification: Binomial - 2 Classes
r <- h2o_automl(dft, y = Survived, max_models = 1, impute = FALSE, target = "TRUE")
lapply(r, names)

# Classification: Multi-Categorical - 3 Classes
r <- h2o_automl(dft, Pclass, ignore = c("Fare", "Cabin"), max_time = 30, plots = FALSE)

# Regression: Continuous Values
r <- h2o_automl(dft, y = "Fare", ignore = c("Pclass"), exclude_algos = NULL)

# WITH PRE-DEFINED TRAIN/TEST DATAFRAMES
splits <- msplit(dft, size = 0.8)
splits$train$split <- "train"
splits$test$split <- "test"
df <- rbind(splits$train, splits$test)
r <- h2o_automl(df, "Survived", max_models = 1, train_test = "split")
# }