
This function lets the user create a robust and fast model using H2O's AutoML function. The result is a list with the best model, its parameters, datasets, performance metrics, variable importances, and plots. If the target variable is categorical, classification models are trained; if it is continuous, regression models are trained.
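The choice between classification and regression described above can be pictured with a small base R sketch. This is not the lares implementation: `guess_model_type` is a hypothetical helper, and treating a count exactly equal to `thresh` as classification is an assumption, since the documentation only says "more than" and "less than".

```r
# Hypothetical helper (not part of lares): guess the model family from the
# number of distinct target values, following the documented `thresh` idea.
guess_model_type <- function(y, thresh = 5) {
  n_unique <- length(unique(y[!is.na(y)]))
  # Assumption: a count equal to `thresh` is treated as classification.
  if (n_unique <= thresh) "classification" else "regression"
}

guess_model_type(c("yes", "no", "yes"))   # 2 distinct values
guess_model_type(c(1, 2, 3, 1, 2, 3))     # 3 distinct values
guess_model_type(seq(0.5, 50, by = 0.5))  # 100 distinct values
```

Under these assumptions the first two calls fall below the default threshold of 5 and the third falls above it, so a numeric column with few distinct values is still handled as a classification target.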
h2o_automl(
df,
y = "tag",
ignore = c(),
train_test = NA,
split = 0.7,
weight = NULL,
target = "auto",
balance = FALSE,
impute = FALSE,
center = FALSE,
scale = FALSE,
seed = 0,
nfolds = 5,
thresh = 5,
max_models = 3,
max_time = 10 * 60,
start_clean = TRUE,
exclude_algos = c("StackedEnsemble", "DeepLearning"),
plots = TRUE,
alarm = TRUE,
quiet = FALSE,
save = FALSE,
subdir = NA,
project = "ML Project"
)
Dataframe. Dataframe containing all your data, including the dependent variable labeled as 'tag'. If you want to define which variable should be used instead, use the y parameter.
Variable or Character. Name of the dependent variable.
Character vector. Force columns for the model to ignore
Character. If needed, the name of the column in df containing 'train' and 'test' values used to split the data.
Numeric. Value between 0 and 1 to split as train/test datasets. Value is for training set. Set value to 1 to train with all available data and test with the same data (cross-validation will still be used when training).
Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed.
Value. Which is your target positive value? If set to 'auto', the target with the largest mean(score) will be selected. Change the value to override this. Only used for binary classification models.
Boolean. Auto-balance train dataset with under-sampling?
Boolean. Fill NA values with MICE?
Boolean. Using the base function scale, do you wish to center and/or scale all numerical values?
Integer. Set a seed for reproducibility. AutoML can only guarantee reproducibility if max_models is used because max_time is resource limited.
Integer. Number of folds for k-fold cross-validation of the models. If set to 0, the test data will be used as validation, and cross-validation and Stacked Ensembles will be disabled.
Integer. Threshold for selecting binary or regression models: this number is the threshold of unique values we should have in 'tag' (more than: regression; less than: classification)
Numeric. Max number of models and seconds you wish for the function to iterate. Note that max_models guarantees reproducibility and max_time does not (because it depends entirely on your machine's computational characteristics).
Boolean. Erase everything in the current h2o instance before we start to train models?
Vector of character strings. Algorithms to skip during the model-building phase. Set to NULL to use all algorithms.
Boolean. Create plots objects?
Boolean. Ping an alarm when ready! Needs beepr installed
Boolean. Quiet messages, warnings, recommendations?
Boolean. Do you wish to save/export results into your working directory?
Character. In which directory do you wish to save the results? Working directory as default.
Character. Your project's name
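The observation-weight semantics described for `weight` above can be checked on a toy problem. The sketch below uses base R's `lm()` rather than H2O, and the data are made up for illustration; it only shows that the coefficient estimates match, not that H2O's algorithms behave identically in every respect.

```r
# Toy data, invented for illustration only.
d <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(1.2, 1.9, 3.2, 3.8, 5.1, 6.3),
  w = c(1, 1, 2, 1, 1, 0)  # weight 2 on row 3, weight 0 on row 6
)

# Weighted least squares using the `w` column as observation weights.
fit_weighted <- lm(y ~ x, data = d, weights = w)

# The same information expressed by editing rows: repeat row 3, drop row 6.
d_expanded <- d[c(1, 2, 3, 3, 4, 5), ]
fit_expanded <- lm(y ~ x, data = d_expanded)

# Coefficient estimates agree up to floating-point tolerance.
all.equal(coef(fit_weighted), coef(fit_expanded))
```

Only the point estimates are compared here. Standard errors differ between the two fits because the residual degrees of freedom are not the same.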
Use the mplot_full() function to generate a dashboard with your model's results and metrics, or find them in the `plots` element of your `h2o_automl` object (be sure to set `plots` to `TRUE`).
"DRF" (Distributed Random Forest, including Random Forest (RF) and Extremely-Randomized Trees (XRT)), "GLM" (Generalized Linear Model), "XGBoost" (eXtreme Gradient Boosting), "GBM" (Gradient Boosting Machine), "DeepLearning" (Fully-connected multi-layer artificial neural network) and "StackedEnsemble".
Other Machine Learning: ROC(), clusterKmeans(), conf_mat(), export_results(), gain_lift(), h2o_predict_API(), h2o_predict_MOJO(), h2o_predict_binary(), h2o_predict_model(), h2o_results(), h2o_selectmodel(), impute(), iter_seeds(), lasso_vars(), model_metrics(), msplit()
# NOT RUN {
library(lares) # provides h2o_automl(), msplit() and the dft dataset
data(dft) # Titanic dataset
dft <- subset(dft, select = -c(Ticket, PassengerId, Cabin))
# Classification: Binomial - 2 Classes
r <- h2o_automl(dft, y = Survived, max_models = 1, impute = FALSE, target = "TRUE")
lapply(r, names)
# Classification: Multi-Categorical - 3 Classes
r <- h2o_automl(dft, Pclass, ignore = c("Fare", "Cabin"), max_time = 30, plots = FALSE)
# Regression: Continuous Values
r <- h2o_automl(dft, y = "Fare", ignore = c("Pclass"), exclude_algos = NULL)
# WITH PRE-DEFINED TRAIN/TEST DATAFRAMES
splits <- msplit(dft, size = 0.8)
splits$train$split <- "train"
splits$test$split <- "test"
df <- rbind(splits$train, splits$test)
r <- h2o_automl(df, "Survived", max_models = 1, train_test = "split")
# }