This function lets the user create a robust and fast model using H2O's AutoML function. The result is a list with the best model, its parameters, datasets, performance metrics, variable importance, and plots. Read more about the h2o_automl() pipeline here.
h2o_automl(
df,
y = "tag",
ignore = NULL,
train_test = NA,
split = 0.7,
weight = NULL,
target = "auto",
balance = FALSE,
impute = FALSE,
no_outliers = TRUE,
unique_train = TRUE,
center = FALSE,
scale = FALSE,
thresh = 10,
seed = 0,
nfolds = 5,
max_models = 3,
max_time = 10 * 60,
start_clean = FALSE,
exclude_algos = c("StackedEnsemble", "DeepLearning"),
include_algos = NULL,
plots = TRUE,
alarm = TRUE,
quiet = FALSE,
print = TRUE,
save = FALSE,
subdir = NA,
project = "ML Project",
...
)

# S3 method for h2o_automl
plot(x, ...)
# S3 method for h2o_automl
print(x, importance = TRUE, ...)
df: Dataframe. Dataframe containing all your data, including the dependent variable labeled as 'tag'. If you want to define which variable should be used instead, use the y parameter.
y: Variable or Character. Name of the dependent variable.
ignore: Character vector. Force columns for the model to ignore.
train_test: Character. If needed, df's column name with 'test' and 'train' values to split.
split: Numeric. Value between 0 and 1 to split as train/test datasets. Value is for training set. Set value to 1 to train with all available data and test with same data (cross-validation will still be used when training). If train_test is set, value will be overwritten with its real split rate.
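As a rough illustration of what a split of 0.7 means, the base R sketch below samples 70% of the rows for training and keeps the rest for testing. It is not the lares implementation (the package's own examples use msplit() for this); the data frame `toy` and the other variable names are made up for the example.

```r
# Hypothetical toy data; not part of lares
set.seed(0)
toy <- data.frame(id = 1:100, value = rnorm(100))

split <- 0.7
# Sample row indices for the training set; the remaining rows form the test set
train_idx <- sample(nrow(toy), size = round(split * nrow(toy)))
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

nrow(train)
nrow(test)
```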
weight: Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed.
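The equivalence between integer weights and repeated rows can be checked on a simple statistic. The sketch below uses a weighted mean in base R as a stand-in; it only illustrates the weighting idea and says nothing about how H2O applies weights inside each algorithm. The vectors `x` and `w` are hypothetical.

```r
# Hypothetical values and their observation weights
x <- c(10, 20, 30)
w <- c(0, 1, 2)  # exclude the first value, count the third one twice

# Weighted mean computed directly from the weights
weighted <- weighted.mean(x, w)

# Same data with the weights expanded into repeated observations
expanded <- mean(rep(x, times = w))

all.equal(weighted, expanded)
```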
target: Value. Which is your target positive value? If set to 'auto', the target with the largest mean(score) will be selected. Change the value to overwrite. Only used for binary categorical models.
balance: Boolean. Auto-balance train dataset with under-sampling?
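For readers unfamiliar with under-sampling, here is a minimal base R sketch of the general idea: every class is reduced to the size of the smallest one. It is an assumption-laden illustration, not the code lares runs when balance = TRUE; the data frame `toy` and its columns are invented.

```r
# Hypothetical imbalanced data: 20 "yes" rows vs 80 "no" rows
set.seed(0)
toy <- data.frame(
  tag = c(rep("yes", 20), rep("no", 80)),
  x = rnorm(100)
)

# Size of the smallest class
n_min <- min(table(toy$tag))

# Keep n_min randomly chosen rows from each class
rows_by_class <- split(seq_len(nrow(toy)), toy$tag)
idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), n_min)]
}))
balanced <- toy[idx, ]

table(balanced$tag)
```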
impute: Boolean. Fill NA values with MICE?
no_outliers: Boolean/Numeric. Remove y's outliers from the dataset? Removes values that are farther than n standard deviations from the dependent variable's mean (Z-score). Set to TRUE for the default (3) or to a numeric value to set a different multiplier.
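The Z-score rule described for no_outliers can be sketched in a few lines of base R. This is only an illustration of the rule under the default multiplier of 3; the exact comparison lares uses (strict or non-strict) is not stated here, and the vector `y` is hypothetical.

```r
# Hypothetical target values with one extreme observation
y <- c(rep(10, 20), 1000)
n_sd <- 3  # multiplier; 3 is the documented default

# Z-score: distance from the mean in standard deviations
z <- (y - mean(y)) / sd(y)

# Keep only the values within n_sd standard deviations of the mean
keep <- abs(z) <= n_sd
y_clean <- y[keep]

length(y_clean)
```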
unique_train: Boolean. Keep only unique row observations for training data?
center, scale: Boolean. Using the base function scale, do you wish to center and/or scale all numerical values?
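The base function scale() mentioned above subtracts each column's mean (centering) and divides by its standard deviation (scaling). A small sketch, assuming a hypothetical data frame `toy` with one non-numeric column that should be left untouched:

```r
# Hypothetical data with two numeric columns and one character column
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30), g = c("x", "y", "z"))

# Find the numeric columns and standardize only those
num <- sapply(toy, is.numeric)
toy[num] <- lapply(toy[num], function(col) {
  as.numeric(scale(col, center = TRUE, scale = TRUE))
})

toy
```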
thresh: Integer. Threshold for selecting binary or regression models: this number is the threshold of unique values we should have in 'tag' (more than: regression; less than: classification).
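The thresh rule amounts to counting the unique values of the target. The helper below, `guess_type()`, is a hypothetical name written for this illustration and is not a lares function; how the package treats a count exactly equal to thresh is not specified above, so this sketch arbitrarily treats it as classification.

```r
# Hypothetical helper: pick a model type from the number of unique target values
guess_type <- function(y, thresh = 10) {
  if (length(unique(y)) > thresh) "regression" else "classification"
}

guess_type(c(0, 1, 1, 0))           # 2 unique values
guess_type(seq(0.5, 50, by = 0.5))  # 100 unique values
```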
seed: Integer. Set a seed for reproducibility. AutoML can only guarantee reproducibility if max_models is used because max_time is resource limited.
nfolds: Number of folds for k-fold cross-validation. Defaults to 5. Use 0 to disable cross-validation; this will also disable Stacked Ensemble (thus decreasing the overall model performance).
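To make the idea of k-fold cross-validation concrete, the base R sketch below assigns each row to one of nfolds folds of nearly equal size; each fold would then serve once as the hold-out set. H2O handles fold assignment internally, so this is a conceptual illustration only, with a made-up row count `n`.

```r
set.seed(0)
n <- 23       # hypothetical number of training rows
nfolds <- 5

# Repeat the fold labels 1..nfolds up to n rows, then shuffle them
folds <- sample(rep(seq_len(nfolds), length.out = n))

# Fold sizes differ by at most one row
table(folds)
```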
max_models, max_time: Numeric. Max number of models and seconds you wish for the function to iterate. Note that max_models guarantees reproducibility and max_time does not (because it depends entirely on your machine's computational characteristics).
start_clean: Boolean. Erase everything in the current h2o instance before we start to train models? You may want to keep other models or not. To group results into a custom common AutoML project, you may use the project_name argument.
exclude_algos, include_algos: Vector of character strings. Algorithms to skip or include during the model-building phase. Set NULL to ignore. When both are defined, only include_algos will be valid.
plots: Boolean. Create plots objects?
alarm: Boolean. Ping (sound) when done. Requires beepr.
quiet: Boolean. Quiet all messages, warnings, recommendations?
print: Boolean. Print summary when process ends?
save: Boolean. Do you wish to save/export results into your working directory?
subdir: Character. In which directory do you wish to save the results? Working directory as default.
project: Character. Your project's name.
...: Additional parameters on h2o::h2o.automl
x: h2o_automl object
importance: Boolean. Print important variables?
List. Trained model, predicted scores and datasets used, performance metrics, parameters, importance data.frame, seed, and plots when plots=TRUE.
DRF: Distributed Random Forest, including Random Forest (RF) and Extremely-Randomized Trees (XRT)
GLM: Generalized Linear Model
XGBoost: eXtreme Gradient Boosting
GBM: Gradient Boosting Machine
DeepLearning: Fully-connected multi-layer artificial neural network
StackedEnsemble: Stacked Ensemble
Use the print method to print model stats and summary.
Use the plot method to plot results using mplot_full().
Other Machine Learning: ROC(), conf_mat(), export_results(), gain_lift(), h2o_predict_API(), h2o_predict_MOJO(), h2o_predict_binary(), h2o_predict_model(), h2o_selectmodel(), impute(), iter_seeds(), lasso_vars(), model_metrics(), model_preprocess(), msplit()
# NOT RUN {
# CRAN
data(dft) # Titanic dataset
dft <- subset(dft, select = -c(Ticket, PassengerId, Cabin))
# Classification: Binomial - 2 Classes
r <- h2o_automl(dft, y = Survived, max_models = 1, impute = FALSE, target = "TRUE")
# Let's see all the stuff we have inside:
lapply(r, names)
# Classification: Multi-Categorical - 3 Classes
r <- h2o_automl(dft, Pclass, ignore = c("Fare", "Cabin"), max_time = 30, plots = FALSE)
# Regression: Continuous Values
r <- h2o_automl(dft, y = "Fare", ignore = c("Pclass"), exclude_algos = NULL, quiet = TRUE)
print(r)
# WITH PRE-DEFINED TRAIN/TEST DATAFRAMES
splits <- msplit(dft, size = 0.8)
splits$train$split <- "train"
splits$test$split <- "test"
df <- rbind(splits$train, splits$test)
r <- h2o_automl(df, "Survived", max_models = 1, train_test = "split")
# }