Usage
h2o.gbm(x, y, training_frame, model_id, checkpoint, distribution = c("AUTO",
"gaussian", "bernoulli", "multinomial", "poisson", "gamma", "tweedie"),
tweedie_power = 1.5, ntrees = 50, max_depth = 5, min_rows = 10,
learn_rate = 0.1, nbins = 20, nbins_cats = 1024,
validation_frame = NULL, balance_classes = FALSE,
max_after_balance_size = 1, seed, build_tree_one_node = FALSE,
nfolds = 0, fold_column = NULL, fold_assignment = c("AUTO", "Random",
"Modulo"), keep_cross_validation_predictions = FALSE,
score_each_iteration = FALSE, offset_column = NULL,
weights_column = NULL, ...)
Arguments
x
A vector containing the names or indices of the predictor variables to use in building the GBM model.
y
The name or index of the response variable. If the data does not contain a header, this is the column index
number starting at 0, and increasing from left to right. (The response must be either an integer or a
categorical variable).
training_frame
An H2OFrame object containing the variables in the model.
model_id
(Optional) The unique id assigned to the resulting model. If none is given, an id will automatically be generated.
checkpoint
Model checkpoint (either a key or a previously built GBM model) to resume training with.
distribution
A character string. The distribution function of the response. Must be "AUTO", "gaussian", "bernoulli", "multinomial", "poisson", "gamma" or "tweedie".
tweedie_power
Tweedie power (only for Tweedie distribution, must be between 1 and 2)
ntrees
A nonnegative integer that determines the number of trees to grow.
max_depth
Maximum depth to grow the tree.
min_rows
Minimum number of rows to assign to terminal nodes.
learn_rate
The learning rate, a number from 0.0 to 1.0.
nbins
For numerical columns (real/int), build a histogram of this many bins, then split at the best point
nbins_cats
For categorical columns (enum), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.
validation_frame
An H2OFrame object indicating the validation dataset used to construct the confusion matrix. If left blank, this defaults to the training data when nfolds = 0.
balance_classes
Logical. Indicates whether or not to balance training data class counts via over/under-sampling (for imbalanced data).
max_after_balance_size
Maximum relative size of the training data after balancing class counts (can be less than 1.0).
seed
Seed for random numbers (affects sampling when balance_classes = TRUE).
build_tree_one_node
Run on one node only; no network overhead, but fewer CPUs used. Suitable for small datasets.
nfolds
(Optional) Number of folds for cross-validation. If nfolds >= 2, then validation_frame must remain empty.
fold_column
(Optional) Column with cross-validation fold index assignment per observation
fold_assignment
Cross-validation fold assignment scheme, used if fold_column is not specified. Must be "AUTO", "Random" or "Modulo".
keep_cross_validation_predictions
Whether to keep the predictions of the cross-validation models
score_each_iteration
Attempts to score each tree.
offset_column
Specify the offset column.
weights_column
Specify the weights column.
...
Extra arguments to pass on (currently not implemented).
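Examples
A minimal sketch of the two validation styles the arguments describe: a hold-out validation_frame with nfolds = 0, and cross-validation with nfolds >= 2 and no validation_frame. It assumes the h2o package is installed and that a local H2O cluster can be started with h2o.init(). The iris data, the 80/20 split, the seed and the variable names are illustrative choices, not part of the documented interface.

```r
library(h2o)
h2o.init()  # start or connect to a local H2O cluster (assumed to be available)

# Illustrative data: iris ships with R; Species is a 3-level factor,
# so the response is categorical and "multinomial" is a suitable distribution.
iris_hex <- as.h2o(iris)
predictors <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Style 1: hold-out validation. nfolds keeps its default of 0 and the
# validation data is passed explicitly through validation_frame.
parts <- h2o.splitFrame(iris_hex, ratios = 0.8)  # approximate 80/20 random split
train <- parts[[1]]
valid <- parts[[2]]

model_valid <- h2o.gbm(
  x = predictors,
  y = "Species",
  training_frame = train,
  validation_frame = valid,
  distribution = "multinomial",
  ntrees = 50,
  max_depth = 5,
  min_rows = 10,
  learn_rate = 0.1,
  seed = 1234
)

# Style 2: 5-fold cross-validation. validation_frame is left unset,
# as required when nfolds >= 2.
model_cv <- h2o.gbm(
  x = predictors,
  y = "Species",
  training_frame = iris_hex,
  distribution = "multinomial",
  ntrees = 50,
  nfolds = 5,
  fold_assignment = "Modulo",
  seed = 1234
)

# Score the hold-out rows with the first model and inspect both fits.
preds <- h2o.predict(model_valid, valid)
print(model_valid)
print(model_cv)
```

The values passed for ntrees, max_depth, min_rows and learn_rate simply restate the defaults from the Usage section so that the call is explicit; in practice they would be tuned for the data at hand.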