
h2o (version 3.0.0.22)

h2o.gbm: Gradient Boosted Machines

Description

Builds gradient boosted classification trees and gradient boosted regression trees on a parsed data set.

Usage

h2o.gbm(x, y, training_frame, model_id, distribution = c("AUTO", "gaussian",
  "bernoulli", "multinomial"), ntrees = 50, max_depth = 5, min_rows = 10,
  learn_rate = 0.1, nbins = 20, nbins_cats = 1024,
  validation_frame = NULL, balance_classes = FALSE,
  max_after_balance_size = 1, seed, nfolds, score_each_iteration, ...)

Arguments

x
A vector containing the names or indices of the predictor variables to use in building the GBM model.
y
The name or index of the response variable. If the data does not contain a header, this is the column index number starting at 0, and increasing from left to right. (The response must be either an integer or a categorical variable).
training_frame
An H2OFrame object containing the variables in the model.
model_id
(Optional) The unique id assigned to the resulting model. If none is given, an id will automatically be generated.
distribution
A character string specifying the loss function to be implemented. Must be "AUTO", "bernoulli", "multinomial", or "gaussian".
ntrees
A nonnegative integer that determines the number of trees to grow.
max_depth
Maximum depth to grow the tree.
min_rows
Minimum number of rows to assign to terminal nodes.
learn_rate
A number from 0.0 to 1.0 specifying the learning rate.
nbins
For numerical columns (real/int), build a histogram of this many bins, then split at the best point.
nbins_cats
For categorical columns (enum), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.
validation_frame
An H2OFrame object indicating the validation dataset used to construct the confusion matrix. If left blank, this defaults to the training data when nfolds = 0 (see the sketch after this argument list).
balance_classes
Logical; indicates whether or not to balance training data class counts via over/under-sampling (for imbalanced data).
max_after_balance_size
Maximum relative size of the training data after balancing class counts (can be less than 1.0)
seed
Seed for random numbers (affects sampling) - Note: only reproducible when running single threaded
nfolds
(Optional) Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty. (Currently not supported.)
score_each_iteration
Logical; attempts to score the model after each tree is built.
...
Extra arguments to pass on (currently not implemented).
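
To illustrate how the training, validation, and tuning arguments fit together, here is a brief sketch that is not part of the original help text. It assumes h2o.splitFrame() is available in this version of h2o and reuses the australia.csv file that ships with the package.

library(h2o)
localH2O <- h2o.init()
# Load the bundled example data and split off a validation set
ausPath <- system.file("extdata", "australia.csv", package = "h2o")
australia.hex <- h2o.uploadFile(localH2O, path = ausPath)
splits <- h2o.splitFrame(australia.hex, ratios = 0.75)  # assumed helper; 75/25 split
h2o.gbm(y = "runoffnew",
        x = c("premax", "salmax", "minairtemp", "maxairtemp", "maxsst",
              "maxsoilmoist", "Max_czcs"),
        training_frame = splits[[1]],
        validation_frame = splits[[2]],  # used to construct validation metrics
        ntrees = 50, max_depth = 5, learn_rate = 0.1, seed = 1234)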

Details

The default distribution function will guess the model type based on the response column type. In order to run properly, the response column must be numeric for "gaussian" or an enum for "bernoulli" or "multinomial".
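
As a minimal sketch of this behaviour (not from the original text; data.hex and its columns are hypothetical), coercing the response to a factor makes "AUTO" resolve to a classification distribution, while leaving it numeric yields "gaussian":

# Hypothetical frame: numeric response -> "gaussian"; enum response -> "bernoulli"/"multinomial"
data.hex$y <- as.factor(data.hex$y)  # make the response an enum for classification
h2o.gbm(y = "y", x = c("x1", "x2"), training_frame = data.hex, distribution = "AUTO")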

See Also

predict.H2OModel for prediction.
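
A small usage sketch (reusing the objects built in the Examples below):

gbm.model <- h2o.gbm(y = dependent, x = independent, training_frame = australia.hex)
predictions <- predict(gbm.model, australia.hex)  # score the training frame back
head(predictions)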

Examples

library(h2o)
localH2O <- h2o.init()

# Run regression GBM on australia.hex data
ausPath <- system.file("extdata", "australia.csv", package="h2o")
australia.hex <- h2o.uploadFile(localH2O, path = ausPath)
independent <- c("premax", "salmax","minairtemp", "maxairtemp", "maxsst",
                 "maxsoilmoist", "Max_czcs")
dependent <- "runoffnew"
h2o.gbm(y = dependent, x = independent, training_frame = australia.hex,
        ntrees = 3, max_depth = 3, min_rows = 2)
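
# The example above is a regression GBM. The following additional sketch is not
# part of the original page: it assumes the prostate.csv file bundled with the
# h2o package (with a 0/1 CAPSULE column) and fits a binomial classification GBM.
prosPath <- system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex <- h2o.uploadFile(localH2O, path = prosPath)
prostate.hex$CAPSULE <- as.factor(prostate.hex$CAPSULE)  # enum response for "bernoulli"
h2o.gbm(y = "CAPSULE", x = c("AGE", "RACE", "PSA", "GLEASON"),
        training_frame = prostate.hex, distribution = "bernoulli",
        ntrees = 10, max_depth = 4, learn_rate = 0.1)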
