h2o (version 2.8.4.4)

h2o.gbm: H2O: Gradient Boosted Machines

Description

Builds gradient boosted classification trees and gradient boosted regression trees on a parsed data set.

Usage

h2o.gbm(x, y, distribution = "multinomial", data, key = "", n.trees = 10, 
  interaction.depth = 5, n.minobsinnode = 10, shrinkage = 0.1, n.bins = 20,
  group_split = TRUE, importance = FALSE, nfolds = 0, validation, holdout.fraction = 0,
  balance.classes = FALSE, max.after.balance.size = 5, class.sampling.factors = NULL,
  grid.parallelism = 1)

Arguments

x
A vector containing the names or indices of the predictor variables to use in building the GBM model.
y
The name or index of the response variable. If the data does not contain a header, this is the column index number starting at 0, and increasing from left to right. (The response must be either an integer or a categorical variable).
distribution
The type of GBM model to be produced: "multinomial" (default) for classification, "gaussian" for regression, and "bernoulli" for binary outcomes.
data
An H2OParsedData object containing the variables in the model.
key
(Optional) The unique hex key assigned to the resulting model. If none is given, a key will automatically be generated.
n.trees
(Optional) Number of trees to grow. Must be a nonnegative integer.
interaction.depth
(Optional) Maximum depth to grow the tree.
n.minobsinnode
(Optional) Minimum number of rows to assign to terminal nodes.
shrinkage
(Optional) A learning-rate parameter defining step size reduction.
n.bins
(Optional) Number of bins to use in building the histogram.
group_split
(Optional) Defaults to TRUE. If FALSE, categorical columns are split 1 vs. many rather than via bit-set group splitting.
importance
(Optional) A logical value indicating whether variable importance should be calculated. This will increase the amount of time for the algorithm to complete.
nfolds
(Optional) Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.
validation
(Optional) An H2OParsedData object indicating the validation dataset used to construct confusion matrix. If left blank, this defaults to the training data when nfolds = 0.
holdout.fraction
(Optional) Fraction of the training data to hold out for validation.
balance.classes
(Optional) Balance training data class counts via over/under-sampling (for imbalanced data).
max.after.balance.size
Maximum relative size of the training data after balancing class counts (can be less than 1.0).
class.sampling.factors
Desired over/under-sampling ratios per class (lexicographic order).
grid.parallelism
An integer between 1 and 4 (inclusive) indicating how many parallel threads to run during grid search.
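As a sketch of how the cross-validation argument is used (a running H2O cluster is assumed, as is the iris.csv file shipped in the package's extdata directory; leave validation empty whenever nfolds >= 2):

```r
library(h2o)
localH2O = h2o.init()

# 5-fold cross-validated multinomial GBM; `validation` is deliberately
# left unset because nfolds >= 2
irisPath = system.file("extdata", "iris.csv", package = "h2o")
iris.hex = h2o.importFile(localH2O, path = irisPath)
iris.gbm = h2o.gbm(x = 1:4, y = 5, data = iris.hex, nfolds = 5,
                   n.trees = 10, interaction.depth = 2,
                   distribution = "multinomial")
```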

Value

  • An object of class H2OGBMModel with slots key, data, valid (the validation dataset) and model, where the last is a list of the following components:
  • type: The type of the tree.
  • n.trees: Number of trees grown.
  • oob_err: Out-of-bag error rate.
  • forest: A matrix giving the minimum, mean, and maximum of the tree depth and number of leaves.
  • confusion: Confusion matrix of the prediction when a classification model is specified.
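The slots and model components above are accessed with the usual @ and $ operators; a minimal sketch, assuming a fitted H2OGBMModel named my.gbm as produced in the Examples section:

```r
# Inspect a fitted H2OGBMModel (here `my.gbm` stands in for any
# model returned by h2o.gbm)
my.gbm@key              # hex key of the model in the H2O cluster
my.gbm@model$n.trees    # number of trees grown
my.gbm@model$confusion  # confusion matrix (classification runs only)
```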

References

1. Elith, Jane, John R. Leathwick, and Trevor Hastie. "A Working Guide to Boosted Regression Trees." Journal of Animal Ecology 77.4 (2008): 802-813.

2. Friedman, Jerome, Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. "Discussion of Boosting Papers." Ann. Statist. 32 (2004): 102-107.

3. Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. Vol. 1. New York: Springer, 2001. http://www.stanford.edu/~hastie/local.ftp/Springer/OLD//ESLII_print4.pdf

See Also

For more information see: http://docs.h2o.ai/

Examples

# -- CRAN examples begin --
library(h2o)
localH2O = h2o.init()

# Run regression GBM on australia.hex data 
ausPath = system.file("extdata", "australia.csv", package="h2o")
australia.hex = h2o.importFile(localH2O, path = ausPath)
independent <- c("premax", "salmax","minairtemp", "maxairtemp", "maxsst", 
  "maxsoilmoist", "Max_czcs")
dependent <- "runoffnew"
h2o.gbm(y = dependent, x = independent, data = australia.hex, n.trees = 3, interaction.depth = 3, 
  n.minobsinnode = 2, shrinkage = 0.2, distribution= "gaussian")
# -- CRAN examples end --

# Run multinomial classification GBM on australia data 
h2o.gbm(y = dependent, x = independent, data = australia.hex, n.trees = 3, interaction.depth = 3, 
  n.minobsinnode = 2, shrinkage = 0.01, distribution= "multinomial")
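For imbalanced binary targets, balance.classes can be combined with a validation holdout; a hedged sketch using the prostate.csv file shipped in the package's extdata directory (CAPSULE is a 0/1 response; the predictor subset is chosen only for illustration):

```r
# Binary classification with class balancing and a 20% holdout split
prosPath = system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex = h2o.importFile(localH2O, path = prosPath)
prostate.gbm = h2o.gbm(y = "CAPSULE", x = c("AGE", "RACE", "PSA", "GLEASON"),
                       data = prostate.hex, distribution = "bernoulli",
                       n.trees = 50, shrinkage = 0.1,
                       balance.classes = TRUE, holdout.fraction = 0.2)
```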

# GBM variable importance
# Also see:
#   https://github.com/h2oai/h2o/blob/master/R/tests/testdir_demos/runit_demo_VI_all_algos.R
data.hex = h2o.importFile(
  localH2O,
  path = "https://raw.github.com/h2oai/h2o/master/smalldata/bank-additional-full.csv",
  key = "data.hex")
myX = 1:20
myY = "y"
my.gbm <- h2o.gbm(x = myX, y = myY, distribution = "bernoulli", data = data.hex, n.trees = 100,
                  interaction.depth = 2, shrinkage = 0.01, importance = TRUE)
gbm.VI = my.gbm@model$varimp
print(gbm.VI)
barplot(t(gbm.VI[1]), las = 2, main = "VI from GBM")
