# h2o.gbm

##### Builds gradient boosted classification trees and gradient boosted regression trees on a parsed data set.

The default distribution function will guess the model type based on the response column type. To run properly, the response column must be numeric for "gaussian" or an enum for "bernoulli" or "multinomial".
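
The rule above can be illustrated in plain R. This is only a sketch of the described behaviour, not H2O's implementation: `guess_distribution` is a hypothetical helper, and the split between "bernoulli" (two levels) and "multinomial" (more than two) is an assumption about how the two enum cases are told apart.

```r
# Hypothetical helper (not part of h2o): mimics how distribution = "AUTO"
# is described as picking a model type from the response column type.
guess_distribution <- function(response) {
  if (is.factor(response)) {
    # Assumption: two levels -> "bernoulli", more than two -> "multinomial".
    if (nlevels(response) == 2) "bernoulli" else "multinomial"
  } else if (is.numeric(response)) {
    "gaussian"
  } else {
    stop("response must be numeric or a factor (enum)")
  }
}

numeric_guess <- guess_distribution(c(1.5, 2.0, 3.7))
binary_guess  <- guess_distribution(factor(c("yes", "no", "yes")))
multi_guess   <- guess_distribution(factor(c("a", "b", "c")))

# A 0/1 column stored as numbers is treated as a regression target
# until it is converted to a factor.
raw_guess    <- guess_distribution(c(0, 1, 1, 0))
factor_guess <- guess_distribution(as.factor(c(0, 1, 1, 0)))

print(c(numeric_guess, binary_guess, multi_guess, raw_guess, factor_guess))
```

In practice this means a 0/1 response should be converted with `as.factor()` before training if a classification model is intended.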

##### Usage

```
h2o.gbm(x, y, training_frame, model_id = NULL, validation_frame = NULL,
nfolds = 0, keep_cross_validation_predictions = FALSE,
keep_cross_validation_fold_assignment = FALSE,
score_each_iteration = FALSE, score_tree_interval = 0,
fold_assignment = c("AUTO", "Random", "Modulo", "Stratified"),
fold_column = NULL, ignore_const_cols = TRUE, offset_column = NULL,
weights_column = NULL, balance_classes = FALSE,
class_sampling_factors = NULL, max_after_balance_size = 5,
max_hit_ratio_k = 0, ntrees = 50, max_depth = 5, min_rows = 10,
nbins = 20, nbins_top_level = 1024, nbins_cats = 1024,
r2_stopping = Inf, stopping_rounds = 0, stopping_metric = c("AUTO",
"deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "lift_top_group",
"misclassification", "mean_per_class_error"), stopping_tolerance = 0.001,
max_runtime_secs = 0, seed = -1, build_tree_one_node = FALSE,
learn_rate = 0.1, learn_rate_annealing = 1, distribution = c("AUTO",
"bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie",
"laplace", "quantile", "huber"), quantile_alpha = 0.5,
tweedie_power = 1.5, huber_alpha = 0.9, checkpoint = NULL,
sample_rate = 1, sample_rate_per_class = NULL, col_sample_rate = 1,
col_sample_rate_change_per_level = 1, col_sample_rate_per_tree = 1,
min_split_improvement = 1e-05, histogram_type = c("AUTO",
"UniformAdaptive", "Random", "QuantilesGlobal", "RoundRobin"),
max_abs_leafnode_pred = Inf, pred_noise_bandwidth = 0,
categorical_encoding = c("AUTO", "Enum", "OneHotInternal", "OneHotExplicit",
"Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited"),
calibrate_model = FALSE, calibration_frame = NULL)
```

##### Arguments

- x
A vector containing the names or indices of the predictor variables to use in building the model. If x is missing, then all columns except y are used.

- y
The name of the response variable in the model. If the data does not contain a header, this is the column index, counted from the first column and increasing from left to right. (The response must be either an integer or a categorical variable.)

- training_frame
Id of the training data frame (Not required, to allow initial validation of model parameters).

- model_id
Destination id for this model; auto-generated if not specified.

- validation_frame
Id of the validation data frame.

- nfolds
Number of folds for N-fold cross-validation (0 to disable or >= 2). Defaults to 0.

- keep_cross_validation_predictions
`Logical`. Whether to keep the predictions of the cross-validation models. Defaults to FALSE.

- keep_cross_validation_fold_assignment
`Logical`. Whether to keep the cross-validation fold assignment. Defaults to FALSE.

- score_each_iteration
`Logical`. Whether to score during each iteration of model training. Defaults to FALSE.

- score_tree_interval
Score the model after every so many trees. Disabled if set to 0. Defaults to 0.

- fold_assignment
Cross-validation fold assignment scheme, if fold_column is not specified. The 'Stratified' option will stratify the folds based on the response variable, for classification problems. Must be one of: "AUTO", "Random", "Modulo", "Stratified". Defaults to AUTO.

- fold_column
Column with cross-validation fold index assignment per observation.

- ignore_const_cols
`Logical`. Ignore constant columns. Defaults to TRUE.

- offset_column
Offset column. This will be added to the combination of columns before applying the link function.

- weights_column
Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed.

- balance_classes
`Logical`. Balance training data class counts via over/under-sampling (for imbalanced data). Defaults to FALSE.

- class_sampling_factors
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.

- max_after_balance_size
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes. Defaults to 5.0.

- max_hit_ratio_k
Maximum number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable). Defaults to 0.

- ntrees
Number of trees. Defaults to 50.

- max_depth
Maximum tree depth. Defaults to 5.

- min_rows
Fewest allowed (weighted) observations in a leaf. Defaults to 10.

- nbins
For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point. Defaults to 20.

- nbins_top_level
For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by a factor of two per level. Defaults to 1024.

- nbins_cats
For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting. Defaults to 1024.

- r2_stopping
r2_stopping is no longer supported and will be ignored if set. Please use stopping_rounds, stopping_metric and stopping_tolerance instead. Previous versions of H2O would stop making trees when the R^2 metric equaled or exceeded this value. Defaults to 1.797693135e+308.

- stopping_rounds
Early stopping based on convergence of stopping_metric. Stop if the simple moving average of length k of the stopping_metric does not improve for k := stopping_rounds scoring events (0 to disable). Defaults to 0.

- stopping_metric
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression). Must be one of: "AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "lift_top_group", "misclassification", "mean_per_class_error". Defaults to AUTO.

- stopping_tolerance
Relative tolerance for the metric-based stopping criterion (stop if relative improvement is not at least this much). Defaults to 0.001.

- max_runtime_secs
Maximum allowed runtime in seconds for model training. Use 0 to disable. Defaults to 0.

- seed
Seed for random numbers (affects the stochastic parts of the algorithm, which might or might not be enabled by default). Defaults to -1 (time-based random number).

- build_tree_one_node
`Logical`. Run on one node only; no network overhead but fewer CPUs used. Suitable for small datasets. Defaults to FALSE.

- learn_rate
Learning rate (from 0.0 to 1.0). Defaults to 0.1.

- learn_rate_annealing
Scale the learning rate by this factor after each tree (e.g., 0.99 or 0.999). Defaults to 1.

- distribution
Distribution function. Must be one of: "AUTO", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber". Defaults to AUTO.

- quantile_alpha
Desired quantile for Quantile regression, must be between 0 and 1. Defaults to 0.5.

- tweedie_power
Tweedie power for Tweedie regression, must be between 1 and 2. Defaults to 1.5.

- huber_alpha
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1). Defaults to 0.9.

- checkpoint
Model checkpoint to resume training with.

- sample_rate
Row sample rate per tree (from 0.0 to 1.0). Defaults to 1.

- sample_rate_per_class
A list of row sample rates per class (relative fraction for each class, from 0.0 to 1.0), for each tree.

- col_sample_rate
Column sample rate (from 0.0 to 1.0). Defaults to 1.

- col_sample_rate_change_per_level
Relative change of the column sampling rate for every level (from 0.0 to 2.0). Defaults to 1.

- col_sample_rate_per_tree
Column sample rate per tree (from 0.0 to 1.0). Defaults to 1.

- min_split_improvement
Minimum relative improvement in squared error reduction for a split to happen. Defaults to 1e-05.

- histogram_type
What type of histogram to use for finding optimal split points. Must be one of: "AUTO", "UniformAdaptive", "Random", "QuantilesGlobal", "RoundRobin". Defaults to AUTO.

- max_abs_leafnode_pred
Maximum absolute value of a leaf node prediction. Defaults to 1.797693135e+308.

- pred_noise_bandwidth
Bandwidth (sigma) of Gaussian multiplicative noise ~N(1,sigma) for tree node predictions. Defaults to 0.

- categorical_encoding
Encoding scheme for categorical features. Must be one of: "AUTO", "Enum", "OneHotInternal", "OneHotExplicit", "Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited". Defaults to AUTO.

- calibrate_model
`Logical`. Use Platt Scaling to calculate calibrated class probabilities. Calibration can provide more accurate estimates of class probabilities. Defaults to FALSE.

- calibration_frame
Calibration frame for Platt Scaling.
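
Two of the arguments above describe simple numeric schedules: learn_rate_annealing scales the learning rate after each tree, and nbins_top_level is halved per tree level. The plain R sketch below works through that arithmetic. Both helpers are hypothetical and only reflect a literal reading of the argument descriptions (in particular, that the per-level bin count is floored at nbins); they do not call H2O and may differ from its internals.

```r
# Hypothetical helper: learning rate applied to each tree when the rate is
# multiplied by learn_rate_annealing after every tree.
tree_learn_rates <- function(learn_rate = 0.1, learn_rate_annealing = 1,
                             ntrees = 50) {
  learn_rate * learn_rate_annealing^(seq_len(ntrees) - 1)
}

# Hypothetical helper: histogram bins per tree level for numeric columns,
# starting at nbins_top_level at the root (level 0) and halving per level.
# Assumption: the count never drops below nbins.
bins_per_level <- function(nbins = 20, nbins_top_level = 1024, max_depth = 5) {
  levels <- 0:max_depth
  pmax(nbins, nbins_top_level %/% 2^levels)
}

rates <- tree_learn_rates(learn_rate = 0.1, learn_rate_annealing = 0.99,
                          ntrees = 50)
print(rates[c(1, 2, 50)])  # first, second and last tree

bins <- bins_per_level(nbins = 20, nbins_top_level = 1024, max_depth = 6)
print(bins)
```

With an annealing factor below 1, later trees contribute progressively less, so more trees may be needed to reach the same fit than with a constant learn_rate.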

##### See Also

`predict.H2OModel` for prediction

##### Examples

```
library(h2o)
h2o.init()
# Run regression GBM on australia.hex data
ausPath <- system.file("extdata", "australia.csv", package="h2o")
australia.hex <- h2o.uploadFile(path = ausPath)
independent <- c("premax", "salmax", "minairtemp", "maxairtemp", "maxsst",
                 "maxsoilmoist", "Max_czcs")
dependent <- "runoffnew"
h2o.gbm(y = dependent, x = independent, training_frame = australia.hex,
        ntrees = 3, max_depth = 3, min_rows = 2)
```
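
A classification run with a validation frame and early stopping might look like the following sketch. It uses only arguments listed in the usage above, but several details are assumptions rather than part of this page: that `prostate.csv` ships in the package's `extdata` folder alongside `australia.csv`, that its columns are named as shown, and that `h2o.splitFrame` and `h2o.logloss` are available in the installed version. The parameter values are arbitrary illustrations, not recommendations.

```r
library(h2o)
h2o.init()

# Assumption: prostate.csv is bundled with the h2o package under extdata.
prosPath <- system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex <- h2o.uploadFile(path = prosPath)

# Convert the 0/1 response to an enum so that distribution = "AUTO"
# treats this as a classification problem.
prostate.hex$CAPSULE <- as.factor(prostate.hex$CAPSULE)

# Hold out roughly 20% of the rows for validation.
splits <- h2o.splitFrame(prostate.hex, ratios = 0.8, seed = 1234)
train <- splits[[1]]
valid <- splits[[2]]

# Score every 5 trees and stop once the moving average of the validation
# logloss fails to improve by at least 0.1% over 3 scoring events.
model <- h2o.gbm(x = c("AGE", "RACE", "PSA", "GLEASON"), y = "CAPSULE",
                 training_frame = train, validation_frame = valid,
                 ntrees = 200, max_depth = 3, learn_rate = 0.05,
                 score_tree_interval = 5,
                 stopping_rounds = 3, stopping_metric = "logloss",
                 stopping_tolerance = 0.001, seed = 1234)

# Validation logloss of the final model.
print(h2o.logloss(model, valid = TRUE))
```

Because early stopping is enabled, the fitted model may contain fewer than the requested 200 trees.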

*Documentation reproduced from package h2o, version 3.10.5.3, License: Apache License (== 2.0)*