h2o.randomForest: Build a Big Data Random Forest Model

Description

Builds a random forest model on an H2OFrame.

Usage

h2o.randomForest(x, y, training_frame, model_id, validation_frame, checkpoint,
  mtries = -1, sample_rate = 0.632, build_tree_one_node = FALSE,
  ntrees = 50, max_depth = 20, min_rows = 1, nbins = 20,
  nbins_cats = 1024, binomial_double_trees = FALSE,
  balance_classes = FALSE, max_after_balance_size = 5, seed,
  offset_column = NULL, weights_column = NULL, nfolds = 0,
  fold_column = NULL, fold_assignment = c("AUTO", "Random", "Modulo"),
  keep_cross_validation_predictions = FALSE, ...)

Arguments

x

A vector containing the names or indices of the predictor variables to use in building the random forest model.

y

The name or index of the response variable. If the data does not contain a header, this is the column index number, starting at 1 and increasing from left to right. (The response must be either an integer or a categorical variable.)

training_frame

An H2OFrame object containing the variables in the model.

model_id

(Optional) The unique id assigned to the resulting model. If none is given, an id will automatically be generated.

validation_frame

(Optional) An H2OFrame object containing the data used to validate the model. It must contain the same variables as training_frame.

checkpoint

Model checkpoint (either a key or a previously built random forest model) to resume training with.

mtries

Number of variables randomly sampled as candidates at each split. If set to -1, defaults to sqrt(p) for classification and p/3 for regression, where p is the number of predictors.
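
As a rough illustration of the default described above, the following base-R sketch computes the value that mtries = -1 would resolve to. The helper name default_mtries is hypothetical, and rounding down with a lower bound of 1 is an assumption; the description gives only sqrt(p) and p/3.

```r
# Hypothetical helper: default number of candidate predictors per split.
# Assumption: non-integer results are rounded down and never fall below 1.
default_mtries <- function(p, classification = TRUE) {
  stopifnot(is.numeric(p), length(p) == 1, p >= 1)
  raw <- if (classification) sqrt(p) else p / 3
  max(1L, as.integer(floor(raw)))
}

default_mtries(16, classification = TRUE)   # sqrt(16) = 4
default_mtries(12, classification = FALSE)  # 12 / 3 = 4
```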

sample_rate

Fraction of training rows sampled for each tree, from 0 to 1.0.

build_tree_one_node

Run on one node only; this avoids network overhead but uses fewer CPUs. Suitable for small datasets.

ntrees

A nonnegative integer that determines the number of trees to grow.

max_depth

Maximum depth to grow the tree.

min_rows

Minimum number of rows to assign to terminal nodes.

nbins

For numerical columns (real/int), build a histogram of this many bins, then split at the best point.

nbins_cats

For categorical columns (enum), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.

binomial_double_trees

For binary classification: build twice as many trees (one per class), which can lead to higher accuracy.

balance_classes

Logical; indicates whether to balance training data class counts via over/under-sampling (for imbalanced data).

max_after_balance_size

Maximum relative size of the training data after balancing class counts (can be less than 1.0)

seed

Seed for random numbers (affects sampling). Note: results are only reproducible when running single-threaded.

offset_column

Specify the offset column.

weights_column

Specify the weights column.

nfolds

(Optional) Number of folds for cross-validation. If nfolds >= 2, then validation_frame must remain empty.

fold_column

(Optional) Column with cross-validation fold index assignment per observation

fold_assignment

Cross-validation fold assignment scheme, used if fold_column is not specified. Must be "AUTO", "Random" or "Modulo".
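
To make the two explicit schemes concrete, here is a small base-R sketch of how fold indices might be assigned. It assumes that "Modulo" assigns each row to a fold by its row number modulo nfolds and that "Random" draws a fold uniformly for each row. The function assign_folds is hypothetical, does not call H2O, and is not H2O's internal implementation.

```r
# Hypothetical illustration of fold assignment; not H2O's internal code.
assign_folds <- function(n_rows, nfolds, scheme = c("Modulo", "Random"), seed = 1) {
  scheme <- match.arg(scheme)
  stopifnot(n_rows >= 1, nfolds >= 2)
  if (scheme == "Modulo") {
    # Row i (0-based) goes to fold i mod nfolds: deterministic and evenly spread.
    (seq_len(n_rows) - 1L) %% nfolds
  } else {
    # Each row gets an independent uniform draw from 0..(nfolds - 1).
    # Note: set.seed() resets the global random number generator state.
    set.seed(seed)
    sample.int(nfolds, n_rows, replace = TRUE) - 1L
  }
}

assign_folds(7, 3, "Modulo")            # 0 1 2 0 1 2 0
table(assign_folds(1000, 5, "Random"))  # roughly equal counts per fold
```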

keep_cross_validation_predictions

Whether to keep the predictions of the cross-validation models

...

(Currently Unimplemented)

Value

Creates an H2OModel object of the right type.
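
A minimal usage sketch follows, assuming the h2o package is installed and a local H2O instance can be started. The wrapper name train_iris_forest, the built-in iris dataset, the 80/20 split, and the parameter values are illustrative choices rather than recommendations.

```r
# Hypothetical wrapper: fit a random forest on iris with a held-out validation frame.
train_iris_forest <- function(ntrees = 50, max_depth = 20, seed = 1234) {
  h2o::h2o.init()                     # start or connect to a local H2O instance
  iris_hex <- h2o::as.h2o(iris)       # copy the data.frame into an H2OFrame
  parts <- h2o::h2o.splitFrame(iris_hex, ratios = 0.8)  # roughly 80% train, 20% validation
  h2o::h2o.randomForest(
    x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
    y = "Species",                    # categorical response, so this is classification
    training_frame = parts[[1]],
    validation_frame = parts[[2]],
    ntrees = ntrees,
    max_depth = max_depth,
    seed = seed
  )
}
```

Calling model <- train_iris_forest() and then h2o::h2o.performance(model, valid = TRUE) should report metrics computed on the validation rows.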
