h2o.randomForest: H2O: Random Forest

Description

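Builds a random forest model on a parsed H2O data set. Classification is the default; where the classification argument is available, setting it to FALSE builds a regression model instead.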

Usage

## Default method:
h2o.randomForest(x, y, data, classification = TRUE, ntree = 50, depth = 20, 
  sample.rate = 2/3, classwt = NULL, nbins = 100, seed = -1, importance = FALSE, 
  validation, nodesize = 1, balance.classes = FALSE, max.after.balance.size = 5,
  use_non_local = TRUE, version = 2)

## For data imported to a ValueArray object:
h2o.randomForest.VA(x, y, data, ntree = 50, depth = 20, sample.rate = 2/3, 
  classwt = NULL, nbins = 100, seed = -1, use_non_local = TRUE)

## For data imported to a FluidVecs object:
h2o.randomForest.FV(x, y, data, classification = TRUE, ntree = 50, depth = 20, 
  sample.rate = 2/3, nbins = 100, seed = -1, importance = FALSE, validation, 
  nodesize = 1, balance.classes = FALSE, max.after.balance.size = 5)

Arguments

x

A vector containing the names or indices of the predictor variables to use in building the random forest model.

y

The name or index of the response variable. If the data does not contain a header, this is the column index, counted from left to right. The response must be either an integer or a categorical variable.

data

An H2OParsedDataVA (version = 1) or H2OParsedData (version = 2) object containing the variables in the model.

classification

(Optional) A logical value indicating whether a classification model should be built (as opposed to regression).

ntree

(Optional) Number of trees to grow. (Must be a nonnegative integer).

depth

(Optional) Maximum depth to which each tree is grown.

sample.rate

(Optional) Fraction of the training rows sampled to grow each individual tree.

classwt

(Optional) Numeric vector of class weights for a categorical response.

nbins

(Optional) Build a histogram of this many bins, then split at best point.

seed

(Optional) Seed for building the random forest. If seed = -1, one will automatically be generated by H2O.

importance

(Optional) A logical value indicating whether to calculate variable importance. Set to FALSE to speed up computations.

validation

(Optional) An H2OParsedDataVA (version = 1) or H2OParsedData (version = 2) object indicating the validation dataset used to construct the confusion matrix.

nodesize

(Optional) Number of nodes to use for computation.

balance.classes

(Optional) A logical value indicating whether to balance training data class counts via over/under-sampling (for imbalanced data).

max.after.balance.size

(Optional) Maximum relative size of the training data after balancing class counts (can be less than 1.0).

use_non_local

(Optional) Logical value indicating whether to use non-local data in building random forest model.

version

(Optional) The version of random forest to run. version = 1 runs the single-node ValueArray implementation, while version = 2 selects the distributed FluidVecs implementation, which is still in beta.
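
As a rough picture of what sample.rate and nbins control, the base-R sketch below draws a row sample for one tree and then evaluates histogram bin boundaries as candidate split points. It is not H2O's implementation: sampling without replacement, equal-width bins, Gini impurity, and the helper names gini and split_impurity are all assumptions made for illustration, and R's built-in iris data frame stands in for a parsed H2O data set.

```r
# Illustrative only: NOT H2O's implementation, just a base-R picture of
# what sample.rate and nbins control.
set.seed(42)                       # fixed seed so the sketch is reproducible
sample.rate <- 2/3
nbins <- 100

# sample.rate: each tree sees roughly this fraction of the training rows.
# (Sampling without replacement is an assumption made for this sketch.)
n <- nrow(iris)
rows_for_tree <- sample(n, size = round(sample.rate * n))

# nbins: bucket a numeric predictor into equal-width bins and treat the
# interior bin boundaries as candidate split points.
x <- iris$Petal.Length[rows_for_tree]
y <- iris$Species[rows_for_tree]
breaks <- seq(min(x), max(x), length.out = nbins + 1)
candidates <- breaks[-c(1, length(breaks))]

# Gini impurity of a set of class labels.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted impurity of the two children produced by splitting at `cut`.
split_impurity <- function(cut) {
  left <- y[x < cut]
  right <- y[x >= cut]
  (length(left) * gini(left) + length(right) * gini(right)) / length(y)
}

# Score every candidate and keep the one with the lowest impurity.
scores <- vapply(candidates, split_impurity, numeric(1))
best_cut <- candidates[which.min(scores)]
best_cut
```

A larger nbins gives more candidate split points per predictor at the cost of more work per split, while a smaller sample.rate makes individual trees cheaper to grow and less alike.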

Value

An object of class H2ORFModelVA (version = 1) or H2ODRFModel (version = 2) with slots key, data, and model, where the last is a list of the following components:
ntree

Number of trees grown.

mse

Mean-squared error for each tree.

forest

A matrix giving the minimum, mean, and maximum of the tree depth and number of leaves.

confusion

Confusion matrix of the prediction.
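
The components listed above can be read from the model slot with the usual S4 and list accessors. The sketch below is a hypothetical helper, summarize_rf, that is not part of the h2o package. So that it can be tried without a running H2O instance, it is exercised on a stand-in class, MockRFModel, that only mimics the documented slot names; the values in the mock object are made up.

```r
# Hypothetical helper (not part of the h2o package): pull the documented
# components out of a fitted model's `model` slot.
summarize_rf <- function(rf) {
  m <- rf@model
  list(
    ntree     = m$ntree,
    mean_mse  = mean(m$mse),   # mse holds one value per tree
    confusion = m$confusion
  )
}

# Stand-in S4 object with the same slot names as the documented return value.
# The numbers below are invented purely to exercise the helper.
setClass("MockRFModel",
         representation(key = "character", data = "list", model = "list"))
mock <- new("MockRFModel",
            key = "rf_model",
            data = list(),
            model = list(ntree = 3,
                         mse = c(0.10, 0.08, 0.06),
                         confusion = matrix(c(5, 1, 0, 4), nrow = 2)))
summarize_rf(mock)$ntree
```

With a real fit, the same call would be summarize_rf(h2o.randomForest(...)), assuming the returned object exposes the slots as described.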

Details

IMPORTANT: Currently, to run random forest with version = 1, you must import data to a ValueArray object using h2o.importFile.VA, h2o.importFolder.VA or one of their variants. To run with version = 2, you must import data to a FluidVecs object using h2o.importFile.FV, h2o.importFolder.FV or one of their variants.
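
One way to keep the import function and the version argument in step is to choose both from a single value. The wrapper below is a sketch: rf_for_version and the key "rf_input.hex" are hypothetical, and it assumes that h2o.importFile.VA and h2o.importFile.FV accept the same client, path, and key arguments that h2o.importFile takes in the Examples section.

```r
# Hypothetical wrapper (not part of the h2o package). Assumes the .VA and .FV
# importers take the same (client, path, key) arguments as h2o.importFile.
rf_for_version <- function(client, path, x, y, version = 2) {
  # Reject anything other than the two documented versions before calling H2O.
  stopifnot(version %in% c(1, 2))

  # version = 1 needs ValueArray data; version = 2 needs FluidVecs data.
  importer <- if (version == 1) h2o::h2o.importFile.VA else h2o::h2o.importFile.FV
  parsed <- importer(client, path = path, key = "rf_input.hex")

  h2o::h2o.randomForest(x = x, y = y, data = parsed, version = version)
}
```

A call would look like rf_for_version(localH2O, irisPath, x = c(2,3,4), y = 5, version = 1), with localH2O and irisPath created as in the Examples section.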

Examples

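# Run an RF model on iris data
library(h2o)
# Connect to H2O on localhost, starting it if no running instance is found
localH2O = h2o.init(ip = "localhost", port = 54321, startH2O = TRUE)
# Import the iris CSV that ships with the h2o package
irisPath = system.file("extdata", "iris.csv", package = "h2o")
iris.hex = h2o.importFile(localH2O, path = irisPath, key = "iris.hex")
# Predict column 5 from columns 2 to 4, using 50 trees with depth up to 100
h2o.randomForest(y = 5, x = c(2,3,4), data = iris.hex, ntree = 50, depth = 100)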
