ml_random_forest: Spark ML -- Random Forests

Description

Perform regression or classification using random forests with a Spark DataFrame.

Usage

ml_random_forest(x, response, features, max.bins = 32L, max.depth = 5L,
  num.trees = 20L, type = c("auto", "regression", "classification"),
  ml.options = ml_options(), ...)
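
Example

A minimal sketch of a call that uses the signature above. It assumes a local Spark installation, a sparklyr version that still accepts the dotted argument names shown in Usage, and that dplyr is available for copy_to. The iris data set and the chosen columns are for illustration only.

```r
library(sparklyr)
library(dplyr)  # provides the copy_to() generic

# Assumption: Spark is installed locally (e.g. via spark_install()).
sc <- spark_connect(master = "local")

# Copy iris into Spark; sparklyr replaces "." in column names with "_".
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

# Fit a classification forest using the argument names documented above.
fit <- ml_random_forest(
  iris_tbl,
  response  = "Species",
  features  = c("Petal_Width", "Petal_Length"),
  max.bins  = 32L,
  max.depth = 5L,
  num.trees = 20L,
  type      = "classification"
)

print(fit)

spark_disconnect(sc)
```

The values passed for max.bins, max.depth, and num.trees are the documented defaults, written out only to show where they go.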

Arguments

x

An object coercible to a Spark DataFrame (typically, a tbl_spark).

response

The name of the response vector (as a length-one character vector), or a formula giving a symbolic description of the model to be fitted. When response is a formula, it takes precedence over the other parameters in setting the response, features, and intercept (if available). Currently, only simple linear combinations of existing parameters are supported; e.g. response ~ feature1 + feature2 + .... The intercept term can be omitted by using - 1 in the model formula.
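
To make the formula form concrete, the sketch below uses base R's own formula helpers to show which pieces a formula such as Species ~ Petal_Width + Petal_Length carries. It illustrates the decomposition only and is not sparklyr's internal code; the column names are assumed for the example.

```r
# Illustration only: base R formula helpers, not sparklyr internals.
f <- Species ~ Petal_Width + Petal_Length
tt <- terms(f)

response_name <- all.vars(f)[1]              # left-hand side variable
feature_names <- attr(tt, "term.labels")     # right-hand side terms
has_intercept <- attr(tt, "intercept") == 1  # TRUE unless "- 1" is used

# Adding "- 1" drops the intercept but leaves the feature terms unchanged.
f_no_int <- Species ~ Petal_Width + Petal_Length - 1
no_intercept <- attr(terms(f_no_int), "intercept") == 0
```

Here response_name corresponds to what would otherwise be passed as response, and feature_names to what would be passed as features.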

features

The names of the features (terms) to use for the model fit.

max.bins

The maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity.

max.depth

Maximum depth of the tree (>= 0); that is, the maximum number of nodes separating any leaves from the root of the tree.

num.trees

Number of trees to train (>= 1).

type

The type of model to fit. "regression" treats the response as a continuous variable, while "classification" treats the response as a categorical variable. When "auto" is used, the model type is inferred based on the response variable type -- if it is a numeric type, then regression is used; classification otherwise.
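
The "auto" rule can be sketched on ordinary R vectors as follows. The helper infer_model_type is hypothetical and exists only for this illustration; the actual function inspects the response column of the Spark DataFrame rather than a local vector.

```r
# Hypothetical helper: mirrors the documented "auto" rule on local R vectors.
infer_model_type <- function(response,
                             type = c("auto", "regression", "classification")) {
  type <- match.arg(type)
  if (type != "auto") {
    return(type)  # an explicit choice always wins
  }
  if (is.numeric(response)) "regression" else "classification"
}

numeric_case  <- infer_model_type(c(1.5, 2.0, 3.7))
factor_case   <- infer_model_type(factor(c("a", "b")))
explicit_case <- infer_model_type(c(0, 1, 1), type = "classification")
```

One consequence of the rule: class labels stored as numbers (for example 0/1) are treated as a regression target under "auto", so type = "classification" should be passed explicitly in that case.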

ml.options

Optional arguments, used to affect the model generated. See ml_options for more details.

...

Optional arguments. The data argument can be used to specify the data to be used when x is a formula; this allows calls of the form ml_random_forest(y ~ x, data = tbl), and is especially useful in conjunction with do.
