# ml_decision_tree

##### Spark ML -- Decision Trees

Perform regression or classification using decision trees.

##### Usage

```
ml_decision_tree(x, response, features, impurity = c("auto", "gini",
"entropy", "variance"), max.bins = 32L, max.depth = 5L,
min.info.gain = 0, min.rows = 1L, type = c("auto", "regression",
"classification"), thresholds = NULL, seed = NULL,
checkpoint.interval = 10L, cache.node.ids = FALSE, max.memory = 256L,
ml.options = ml_options(), ...)
```

##### Arguments

- x
An object coercible to a Spark DataFrame (typically, a `tbl_spark`).

- response
The name of the response vector (as a length-one character vector), or a formula giving a symbolic description of the model to be fitted. When `response` is a formula, it is used in preference to other parameters to set the `response`, `features`, and `intercept` parameters (if available). Currently, only simple linear combinations of existing parameters are supported; e.g. `response ~ feature1 + feature2 + ...`. The intercept term can be omitted by using `- 1` in the model fit.

- features
The names of the features (terms) to use for the model fit.

- impurity
Criterion used for information gain calculation. One of 'auto', 'gini', 'entropy', or 'variance'. 'auto' defaults to 'gini' for classification and 'variance' for regression.

- max.bins
The maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity.

- max.depth
Maximum depth of the tree (>= 0); that is, the maximum number of nodes separating any leaves from the root of the tree.

- min.info.gain
Minimum information gain for a split to be considered at a tree node. Should be >= 0, defaults to 0.

- min.rows
Minimum number of instances each child must have after a split.

- type
The type of model to fit. `"regression"` treats the response as a continuous variable, while `"classification"` treats the response as a categorical variable. When `"auto"` is used, the model type is inferred based on the type of the response variable: if it is numeric, regression is used; otherwise, classification.

- thresholds
Thresholds used in multi-class classification to adjust the probability of predicting each class. The vector must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value of p/t is predicted, where p is the original probability of that class and t is the class's threshold.

- seed
Seed for random numbers.

- checkpoint.interval
Set the checkpoint interval (>= 1), or -1 to disable checkpointing. For example, 10 means the cache will get checkpointed every 10 iterations. Defaults to 10.

- cache.node.ids
If `FALSE`, the algorithm passes trees to executors to match instances with nodes. If `TRUE`, the algorithm caches node IDs for each instance, which can speed up training of deeper trees. Defaults to `FALSE`.

- max.memory
Maximum memory in MB allocated to histogram aggregation. If too small, then 1 node will be split per iteration, and its aggregates may exceed this size. Defaults to 256.

- ml.options
Optional arguments, used to affect the model generated. See `ml_options` for more details.

- ...
Optional arguments. The `data` argument can be used to specify the data to be used when `x` is a formula; this allows calls of the form `ml_decision_tree(y ~ x, data = tbl)`, and is especially useful in conjunction with `do`.
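
The `thresholds` rule above can be illustrated with a few lines of base R. This is a sketch of the selection rule only, not sparklyr's internal implementation; the probabilities and thresholds below are made up for illustration:

```r
# Hypothetical class probabilities p from the model and user-supplied
# per-class thresholds t (all values > 0 here).
p <- c(0.5, 0.3, 0.2)
t <- c(0.6, 0.3, 0.1)

# The predicted class maximizes p / t.
which.max(p / t)
# returns 3: the ratios are 0.833, 1.0, and 2.0, so class 3 wins
# even though class 1 has the highest raw probability.
```

Lowering a class's threshold thus makes the model more willing to predict that class.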

##### See Also

Other Spark ML routines: `ml_als_factorization`, `ml_generalized_linear_regression`, `ml_gradient_boosted_trees`, `ml_kmeans`, `ml_lda`, `ml_linear_regression`, `ml_logistic_regression`, `ml_multilayer_perceptron`, `ml_naive_bayes`, `ml_one_vs_rest`, `ml_pca`, `ml_random_forest`, `ml_survival_regression`
*Documentation reproduced from package sparklyr, version 0.6.4, License: Apache License 2.0 | file LICENSE*