boost_tree() is a way to generate a specification of a model before fitting and allows the model to be created using different packages in R or via Spark. The main arguments for the model are:
mtry: The number of predictors that will be randomly sampled at each split when creating the tree models.
trees: The number of trees contained in the ensemble.
min_n: The minimum number of data points in a node that are required for the node to be split further.
tree_depth: The maximum depth of the tree (i.e. number of splits).
learn_rate: The rate at which the boosting algorithm adapts from iteration to iteration.
loss_reduction: The reduction in the loss function required to split further.
sample_size: The amount of data exposed to the fitting routine.
These arguments are converted to their specific names at the time that the model is fit. Other options and arguments can be set using the set_engine() function. If left to their defaults here (NULL), the values are taken from the underlying model functions. If parameters need to be modified, update() can be used in lieu of recreating the object from scratch.
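For example, a minimal sketch of setting an engine-specific option with set_engine() and later revising a main argument with update(). The nthread argument and the chosen values are only illustrative, and the sketch assumes the xgboost package is installed:

library(parsnip)

# Engine-specific options (e.g. xgboost's nthread) are passed through set_engine()
spec <- boost_tree(trees = 200, learn_rate = 0.05) %>%
  set_engine("xgboost", nthread = 2) %>%
  set_mode("regression")

# Main arguments can be revised without rebuilding the specification
spec <- update(spec, trees = 500)
spec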
boost_tree(
  mode = "unknown",
  mtry = NULL,
  trees = NULL,
  min_n = NULL,
  tree_depth = NULL,
  learn_rate = NULL,
  loss_reduction = NULL,
  sample_size = NULL
)

# S3 method for boost_tree
update(
  object,
  parameters = NULL,
  mtry = NULL,
  trees = NULL,
  min_n = NULL,
  tree_depth = NULL,
  learn_rate = NULL,
  loss_reduction = NULL,
  sample_size = NULL,
  fresh = FALSE,
  ...
)
mode: A single character string for the type of model. Possible values for this model are "unknown", "regression", or "classification".
mtry: A number for the number (or proportion) of predictors that will be randomly sampled at each split when creating the tree models (xgboost only).
trees: An integer for the number of trees contained in the ensemble.
min_n: An integer for the minimum number of data points in a node that are required for the node to be split further.
tree_depth: An integer for the maximum depth of the tree (i.e. number of splits) (xgboost only).
learn_rate: A number for the rate at which the boosting algorithm adapts from iteration to iteration (xgboost only).
loss_reduction: A number for the reduction in the loss function required to split further (xgboost only).
sample_size: A number for the number (or proportion) of data that is exposed to the fitting routine. For xgboost, the sampling is done at each iteration while C5.0 samples once during training.
object: A boosted tree model specification.
parameters: A 1-row tibble or named list with main parameters to update. If the individual arguments are used, these will supersede the values in parameters. Also, using engine arguments in this object will result in an error.
fresh: A logical for whether the arguments should be modified in-place or replaced wholesale.
...: Not used for update().
Value: An updated model specification.
Engines may have pre-set default arguments when executing the model fit call. For this type of model, the templates of the fit calls are below:
boost_tree() %>% set_engine("xgboost") %>% set_mode("regression") %>% translate()
## Boosted Tree Model Specification (regression)
##
## Computational engine: xgboost
##
## Model fit template:
## parsnip::xgb_train(x = missing_arg(), y = missing_arg(), nthread = 1,
##     verbose = 0)
boost_tree() %>% set_engine("xgboost") %>% set_mode("classification") %>% translate()
## Boosted Tree Model Specification (classification)
##
## Computational engine: xgboost
##
## Model fit template:
## parsnip::xgb_train(x = missing_arg(), y = missing_arg(), nthread = 1,
##     verbose = 0)
boost_tree() %>% set_engine("C5.0") %>% set_mode("classification") %>% translate()
## Boosted Tree Model Specification (classification)
##
## Computational engine: C5.0
##
## Model fit template:
## parsnip::C5.0_train(x = missing_arg(), y = missing_arg(), weights = missing_arg())
Note that C50::C5.0() does not require factor predictors to be converted to indicator variables.
boost_tree() %>% set_engine("spark") %>% set_mode("regression") %>% translate()
## Boosted Tree Model Specification (regression)
##
## Computational engine: spark
##
## Model fit template:
## sparklyr::ml_gradient_boosted_trees(x = missing_arg(), formula = missing_arg(),
##     type = "regression", seed = sample.int(10^5, 1))
boost_tree() %>% set_engine("spark") %>% set_mode("classification") %>% translate()
## Boosted Tree Model Specification (classification)
##
## Computational engine: spark
##
## Model fit template:
## sparklyr::ml_gradient_boosted_trees(x = missing_arg(), formula = missing_arg(),
##     type = "classification", seed = sample.int(10^5, 1))
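As a hedged illustration of using one of these templates end to end (assuming the xgboost package is installed; the mtcars data set and the argument values are chosen only for illustration), a specification can be fit with fit() and then used for prediction:

library(parsnip)

# Fit a regression boosted tree to mtcars with the default xgboost engine
bt_fit <- boost_tree(trees = 50, mode = "regression") %>%
  set_engine("xgboost") %>%
  fit(mpg ~ ., data = mtcars)

# Predict on a few rows of the training data
predict(bt_fit, new_data = mtcars[1:3, ])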
The standardized parameter names in parsnip can be mapped to their original names in each engine that has main parameters:
parsnip | xgboost | C5.0 | spark |
tree_depth | max_depth | NA | max_depth |
trees | nrounds | trials | max_iter |
learn_rate | eta | NA | step_size |
mtry | colsample_bytree | NA | feature_subset_strategy |
min_n | min_child_weight | minCases | min_instances_per_node |
loss_reduction | gamma | NA | min_info_gain |
sample_size | subsample | sample | subsampling_rate |
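For instance, a small sketch of this renaming (assuming xgboost is installed; the value 100 is arbitrary), where the parsnip argument trees should appear under its engine-specific name in the fit template:

library(parsnip)

# trees = 100 is translated to xgboost's nrounds = 100 in the fit template
boost_tree(trees = 100) %>%
  set_engine("xgboost") %>%
  set_mode("regression") %>%
  translate()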
The data given to the function are not saved and are only used to determine the mode of the model. For boost_tree(), the possible modes are "regression" and "classification".
The model can be created with the fit() function using the following engines:
R: "xgboost" (the default), "C5.0"
Spark: "spark"
# NOT RUN {
boost_tree(mode = "classification", trees = 20)
# Parameters can be represented by a placeholder:
boost_tree(mode = "regression", mtry = varying())
model <- boost_tree(mtry = 10, min_n = 3)
model
update(model, mtry = 1)
update(model, mtry = 1, fresh = TRUE)
param_values <- tibble::tibble(mtry = 10, tree_depth = 5)
model %>% update(param_values)
model %>% update(param_values, mtry = 3)
param_values$verbose <- 0
# Fails due to engine argument
# model %>% update(param_values)
# }