boost_tree() is a way to generate a specification of a model before fitting and allows the model to be created using different packages in R or via Spark. The main arguments for the model are:
mtry: The number of predictors that will be randomly sampled at each split when creating the tree models.
trees: The number of trees contained in the ensemble.
min_n: The minimum number of data points in a node that are required for the node to be split further.
tree_depth: The maximum depth of the tree (i.e. number of splits).
learn_rate: The rate at which the boosting algorithm adapts from iteration to iteration.
loss_reduction: The reduction in the loss function required to split further.
sample_size: The amount of data exposed to the fitting routine.
These arguments are converted to their specific names at the time that the model is fit. Other options and arguments can be set using the set_engine() function. If left to their defaults here (NULL), the values are taken from the underlying model functions. If parameters need to be modified, update() can be used in lieu of recreating the object from scratch.
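For example, a minimal sketch of setting an engine-specific option with set_engine() and later revising a main argument with update(). The nthread argument and the chosen values are only illustrative, and the sketch assumes the xgboost package is installed:

library(parsnip)

# Engine-specific options (e.g. xgboost's nthread) are passed through set_engine()
spec <- boost_tree(trees = 200, learn_rate = 0.05) %>%
  set_engine("xgboost", nthread = 2) %>%
  set_mode("regression")

# Main arguments can be revised without rebuilding the specification
spec <- update(spec, trees = 500)
spec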
boost_tree(
  mode = "unknown",
  mtry = NULL,
  trees = NULL,
  min_n = NULL,
  tree_depth = NULL,
  learn_rate = NULL,
  loss_reduction = NULL,
  sample_size = NULL
)

# S3 method for boost_tree
update(
  object,
  parameters = NULL,
  mtry = NULL,
  trees = NULL,
  min_n = NULL,
  tree_depth = NULL,
  learn_rate = NULL,
  loss_reduction = NULL,
  sample_size = NULL,
  fresh = FALSE,
  ...
)
mode: A single character string for the type of model. Possible values for this model are "unknown", "regression", or "classification".
mtry: A number for the number (or proportion) of predictors that will be randomly sampled at each split when creating the tree models (xgboost only).
trees: An integer for the number of trees contained in the ensemble.
min_n: An integer for the minimum number of data points in a node that are required for the node to be split further.
tree_depth: An integer for the maximum depth of the tree (i.e. number of splits) (xgboost only).
learn_rate: A number for the rate at which the boosting algorithm adapts from iteration to iteration (xgboost only).
loss_reduction: A number for the reduction in the loss function required to split further (xgboost only).
sample_size: A number for the number (or proportion) of data that is exposed to the fitting routine. For xgboost, the sampling is done at each iteration while C5.0 samples once during training.
object: A boosted tree model specification.
parameters: A 1-row tibble or named list with main parameters to update. If the individual arguments are used, these will supersede the values in parameters. Also, using engine arguments in this object will result in an error.
fresh: A logical for whether the arguments should be modified in-place or replaced wholesale.
...: Not used for update().
Value: An updated model specification.
Engines may have pre-set default arguments when executing the model fit call. For this type of model, the templates of the fit calls are below:
boost_tree() %>% set_engine("xgboost") %>% set_mode("regression") %>% translate()
## Boosted Tree Model Specification (regression)
##
## Computational engine: xgboost
##
## Model fit template:
## parsnip::xgb_train(x = missing_arg(), y = missing_arg(), nthread = 1,
##     verbose = 0)
boost_tree() %>% set_engine("xgboost") %>% set_mode("classification") %>% translate()
## Boosted Tree Model Specification (classification)
##
## Computational engine: xgboost
##
## Model fit template:
## parsnip::xgb_train(x = missing_arg(), y = missing_arg(), nthread = 1,
##     verbose = 0)
boost_tree() %>% set_engine("C5.0") %>% set_mode("classification") %>% translate()
## Boosted Tree Model Specification (classification)
##
## Computational engine: C5.0
##
## Model fit template:
## parsnip::C5.0_train(x = missing_arg(), y = missing_arg(), weights = missing_arg())
Note that C50::C5.0() does not require factor predictors to be converted to indicator variables.
boost_tree() %>% set_engine("spark") %>% set_mode("regression") %>% translate()
## Boosted Tree Model Specification (regression)
##
## Computational engine: spark
##
## Model fit template:
## sparklyr::ml_gradient_boosted_trees(x = missing_arg(), formula = missing_arg(),
##     type = "regression", seed = sample.int(10^5, 1))
boost_tree() %>% set_engine("spark") %>% set_mode("classification") %>% translate()
## Boosted Tree Model Specification (classification)
##
## Computational engine: spark
##
## Model fit template:
## sparklyr::ml_gradient_boosted_trees(x = missing_arg(), formula = missing_arg(),
##     type = "classification", seed = sample.int(10^5, 1))
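As a hedged illustration of using one of these templates end to end (assuming the xgboost package is installed; the mtcars data set and the argument values are chosen only for illustration), a specification can be fit with fit() and then used for prediction:

library(parsnip)

# Fit a regression boosted tree to mtcars with the default xgboost engine
bt_fit <- boost_tree(trees = 50, mode = "regression") %>%
  set_engine("xgboost") %>%
  fit(mpg ~ ., data = mtcars)

# Predict on a few rows of the training data
predict(bt_fit, new_data = mtcars[1:3, ])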
The standardized parameter names in parsnip can be mapped to their original names in each engine that has main parameters:
parsnip | xgboost | C5.0 | spark |
tree_depth | max_depth | NA | max_depth |
trees | nrounds | trials | max_iter |
learn_rate | eta | NA | step_size |
mtry | colsample_bytree | NA | feature_subset_strategy |
min_n | min_child_weight | minCases | min_instances_per_node |
loss_reduction | gamma | NA | min_info_gain |
sample_size | subsample | sample | subsampling_rate |
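For instance, a small sketch of this renaming (assuming xgboost is installed; the value 100 is arbitrary), where the parsnip argument trees should appear under its engine-specific name in the fit template:

library(parsnip)

# trees = 100 is translated to xgboost's nrounds = 100 in the fit template
boost_tree(trees = 100) %>%
  set_engine("xgboost") %>%
  set_mode("regression") %>%
  translate()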
The data given to the function are not saved and are only used to determine the mode of the model. For boost_tree(), the possible modes are "regression" and "classification".
The model can be created with the fit() function using the following engines:
R: "xgboost" (the default), "C5.0"
Spark: "spark"
# NOT RUN {
boost_tree(mode = "classification", trees = 20)
# Parameters can be represented by a placeholder:
boost_tree(mode = "regression", mtry = varying())
model <- boost_tree(mtry = 10, min_n = 3)
model
update(model, mtry = 1)
update(model, mtry = 1, fresh = TRUE)
param_values <- tibble::tibble(mtry = 10, tree_depth = 5)
model %>% update(param_values)
model %>% update(param_values, mtry = 3)
param_values$verbose <- 0
# Fails due to engine argument
# model %>% update(param_values)
# }