ml_gbt_classifier
Spark ML -- Gradient Boosted Trees
Perform classification and regression using gradient boosted trees.
Usage
ml_gbt_classifier(x, formula = NULL, max_iter = 20L, max_depth = 5L,
step_size = 0.1, subsampling_rate = 1, min_instances_per_node = 1L,
max_bins = 32L, min_info_gain = 0, loss_type = "logistic",
seed = NULL, thresholds = NULL, checkpoint_interval = 10L,
cache_node_ids = FALSE, max_memory_in_mb = 256L,
features_col = "features", label_col = "label",
prediction_col = "prediction", probability_col = "probability",
raw_prediction_col = "rawPrediction",
uid = random_string("gbt_classifier_"), ...)
ml_gradient_boosted_trees(x, formula = NULL, type = c("auto", "regression",
"classification"), features_col = "features", label_col = "label",
prediction_col = "prediction", probability_col = "probability",
raw_prediction_col = "rawPrediction", checkpoint_interval = 10L,
loss_type = c("auto", "logistic", "squared", "absolute"), max_bins = 32L,
max_depth = 5L, max_iter = 20L, min_info_gain = 0,
min_instances_per_node = 1L, step_size = 0.1, subsampling_rate = 1,
seed = NULL, thresholds = NULL, cache_node_ids = FALSE,
max_memory_in_mb = 256L, uid = random_string("gradient_boosted_trees_"),
response = NULL, features = NULL, ...)
ml_gbt_regressor(x, formula = NULL, max_iter = 20L, max_depth = 5L,
step_size = 0.1, subsampling_rate = 1, min_instances_per_node = 1L,
max_bins = 32L, min_info_gain = 0, loss_type = "squared", seed = NULL,
checkpoint_interval = 10L, cache_node_ids = FALSE,
max_memory_in_mb = 256L, features_col = "features", label_col = "label",
prediction_col = "prediction", uid = random_string("gbt_regressor_"), ...)
Arguments
- x
A spark_connection, ml_pipeline, or a tbl_spark.
- formula
Used when x is a tbl_spark. R formula as a character string or a formula. This is used to transform the input dataframe before fitting; see ft_r_formula for details.
- max_iter
Maximum number of iterations.
- max_depth
Maximum depth of the tree (>= 0); that is, the maximum number of nodes separating any leaves from the root of the tree.
- step_size
Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator. (default = 0.1)
- subsampling_rate
Fraction of the training data used for learning each decision tree, in range (0, 1]. (default = 1.0)
- min_instances_per_node
Minimum number of instances each child must have after split.
- max_bins
The maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity.
- min_info_gain
Minimum information gain for a split to be considered at a tree node. Should be >= 0, defaults to 0.
- loss_type
Loss function which GBT tries to minimize. Supported: "squared" (L2) and "absolute" (L1) (default = squared) for regression and "logistic" (default) for classification. For ml_gradient_boosted_trees, setting "auto" will default to the appropriate loss type based on model type.
- seed
Seed for random numbers.
- thresholds
Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value of p/t is predicted, where p is the original probability of that class and t is the class's threshold.
- checkpoint_interval
Set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations, defaults to 10.
- cache_node_ids
If FALSE, the algorithm will pass trees to executors to match instances with nodes. If TRUE, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Defaults to FALSE.
- max_memory_in_mb
Maximum memory in MB allocated to histogram aggregation. If too small, then 1 node will be split per iteration, and its aggregates may exceed this size. Defaults to 256.
- features_col
Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by ft_r_formula.
- label_col
Label column name. The column should be a numeric column. Usually this column is output by ft_r_formula.
- prediction_col
Prediction column name.
- probability_col
Column name for predicted class conditional probabilities.
- raw_prediction_col
Raw prediction (a.k.a. confidence) column name.
- uid
A character string used to uniquely identify the ML estimator.
- ...
Optional arguments; see Details.
- type
The type of model to fit. "regression" treats the response as a continuous variable, while "classification" treats the response as a categorical variable. When "auto" is used, the model type is inferred based on the response variable type -- if it is a numeric type, then regression is used; classification otherwise.
- response
(Deprecated) The name of the response column (as a length-one character vector.)
- features
(Deprecated) The name of features (terms) to use for the model fit.
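The p/t rule described under thresholds can be illustrated locally in plain R, without Spark. The helper below is hypothetical (it is not part of sparklyr) and is only a sketch of the rule as worded above; it assumes strictly positive thresholds and leaves out the case where one threshold is 0.

```r
# Hypothetical helper, not a sparklyr function: applies the p/t rule to one
# row of class probabilities and returns the 1-based position of the winner.
predict_with_thresholds <- function(probabilities, thresholds) {
  stopifnot(length(probabilities) == length(thresholds))
  stopifnot(all(thresholds > 0))  # zero thresholds are not handled in this sketch
  scaled <- probabilities / thresholds  # p/t for each class
  which.max(scaled)
}

probs <- c(0.6, 0.4)

# Equal thresholds leave the ranking unchanged: 0.6/0.5 = 1.2 vs 0.4/0.5 = 0.8.
equal_choice <- predict_with_thresholds(probs, c(0.5, 0.5))

# A low threshold on the second class favours it: 0.6/0.8 = 0.75 vs 0.4/0.2 = 2.
skewed_choice <- predict_with_thresholds(probs, c(0.8, 0.2))
```

With equal thresholds the first class is chosen; lowering the second class's threshold makes the second class win even though its raw probability is smaller.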
Details
When x is a tbl_spark and formula (alternatively, response and features) is specified, the function returns a ml_model object wrapping a ml_pipeline_model which contains data pre-processing transformers, the ML predictor, and, for classification models, a post-processing transformer that converts predictions into class labels. For classification, an optional argument predicted_label_col (defaults to "predicted_label") can be used to specify the name of the predicted label column. In addition to the fitted ml_pipeline_model, ml_model objects also contain a ml_pipeline object where the ML predictor stage is an estimator ready to be fit against data. This is utilized by ml_save with type = "pipeline" to facilitate model refresh workflows.
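A minimal sketch of the formula workflow, assuming a local Spark installation that sparklyr can connect to. The dataset (two species from iris), the table name, the column name "predicted_species", and the save path are illustrative choices and do not come from this page.

```r
library(sparklyr)

# Assumption: a local Spark installation is available to sparklyr.
sc <- spark_connect(master = "local")

# Spark's GBT classifier handles binary labels, so keep two of the three species.
iris_two <- subset(iris, Species != "setosa")
iris_tbl <- sdf_copy_to(sc, iris_two, name = "iris_two", overwrite = TRUE)

# tbl_spark plus formula: the result is an ml_model wrapping a fitted pipeline.
# sparklyr replaces dots in column names with underscores on copy.
gbt_model <- ml_gbt_classifier(
  iris_tbl,
  Species ~ Petal_Length + Petal_Width,
  max_iter = 10L,
  seed = 42L,
  predicted_label_col = "predicted_species"  # optional argument passed via ...
)

# Score the training data; the output is a tbl_spark with prediction columns.
predictions <- ml_predict(gbt_model, iris_tbl)

# Save the unfitted pipeline (not the fitted pipeline model) for later refits.
pipeline_path <- file.path(tempdir(), "gbt_pipeline")
ml_save(gbt_model, path = pipeline_path, overwrite = TRUE, type = "pipeline")

spark_disconnect(sc)
```

Saving with type = "pipeline" keeps the estimator stage unfitted, so the same pre-processing and predictor settings can be fit again on fresh data.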
ml_gradient_boosted_trees is a wrapper around ml_gbt_regressor.tbl_spark and ml_gbt_classifier.tbl_spark and calls the appropriate method based on model type.
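The dispatch performed by the wrapper can be sketched as follows, assuming a local Spark installation. The data and table name are illustrative; the behaviour of type = "auto" follows the description of the type argument above.

```r
library(sparklyr)

# Assumption: a local Spark installation is available to sparklyr.
sc <- spark_connect(master = "local")

# Two-class subset of iris; dots in column names become underscores in Spark.
iris_two <- subset(iris, Species != "setosa")
iris_tbl <- sdf_copy_to(sc, iris_two, name = "iris_two", overwrite = TRUE)

# Numeric response with type = "auto": regression is used.
reg_model <- ml_gradient_boosted_trees(
  iris_tbl, Sepal_Length ~ Petal_Length + Petal_Width,
  type = "auto", max_iter = 10L
)

# Non-numeric (string) response with type = "auto": classification is used.
clf_model <- ml_gradient_boosted_trees(
  iris_tbl, Species ~ Petal_Length + Petal_Width,
  type = "auto", max_iter = 10L
)

spark_disconnect(sc)
```

Setting type explicitly to "regression" or "classification" skips the inference and is the safer choice when a numeric column actually encodes classes.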
Value
The object returned depends on the class of x.
- spark_connection: When x is a spark_connection, the function returns an instance of a ml_predictor object. The object contains a pointer to a Spark Predictor object and can be used to compose Pipeline objects.
- ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the predictor appended to the pipeline.
- tbl_spark: When x is a tbl_spark, a predictor is constructed then immediately fit with the input tbl_spark, returning a prediction model.
- tbl_spark, with formula specified: When formula is specified, the input tbl_spark is first transformed using a RFormula transformer before being fit by the predictor. The object returned in this case is a ml_model which is a wrapper of a ml_pipeline_model.
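The cases above can be compared side by side. This is a sketch assuming a local Spark installation; the mtcars data, the table name, and the hyperparameter values are illustrative only.

```r
library(sparklyr)

# Assumption: a local Spark installation is available to sparklyr.
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

# x is a spark_connection: an unfitted predictor is returned.
gbt_estimator <- ml_gbt_regressor(sc, max_iter = 10L, max_depth = 3L)

# x is an ml_pipeline: the predictor is appended as the last stage.
gbt_pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(mpg ~ wt + hp) %>%
  ml_gbt_regressor(max_iter = 10L, max_depth = 3L)

# The pipeline is still unfitted; ml_fit() produces a pipeline model.
fitted_pipeline <- ml_fit(gbt_pipeline, mtcars_tbl)

# x is a tbl_spark with a formula: fitting happens immediately and an
# ml_model is returned.
gbt_model <- ml_gbt_regressor(
  mtcars_tbl, mpg ~ wt + hp,
  max_iter = 10L, max_depth = 3L
)

# Inspect what each call returned.
class(gbt_estimator)
class(gbt_pipeline)
class(gbt_model)

spark_disconnect(sc)
```

The pipeline form is useful when the same stages should be fit on several datasets, while the tbl_spark form is the shortest path to a fitted model.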
See Also
See http://spark.apache.org/docs/latest/ml-classification-regression.html for more information on the set of supervised learning algorithms.
Other ml algorithms: ml_aft_survival_regression, ml_decision_tree_classifier, ml_generalized_linear_regression, ml_isotonic_regression, ml_linear_regression, ml_linear_svc, ml_logistic_regression, ml_multilayer_perceptron_classifier, ml_naive_bayes, ml_one_vs_rest, ml_random_forest_classifier