ml-tuning
Spark ML -- Tuning
Perform hyper-parameter tuning using either K-fold cross validation or train-validation split.
Usage
ml_sub_models(model)
ml_validation_metrics(model)
ml_cross_validator(x, estimator, estimator_param_maps, evaluator,
num_folds = 3L, collect_sub_models = FALSE, parallelism = 1L,
seed = NULL, uid = random_string("cross_validator_"), ...)
ml_train_validation_split(x, estimator, estimator_param_maps, evaluator,
train_ratio = 0.75, collect_sub_models = FALSE, parallelism = 1L,
seed = NULL, uid = random_string("train_validation_split_"), ...)
Arguments
- model
A cross validation or train-validation-split model.
- x
A spark_connection, ml_pipeline, or a tbl_spark.
- estimator
An ml_estimator object.
- estimator_param_maps
A named list of stages and hyper-parameter sets to tune. See details.
- evaluator
An ml_evaluator object, see ml_evaluator.
- num_folds
Number of folds for cross validation. Must be >= 2. Default: 3
- collect_sub_models
Whether to collect a list of sub-models trained during tuning. If set to FALSE, then only the single best sub-model will be available after fitting. If set to TRUE, then all sub-models will be available. Warning: For large models, collecting all sub-models can cause out-of-memory errors (OOMs) on the Spark driver.
- parallelism
The number of threads to use when running parallel algorithms. Default is 1 for serial execution.
- seed
A random seed. Set this value if you need your results to be reproducible across repeated calls.
- uid
A character string used to uniquely identify the ML estimator.
- ...
Optional arguments; currently unused.
- train_ratio
Fraction of the data used for training; the remainder is used for validation. Must be between 0 and 1. Default: 0.75
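The shape of estimator_param_maps can be sketched in base R as a named list of named lists. In this sketch the outer name random_forest is an assumption: it is meant to refer to a ml_random_forest_classifier() stage in the estimator. The inner names are arguments of that function, and the values are the candidates to try. The call to expand.grid() is only an illustration of how many combinations the grid implies; it is not what sparklyr does internally.

```r
# A hypothetical grid for a pipeline containing a random forest classifier stage.
# Outer name: the stage to tune (assumed to match the stage in the estimator).
# Inner names: hyper-parameters of that stage; values: candidates to try.
grid <- list(
  random_forest = list(
    num_trees = c(5, 10),
    max_depth = c(5, 10),
    impurity = c("entropy", "gini")
  )
)

# Enumerate every combination to see how large the search is.
combos <- expand.grid(grid$random_forest, stringsAsFactors = FALSE)
nrow(combos)  # 2 * 2 * 2 = 8 combinations
```

Each combination is fit once per fold, so with num_folds = 3 this grid implies 24 model fits during cross validation, which is worth keeping in mind when choosing parallelism and collect_sub_models.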
Details
ml_cross_validator() performs k-fold cross validation while ml_train_validation_split() performs tuning on one pair of train and validation datasets.
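A minimal sketch of a cross-validation run follows. It assumes the sparklyr package and a working Spark installation. The helper name tune_with_cv, the formula Species ~ . (which presumes a copy of the iris data), and the seed value are illustrative choices, not part of the package. The code is wrapped in a function so that nothing touches Spark until a connection is supplied.

```r
# Sketch only: requires sparklyr and a Spark installation to actually run.
# `tune_with_cv` is a hypothetical helper, not a sparklyr function.
tune_with_cv <- function(sc, tbl, grid, num_folds = 3L) {
  # Build the estimator: a formula transformer followed by a classifier.
  pipeline <- sparklyr::ml_pipeline(sc)
  pipeline <- sparklyr::ft_r_formula(pipeline, Species ~ .)
  pipeline <- sparklyr::ml_random_forest_classifier(pipeline)

  # x is a spark_connection here, so this returns an unfitted cross validator.
  cv <- sparklyr::ml_cross_validator(
    sc,
    estimator = pipeline,
    estimator_param_maps = grid,
    evaluator = sparklyr::ml_multiclass_classification_evaluator(sc),
    num_folds = num_folds,
    seed = 1234L
  )

  # Fit on the Spark table, then return the metrics data frame.
  cv_model <- sparklyr::ml_fit(cv, tbl)
  sparklyr::ml_validation_metrics(cv_model)
}

# Example call (commented out because it needs a live Spark connection):
# sc <- sparklyr::spark_connect(master = "local")
# iris_tbl <- sparklyr::sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# tune_with_cv(sc, iris_tbl, grid = list(random_forest = list(num_trees = c(5, 10))))
```

Swapping ml_cross_validator() for ml_train_validation_split(), with train_ratio in place of num_folds, should give the single-split variant under the same assumptions.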
Value
The object returned depends on the class of x.
- spark_connection: When x is a spark_connection, the function returns an instance of a ml_cross_validator or ml_train_validation_split object.
- ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the tuning estimator appended to the pipeline.
- tbl_spark: When x is a tbl_spark, a tuning estimator is constructed then immediately fit with the input tbl_spark, returning a ml_cross_validation_model or a ml_train_validation_split_model object.
For cross validation, ml_sub_models() returns a nested list of models, where the first layer represents fold indices and the second layer represents param maps. For train-validation split, ml_sub_models() returns a list of models, corresponding to the order of the estimator param maps.
ml_validation_metrics() returns a data frame of performance metrics and hyperparameter combinations.