AutoXGBoostHurdleModel is a generalized hurdle modeling framework that pairs a bucket classifier with per-bucket regression models.
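The core idea of a hurdle model can be sketched in a few lines of R. This is a minimal illustration, not the package's internal code: `hurdle_predict`, `p_exceed`, and `reg_pred` are hypothetical stand-ins for the outputs of the internal classifier and regression models in the simplest two-part case (Buckets = 0L).

```r
# Two-part hurdle prediction sketch: the final prediction is the
# classifier's probability of exceeding the bucket boundary multiplied
# by the conditional regression prediction.
hurdle_predict <- function(p_exceed, reg_pred) {
  p_exceed * reg_pred
}

p_exceed <- c(0.10, 0.85, 0.50)  # P(target > 0) from the classifier
reg_pred <- c(20.0, 40.0, 10.0)  # E[target | target > 0] from the regressor
hurdle_predict(p_exceed, reg_pred)  # expected values: 2, 34, 5
```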
AutoXGBoostHurdleModel(
TreeMethod = "hist",
TrainOnFull = FALSE,
PassInGrid = NULL,
NThreads = max(1L, parallel::detectCores() - 2L),
ModelID = "ModelTest",
Paths = NULL,
MetaDataPaths = NULL,
data,
ValidationData = NULL,
TestData = NULL,
Buckets = 0L,
TargetColumnName = NULL,
FeatureColNames = NULL,
IDcols = NULL,
EncodingMethod = "binary",
TransformNumericColumns = NULL,
SplitRatios = c(0.7, 0.2, 0.1),
SaveModelObjects = FALSE,
ReturnModelObjects = TRUE,
NumOfParDepPlots = 10L,
GridTune = FALSE,
grid_eval_metric = "accuracy",
MaxModelsInGrid = 1L,
BaselineComparison = "default",
MaxRunsWithoutNewWinner = 10L,
MaxRunMinutes = 60L,
Trees = list(classifier = seq(1000, 2000, 100), regression = seq(1000, 2000, 100)),
eta = list(classifier = seq(0.05, 0.4, 0.05), regression = seq(0.05, 0.4, 0.05)),
max_depth = list(classifier = seq(4L, 16L, 2L), regression = seq(4L, 16L, 2L)),
min_child_weight = list(classifier = seq(1, 10, 1), regression = seq(1, 10, 1)),
subsample = list(classifier = seq(0.55, 1, 0.05), regression = seq(0.55, 1, 0.05)),
colsample_bytree = list(classifier = seq(0.55, 1, 0.05), regression = seq(0.55, 1, 0.05))
)
Set to "hist" or "gpu_hist", depending on whether your xgboost installation is capable of GPU processing
Set to TRUE to train the model on 100 percent of the data
Pass in a grid to change the parameter settings for xgboost
Set to the number of threads you would like to dedicate to training
Define a character name for your models
The path to your folder where you want your model information saved
A character string of the file path where you want your model evaluation output saved. If left NULL, all output will be saved to Paths.
Source training data. Do not include a column that has the class labels for the buckets as they are created internally.
Source validation data. Do not include a column that has the class labels for the buckets as they are created internally.
Source test data. Do not include a column that has the class labels for the buckets as they are created internally.
A numeric vector of the buckets used for subsetting the data. NOTE: the final Bucket value will first create a subset of data that is less than the value and a second one thereafter for data greater than the bucket value.
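The bucketing described above can be illustrated with base R. This is a hypothetical sketch of the labeling logic (the package creates these class labels internally, and its exact boundary handling may differ): with Buckets = c(0, 10), records fall into class 0 (target <= 0), class 1 (0 < target <= 10), or class 2 (target > 10).

```r
# Illustrative bucketing of a target vector against bucket boundaries.
# findInterval with left.open = TRUE assigns each value to the interval
# (buckets[i], buckets[i + 1]], mirroring the subsetting described above.
target  <- c(0, 3, 12, 0, 25)
buckets <- c(0, 10)
labels  <- findInterval(target, buckets, left.open = TRUE)
labels  # 0 1 2 0 2
```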
Supply the column name or number for the target variable
Supply the column names or numbers of the features (not including the PrimaryDateColumn)
Includes PrimaryDateColumn and any other columns you want returned in the validation data with predictions
Choose from 'binary', 'poly_encode', 'backward_difference', 'helmert' for multiclass cases and additionally 'm_estimator', 'credibility', 'woe', 'target_encoding' for classification use cases.
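To make the encoding options concrete, here is a minimal sketch of one of the listed methods, 'target_encoding', using only base R. This illustrates the general technique (replacing each categorical level with the mean of the target within that level), not the package's internal implementation; `df` and `enc` are hypothetical names.

```r
# Target encoding sketch: each level of a categorical column is replaced
# by the mean of the target variable within that level.
df <- data.frame(
  group  = c("a", "a", "b", "b", "b"),
  target = c(1, 3, 10, 20, 30)
)
enc <- ave(df$target, df$group, FUN = mean)
enc  # 2 2 20 20 20
```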
Transform numeric columns inside the AutoXGBoostRegression() function
Supply a vector of partition ratios, e.g. c(0.70, 0.20, 0.10).
Set to TRUE to save the model objects to file in the folders listed in Paths
Set to TRUE to return all model objects
Set to the number of partial dependence calibration plots to return.
Set to TRUE if you want to grid tune the models
Select the metric to optimize in grid tuning: "accuracy", "microauc", or "logloss"
Set to a numeric value for the number of models to try in grid tune
Set to "default"
Number of runs without a new winner before stopping the grid tuning
Max number of minutes to allow the grid tuning to run for
Provide a named list to use a different set of tree counts for each model, e.g. Trees = list("classifier" = seq(1000, 2000, 100), "regression" = seq(1000, 2000, 100))
Provide a named list to use different eta values for each model.
Provide a named list to use different max_depth values for each model.
Provide a named list to use different min_child_weight values for each model.
Provide a named list to use different subsample values for each model.
Provide a named list to use different colsample_bytree values for each model.
Returns AutoXGBoostRegression() model objects: VariableImportance.csv, Model, ValidationData.csv, EvaluationPlot.png, EvaluationBoxPlot.png, EvaluationMetrics.csv, ParDepPlots.R (a named list of features with partial dependence calibration plots), ParDepBoxPlots.R, GridCollect, and the grid used
Other Supervised Learning - Compound:
AutoCatBoostHurdleModel(), AutoH2oDRFHurdleModel(), AutoH2oGBMHurdleModel()
# NOT RUN {
Output <- RemixAutoML::AutoXGBoostHurdleModel(
# Operationalization args
TreeMethod = "hist",
TrainOnFull = FALSE,
PassInGrid = NULL,
# Metadata args
NThreads = max(1L, parallel::detectCores()-2L),
ModelID = "ModelTest",
Paths = normalizePath("./"),
MetaDataPaths = NULL,
# data args
data,
ValidationData = NULL,
TestData = NULL,
Buckets = 0L,
TargetColumnName = NULL,
FeatureColNames = NULL,
IDcols = NULL,
# options
EncodingMethod = "binary",
TransformNumericColumns = NULL,
SplitRatios = c(0.70, 0.20, 0.10),
ReturnModelObjects = TRUE,
SaveModelObjects = FALSE,
NumOfParDepPlots = 10L,
# grid tuning args
GridTune = FALSE,
grid_eval_metric = "accuracy",
MaxModelsInGrid = 1L,
BaselineComparison = "default",
MaxRunsWithoutNewWinner = 10L,
MaxRunMinutes = 60L,
# bandit hyperparameters
Trees = list("classifier" = seq(1000,2000,100),
"regression" = seq(1000,2000,100)),
eta = list("classifier" = seq(0.05,0.40,0.05),
"regression" = seq(0.05,0.40,0.05)),
max_depth = list("classifier" = seq(4L,16L,2L),
"regression" = seq(4L,16L,2L)),
# random hyperparameters
min_child_weight = list("classifier" = seq(1.0,10.0,1.0),
"regression" = seq(1.0,10.0,1.0)),
subsample = list("classifier" = seq(0.55,1.0,0.05),
"regression" = seq(0.55,1.0,0.05)),
colsample_bytree = list("classifier" = seq(0.55,1.0,0.05),
"regression" = seq(0.55,1.0,0.05)))
# }