AutoXGBoostHurdleModel is a generalized hurdle modeling framework that pairs a bucket classifier with per-bucket regression models.
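The core idea of a hurdle model can be sketched in a few lines of R. This is a minimal illustration, not the package's internal code: `hurdle_predict`, `p_exceed`, and `reg_pred` are hypothetical stand-ins for the outputs of the internal classifier and regression models in the simplest two-part case (Buckets = 0L).

```r
# Two-part hurdle prediction sketch: the final prediction is the
# classifier's probability of exceeding the bucket boundary multiplied
# by the conditional regression prediction.
hurdle_predict <- function(p_exceed, reg_pred) {
  p_exceed * reg_pred
}

p_exceed <- c(0.10, 0.85, 0.50)  # P(target > 0) from the classifier
reg_pred <- c(20.0, 40.0, 10.0)  # E[target | target > 0] from the regressor
hurdle_predict(p_exceed, reg_pred)  # expected values: 2, 34, 5
```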
AutoXGBoostHurdleModel(
TreeMethod = "hist",
TrainOnFull = FALSE,
PassInGrid = NULL,
NThreads = max(1L, parallel::detectCores() - 2L),
ModelID = "ModelTest",
Paths = NULL,
MetaDataPaths = NULL,
data,
ValidationData = NULL,
TestData = NULL,
Buckets = 0L,
TargetColumnName = NULL,
FeatureColNames = NULL,
IDcols = NULL,
EncodingMethod = "binary",
TransformNumericColumns = NULL,
SplitRatios = c(0.7, 0.2, 0.1),
SaveModelObjects = FALSE,
ReturnModelObjects = TRUE,
NumOfParDepPlots = 10L,
GridTune = FALSE,
grid_eval_metric = "accuracy",
MaxModelsInGrid = 1L,
BaselineComparison = "default",
MaxRunsWithoutNewWinner = 10L,
MaxRunMinutes = 60L,
Trees = list(classifier = seq(1000, 2000, 100), regression = seq(1000, 2000, 100)),
eta = list(classifier = seq(0.05, 0.4, 0.05), regression = seq(0.05, 0.4, 0.05)),
max_depth = list(classifier = seq(4L, 16L, 2L), regression = seq(4L, 16L, 2L)),
min_child_weight = list(classifier = seq(1, 10, 1), regression = seq(1, 10, 1)),
subsample = list(classifier = seq(0.55, 1, 0.05), regression = seq(0.55, 1, 0.05)),
colsample_bytree = list(classifier = seq(0.55, 1, 0.05), regression = seq(0.55, 1, 0.05))
)
Set to "hist" or "gpu_hist", depending on whether your xgboost installation is capable of GPU processing
Set to TRUE to train the model on 100 percent of the data
Pass in a grid to change the parameter settings for xgboost
Set to the number of threads you would like to dedicate to training
Define a character name for your models
The path to your folder where you want your model information saved
A character string of the file path where you want your model evaluation output saved. If left NULL, all output will be saved to Paths.
Source training data. Do not include a column that has the class labels for the buckets as they are created internally.
Source validation data. Do not include a column that has the class labels for the buckets as they are created internally.
Source test data. Do not include a column that has the class labels for the buckets as they are created internally.
A numeric vector of the buckets used for subsetting the data. NOTE: the final Bucket value will first create a subset of data that is less than the value and a second one thereafter for data greater than the bucket value.
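The bucketing described above can be illustrated with base R. This is a hypothetical sketch of the labeling logic (the package creates these class labels internally, and its exact boundary handling may differ): with Buckets = c(0, 10), records fall into class 0 (target <= 0), class 1 (0 < target <= 10), or class 2 (target > 10).

```r
# Illustrative bucketing of a target vector against bucket boundaries.
# findInterval with left.open = TRUE assigns each value to the interval
# (buckets[i], buckets[i + 1]], mirroring the subsetting described above.
target  <- c(0, 3, 12, 0, 25)
buckets <- c(0, 10)
labels  <- findInterval(target, buckets, left.open = TRUE)
labels  # 0 1 2 0 2
```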
Supply the column name or number for the target variable
Supply the column names or numbers of the features (not including the PrimaryDateColumn)
Includes PrimaryDateColumn and any other columns you want returned in the validation data with predictions
Choose from 'binary', 'poly_encode', 'backward_difference', 'helmert' for multiclass cases and additionally 'm_estimator', 'credibility', 'woe', 'target_encoding' for classification use cases.
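To make the encoding options concrete, here is a minimal sketch of one of the listed methods, 'target_encoding', using only base R. This illustrates the general technique (replacing each categorical level with the mean of the target within that level), not the package's internal implementation; `df` and `enc` are hypothetical names.

```r
# Target encoding sketch: each level of a categorical column is replaced
# by the mean of the target variable within that level.
df <- data.frame(
  group  = c("a", "a", "b", "b", "b"),
  target = c(1, 3, 10, 20, 30)
)
enc <- ave(df$target, df$group, FUN = mean)
enc  # 2 2 20 20 20
```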
Transform numeric columns inside the AutoXGBoostRegression() function
Supply a vector of partition ratios, e.g. c(0.70, 0.20, 0.10).
Set to TRUE to save the model objects to file in the folders listed in Paths
Set to TRUE to return all model objects
Set to the number of partial dependence calibration plots to return.
Set to TRUE if you want to grid tune the models
Select the metric to optimize in grid tuning: "accuracy", "microauc", or "logloss"
Set to a numeric value for the number of models to try in grid tune
Set to "default"
Number of runs without a new winner before stopping the grid tuning
Max number of minutes to allow the grid tuning to run for
Provide a named list to use a different set of tree counts for each model, e.g. Trees = list("classifier" = seq(1000, 2000, 100), "regression" = seq(1000, 2000, 100))
Provide a named list to use different eta values for each model.
Provide a named list to use different max_depth values for each model.
Provide a named list to use different min_child_weight values for each model.
Provide a named list to use different subsample values for each model.
Provide a named list to use different colsample_bytree values for each model.
Returns AutoXGBoostRegression() model objects: VariableImportance.csv, Model, ValidationData.csv, EvaluationPlot.png, EvaluationBoxPlot.png, EvaluationMetrics.csv, ParDepPlots.R (a named list of features with partial dependence calibration plots), ParDepBoxPlots.R, GridCollect, and the grid used
Other Supervised Learning - Compound:
AutoCatBoostHurdleModel(), AutoH2oDRFHurdleModel(), AutoH2oGBMHurdleModel()
# NOT RUN {
Output <- RemixAutoML::AutoXGBoostHurdleModel(
# Operationalization args
TreeMethod = "hist",
TrainOnFull = FALSE,
PassInGrid = NULL,
# Metadata args
NThreads = max(1L, parallel::detectCores()-2L),
ModelID = "ModelTest",
Paths = normalizePath("./"),
MetaDataPaths = NULL,
# data args
data,
ValidationData = NULL,
TestData = NULL,
Buckets = 0L,
TargetColumnName = NULL,
FeatureColNames = NULL,
IDcols = NULL,
# options
EncodingMethod = "binary",
TransformNumericColumns = NULL,
SplitRatios = c(0.70, 0.20, 0.10),
ReturnModelObjects = TRUE,
SaveModelObjects = FALSE,
NumOfParDepPlots = 10L,
# grid tuning args
GridTune = FALSE,
grid_eval_metric = "accuracy",
MaxModelsInGrid = 1L,
BaselineComparison = "default",
MaxRunsWithoutNewWinner = 10L,
MaxRunMinutes = 60L,
# bandit hyperparameters
Trees = list("classifier" = seq(1000,2000,100),
"regression" = seq(1000,2000,100)),
eta = list("classifier" = seq(0.05,0.40,0.05),
"regression" = seq(0.05,0.40,0.05)),
max_depth = list("classifier" = seq(4L,16L,2L),
"regression" = seq(4L,16L,2L)),
# random hyperparameters
min_child_weight = list("classifier" = seq(1.0,10.0,1.0),
"regression" = seq(1.0,10.0,1.0)),
subsample = list("classifier" = seq(0.55,1.0,0.05),
"regression" = seq(0.55,1.0,0.05)),
colsample_bytree = list("classifier" = seq(0.55,1.0,0.05),
"regression" = seq(0.55,1.0,0.05)))
# }