This function trains an interpretable boosted linear model.
The function combines a Generalized Linear Model (GLM) with an XGBoost booster model.
The "booster" model is trained on:
- actual responses / GLM predictions, when the link function is log
- actual responses - GLM predictions, when the link function is identity
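For intuition, the two booster targets can be sketched with plain R arithmetic. This is an illustrative sketch of the idea, not the package's actual internal code; the vectors below are made up:

```r
# Hypothetical observed responses and GLM predictions
actual   <- c(2, 0, 1, 3)
glm_pred <- c(1.5, 0.5, 1.2, 2.0)

# Log link (e.g. poisson, gamma): the booster learns a multiplicative correction
target_log <- actual / glm_pred

# Identity link (e.g. gaussian): the booster learns an additive correction
target_identity <- actual - glm_pred
```

The final prediction then recombines the two models accordingly: multiplicatively for a log link, additively for an identity link.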
Usage

train_iblm_xgb(
df_list,
response_var,
family = "poisson",
params = list(),
nrounds = 1000,
objective = NULL,
custom_metric = NULL,
verbose = 0,
print_every_n = 1L,
early_stopping_rounds = 25,
maximize = NULL,
save_period = NULL,
save_name = "xgboost.model",
xgb_model = NULL,
callbacks = list(),
...,
strip_glm = TRUE
)

Value

An object of class "iblm" containing:
- `glm_model`: the GLM model object, fitted on the `df_list$train` data that was provided
- `booster_model`: the booster model object, trained on the residuals left over from `glm_model`
- a list containing the data that was used to train and validate this iblm model
- a string explaining how to combine `glm_model` and `booster_model`; currently either "Additive" or "Multiplicative"
- a string giving the response variable used for this iblm model
- a list describing the predictor variables used for this iblm model
- a list describing the categorical levels for the predictor variables
- a list describing the coefficient names
Arguments

- `df_list`: A named list containing training and validation datasets. Must have elements named "train" and "validate", each containing data frames with the same structure. This item is naturally output from the function [split_into_train_validate_test()].
- `response_var`: Character string specifying the name of the response variable column in the datasets. The column MUST appear in both `df_list$train` and `df_list$validate`.
- `family`: Character string specifying the distributional family for the model. Currently only "poisson", "gamma", "tweedie" and "gaussian" are fully supported. See Details for how this impacts fitting.
- `params`: Named list of additional parameters to pass to `xgb.train`. Note that `train_iblm_xgb()` selects "objective" and "base_score" for you depending on `family` (see Details). You may overwrite these, but do so with caution.
- `...`: Additional arguments passed directly to `xgb.train`.
- `strip_glm`: TRUE/FALSE; whether to strip superfluous data from the `glm_model` object stored in the returned `iblm` object. Only serves to reduce memory usage.
Details

The `family` argument is passed to the GLM fitting. Default `params` values for the XGBoost fitting are also selected based on `family`:
- For the "poisson" family, "objective" is set to "count:poisson"
- For the "gamma" family, "objective" is set to "reg:gamma"
- For the "tweedie" family, "objective" is set to "reg:tweedie", and "tweedie_variance_power" to 1.5
- For the "gaussian" family, "objective" is set to "reg:squarederror"

Note: any of these defaults will be overwritten by values explicitly supplied to `train_iblm_xgb()`.
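The family-to-default mapping can be sketched as follows. This is a hypothetical helper for illustration only, not the package's actual code; only the objective strings and the Tweedie variance power come from the documentation above:

```r
# Illustrative sketch of how per-family XGBoost defaults might be chosen
default_xgb_params <- function(family) {
  switch(family,
    poisson  = list(objective = "count:poisson"),
    gamma    = list(objective = "reg:gamma"),
    tweedie  = list(objective = "reg:tweedie", tweedie_variance_power = 1.5),
    gaussian = list(objective = "reg:squarederror"),
    stop("Unsupported family: ", family)
  )
}

# User-supplied `params` entries would then take precedence, e.g.:
# modifyList(default_xgb_params("poisson"), params)
```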
See Also

glm, xgb.train
Examples

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)
iblm_model <- train_iblm_xgb(
df_list,
response_var = "ClaimRate",
family = "poisson"
)
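As a further illustration, XGBoost defaults can be overridden via `params` and the other pass-through arguments. The specific tuning values below (`max_depth`, `eta`, and so on) are arbitrary choices for the example, not recommended settings:

```r
# Hypothetical call overriding some XGBoost defaults
iblm_model_custom <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson",
  params = list(max_depth = 4, eta = 0.05),  # merged over the family defaults
  nrounds = 500,
  early_stopping_rounds = 10
)
```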