This function trains an interpretable boosted linear model.
The function combines a Generalized Linear Model (GLM) with an XGBoost booster model.
The "booster" model is trained on:
- actual responses / GLM predictions, when the link function is log
- actual responses - GLM predictions, when the link function is identity
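For intuition, the two booster targets can be sketched with plain R arithmetic. This is an illustrative sketch of the idea, not the package's actual internal code; the vectors below are made up:

```r
# Hypothetical observed responses and GLM predictions
actual   <- c(2, 0, 1, 3)
glm_pred <- c(1.5, 0.5, 1.2, 2.0)

# Log link (e.g. poisson, gamma): the booster learns a multiplicative correction
target_log <- actual / glm_pred

# Identity link (e.g. gaussian): the booster learns an additive correction
target_identity <- actual - glm_pred
```

The final prediction then recombines the two models accordingly: multiplicatively for a log link, additively for an identity link.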
Usage

train_iblm_xgb(
df_list,
response_var,
family = "poisson",
params = list(),
nrounds = 1000,
objective = NULL,
custom_metric = NULL,
verbose = 0,
print_every_n = 1L,
early_stopping_rounds = 25,
maximize = NULL,
save_period = NULL,
save_name = "xgboost.model",
xgb_model = NULL,
callbacks = list(),
...,
strip_glm = TRUE
)

Value

An object of class "iblm" containing:
- `glm_model`: the GLM model object, fitted on the `df_list$train` data that was provided
- `booster_model`: the booster model object, trained on the residuals left over from `glm_model`
- a list containing the data that was used to train and validate this iblm model
- a string explaining how to combine `glm_model` and `booster_model`; currently either "Additive" or "Multiplicative"
- a string giving the response variable used for this iblm model
- a list describing the predictor variables used for this iblm model
- a list describing the categorical levels for the predictor variables
- a list describing the coefficient names
Arguments

- `df_list`: A named list containing training and validation datasets. Must have elements named "train" and "validate", each containing data frames with the same structure. This item is naturally output from the function [split_into_train_validate_test()].
- `response_var`: Character string specifying the name of the response variable column in the datasets. The column MUST appear in both `df_list$train` and `df_list$validate`.
- `family`: Character string specifying the distributional family for the model. Currently only "poisson", "gamma", "tweedie" and "gaussian" are fully supported. See Details for how this impacts fitting.
- `params`: Named list of additional parameters to pass to `xgb.train`. Note that `train_iblm_xgb()` selects "objective" and "base_score" for you depending on `family` (see Details). You may overwrite these, but do so with caution.
- `...`: Additional arguments passed directly to `xgb.train`.
- `strip_glm`: TRUE/FALSE; whether to strip superfluous data from the `glm_model` object stored in the returned `iblm` object. Only serves to reduce memory usage.
Details

The `family` argument is passed to the GLM fitting. Default `params` values for the XGBoost fitting are also selected based on `family`:
- For the "poisson" family, "objective" is set to "count:poisson"
- For the "gamma" family, "objective" is set to "reg:gamma"
- For the "tweedie" family, "objective" is set to "reg:tweedie", and "tweedie_variance_power" to 1.5
- For the "gaussian" family, "objective" is set to "reg:squarederror"

Note: any of these defaults will be overwritten by values explicitly supplied to `train_iblm_xgb()`.
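The family-to-default mapping can be sketched as follows. This is a hypothetical helper for illustration only, not the package's actual code; only the objective strings and the Tweedie variance power come from the documentation above:

```r
# Illustrative sketch of how per-family XGBoost defaults might be chosen
default_xgb_params <- function(family) {
  switch(family,
    poisson  = list(objective = "count:poisson"),
    gamma    = list(objective = "reg:gamma"),
    tweedie  = list(objective = "reg:tweedie", tweedie_variance_power = 1.5),
    gaussian = list(objective = "reg:squarederror"),
    stop("Unsupported family: ", family)
  )
}

# User-supplied `params` entries would then take precedence, e.g.:
# modifyList(default_xgb_params("poisson"), params)
```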
See Also

glm, xgb.train
Examples

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)
iblm_model <- train_iblm_xgb(
df_list,
response_var = "ClaimRate",
family = "poisson"
)
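As a further illustration, XGBoost defaults can be overridden via `params` and the other pass-through arguments. The specific tuning values below (`max_depth`, `eta`, and so on) are arbitrary choices for the example, not recommended settings:

```r
# Hypothetical call overriding some XGBoost defaults
iblm_model_custom <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson",
  params = list(max_depth = 4, eta = 0.05),  # merged over the family defaults
  nrounds = 500,
  early_stopping_rounds = 10
)
```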