This function is an R wrapper for the explainable boosting functions in the Python interpret library. It trains an Explainable Boosting Machine (EBM) model, which is a tree-based, cyclic gradient boosting generalized additive model with automatic interaction detection. EBMs are often as accurate as state-of-the-art blackbox models while remaining completely interpretable.
ebm(
formula,
data,
max_bins = 1024L,
max_interaction_bins = 64L,
interactions = 0.9,
exclude = NULL,
validation_size = 0.15,
outer_bags = 16L,
inner_bags = 0L,
learning_rate = 0.04,
greedy_ratio = 10,
cyclic_progress = FALSE,
smoothing_rounds = 500L,
interaction_smoothing_rounds = 100L,
max_rounds = 25000L,
early_stopping_rounds = 100L,
early_stopping_tolerance = 1e-05,
min_samples_leaf = 4L,
min_hessian = 0,
reg_alpha = 0,
reg_lambda = 0,
max_delta_step = 0,
gain_scale = 5,
min_cat_samples = 10L,
cat_smooth = 10,
missing = "separate",
max_leaves = 2L,
monotone_constraints = NULL,
objective = c("auto", "log_loss", "rmse", "poisson_deviance",
"tweedie_deviance:variance_power=1.5", "gamma_deviance", "pseudo_huber:delta=1.0",
"rmse_log"),
n_jobs = -1L,
random_state = 42L,
...
)
An object of class "EBM" for which there are print, predict, plot, and merge methods.
formula: A formula of the form y ~ x1 + x2 + ....
data: A data frame containing the variables in the model.
max_bins: Max number of bins per feature for the main effects stage. Default is 1024.
max_interaction_bins: Max number of bins per feature for interaction terms. Default is 64.
interactions: Interaction terms to be included in the model. Default is 0.9. Current options include:
Integer (1 <= interactions): Count of interactions to be automatically selected.
Percentage (interactions < 1.0): Determine the integer count of interactions by multiplying the number of features by this percentage.
List of numeric pairs: The pairs contain the indices of the features within each additive term. In addition to pairs, the interactions parameter accepts higher order interactions. It also accepts univariate terms, which will cause the algorithm to boost the main terms at the same time as the interactions. When boosting mains at the same time as interactions, the exclude parameter should be set to "mains" and currently max_bins needs to be equal to max_interaction_bins (see the sketch after this description).
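For illustration, a minimal sketch of the three ways to specify interactions, using the mtcars data from the examples below; the term choices are arbitrary and the list-of-pairs form assumes 1-based feature indices:
# Count: automatically select 5 interaction pairs
fit1 <- ebm(mpg ~ ., data = mtcars, objective = "rmse", interactions = 5L)
# Percentage: use 50% of the number of features as the interaction count
fit2 <- ebm(mpg ~ ., data = mtcars, objective = "rmse", interactions = 0.5)
# Explicit pairs of feature indices (assumed 1-based), boosting mains at the
# same time as the interactions
fit3 <- ebm(mpg ~ ., data = mtcars, objective = "rmse",
            interactions = list(c(1, 2), c(1, 3)), exclude = "mains",
            max_bins = 64L, max_interaction_bins = 64L)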
exclude: Features or terms to be excluded. Default is NULL.
validation_size: Validation set size. Used for early stopping during boosting, and is needed to create outer bags. Default is 0.15. Options are (see the sketch after this list):
Integer (1 <= validation_size): Count of samples to put in the validation sets.
Percentage (validation_size < 1.0): Percentage of the data to put in the validation sets.
0: Turns off early stopping. Outer bags have no utility. Error bounds will be non-informative.
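As a rough sketch of the two ways to specify the validation set size (values chosen only for illustration):
# Percentage: put 20% of the rows in each validation set
fit <- ebm(mpg ~ ., data = mtcars, objective = "rmse", validation_size = 0.2)
# Count: put exactly 5 samples in each validation set
fit <- ebm(mpg ~ ., data = mtcars, objective = "rmse", validation_size = 5L)
# validation_size = 0 turns off early stopping altogether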
outer_bags: Number of outer bags. Outer bags are used to generate error bounds and help with smoothing the graphs. Default is 16.
inner_bags: Number of inner bags. Default is 0, which turns off inner bagging.
learning_rate: Learning rate for boosting. Default is 0.04.
greedy_ratio: The proportion of greedy boosting steps relative to cyclic boosting steps. A value of 0 disables greedy boosting. Default is 10.
cyclic_progress: This parameter specifies the proportion of the boosting cycles that will actively contribute to improving the model's performance. It is expressed as a logical or numeric between 0 and 1; a value of TRUE (1.0) means 100% of the cycles are expected to make forward progress. If forward progress is not achieved during a cycle, that cycle will not be wasted; instead, it will be used to update internal gain calculations related to how effective each feature is in predicting the target variable. Setting this parameter to a value less than 1.0 can be useful for preventing overfitting. Default is FALSE.
smoothing_rounds: Number of initial highly regularized rounds to set the basic shape of the main effect feature graphs. Default is 500.
interaction_smoothing_rounds: Number of initial highly regularized rounds to set the basic shape of the interaction effect feature graphs during fitting. Default is 100.
max_rounds: Total number of boosting rounds with n_terms boosting steps per round. Default is 25000.
early_stopping_rounds: Number of rounds with no improvement to trigger early stopping. 0 turns off early stopping and boosting will occur for exactly max_rounds. Default is 100. (See the sketch below.)
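As an illustration, the boosting-schedule settings (learning_rate, greedy_ratio, smoothing_rounds, max_rounds, and early_stopping_rounds) are often adjusted together; the values below are arbitrary, not recommendations:
fit <- ebm(mpg ~ ., data = mtcars, objective = "rmse",
           learning_rate = 0.02, greedy_ratio = 5, smoothing_rounds = 200L,
           max_rounds = 5000L, early_stopping_rounds = 50L)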
early_stopping_tolerance: Tolerance that dictates the smallest delta required to be considered an improvement which prevents the algorithm from early stopping. early_stopping_tolerance is expressed as a percentage of the early stopping metric. Negative values indicate that the individual models should be overfit before stopping. EBMs are a bagged ensemble of models. Setting the early_stopping_tolerance to zero (or even negative) allows learning to overfit each of the individual models a little, which can improve the accuracy of the ensemble as a whole. Overfitting each of the individual models reduces the bias of each model at the expense of increasing the variance (due to overfitting) of the individual models. But averaging the models in the ensemble reduces variance without much change in bias. Since the goal is to find the optimum bias-variance tradeoff for the ensemble of models---not the individual models---a small amount of overfitting of the individual models can improve the accuracy of the ensemble as a whole. Default is 1e-05. (See the sketch below.)
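A hedged illustration of allowing a small amount of per-model overfitting before early stopping kicks in (the tolerance value is arbitrary):
# Negative tolerance lets each of the 16 bagged models overfit slightly
fit <- ebm(mpg ~ ., data = mtcars, objective = "rmse",
           early_stopping_tolerance = -1e-04, outer_bags = 16L)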
min_samples_leaf: Minimum number of samples allowed in the leaves. Default is 4.
min_hessian: Minimum hessian required to consider a potential split valid. Default is 0.0.
reg_alpha: L1 regularization. Default is 0.0.
reg_lambda: L2 regularization. Default is 0.0.
max_delta_step: Used to limit the max output of tree leaves; <=0.0 means no constraint. Default is 0.0.
gain_scale: Scale factor to apply to nominal categoricals. A scale factor above 1.0 will cause the algorithm to focus more on the nominal categoricals. Default is 5.0.
min_cat_samples: Minimum number of samples required to treat a category separately. If a category has fewer samples than this threshold, it is combined with other categories that have low sample counts. Default is 10.
cat_smooth: Smoothing applied to the categorical features. This can reduce the effect of noise in categorical features, especially for categories with limited data. Default is 10.0. (See the sketch below.)
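A rough sketch of adjusting the categorical-handling settings together (the values are arbitrary; cyl is converted to a factor purely for illustration):
fit <- ebm(mpg ~ ., data = transform(mtcars, cyl = factor(cyl)),
           objective = "rmse", gain_scale = 2, min_cat_samples = 5L,
           cat_smooth = 20)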
missing: Method for handling missing values during boosting. Default is "separate". The placement of the missing value bin can influence the resulting model graphs. For example, placing the bin on the "low" side may cause missing values to affect lower bins, and vice versa. This parameter does not affect the final placement of the missing bin in the model (the missing bin will remain at index 0 in the term_scores_ attribute). Possible values for missing are (see the sketch after this list):
"low": Place the missing bin on the left side of the graphs.
"high": Place the missing bin on the right side of the graphs.
"separate": Place the missing bin in its own leaf during each boosting step, effectively making it location-agnostic. This can lead to overfitting, especially when the proportion of missing values is small.
"gain": Choose the best leaf for the missing value contribution at each boosting step, based on gain.
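For example, a minimal sketch comparing two placements of the missing-value bin; a few NA values are introduced into a copy of mtcars purely for illustration:
mt <- mtcars
mt$wt[c(3, 7)] <- NA  # fabricate some missing values for the sketch
fit_low  <- ebm(mpg ~ ., data = mt, objective = "rmse", missing = "low")
fit_gain <- ebm(mpg ~ ., data = mt, objective = "rmse", missing = "gain")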
max_leaves: Maximum number of leaves allowed in each tree. Default is 2.
monotone_constraints: Default is NULL. This parameter allows you to specify monotonic constraints for each feature's relationship with the target variable during model fitting. However, it is generally recommended to apply monotonic constraints post-fit using the monotonize() function rather than setting them during the fitting process. This recommendation is based on the observation that, during fitting, the boosting algorithm may compensate for a monotone constraint on one feature by utilizing another correlated feature, potentially obscuring any monotonic violations. If you choose to define monotone constraints, monotone_constraints should be a numeric vector with a length equal to the number of features. Each element in the vector corresponds to a feature and should take one of the following values (see the sketch after this list):
0: No monotonic constraint is imposed on the corresponding feature's partial response.
+1: The partial response of the corresponding feature should be monotonically increasing with respect to the target.
-1: The partial response of the corresponding feature should be monotonically decreasing with respect to the target.
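If you do set constraints at fit time, a sketch might look like the following; the constraint pattern is arbitrary and assumes one entry per feature in the order the features enter the model:
# mpg ~ . on mtcars has 10 predictors; force a monotonically decreasing
# partial response for the first feature and leave the rest unconstrained
fit <- ebm(mpg ~ ., data = mtcars, objective = "rmse",
           monotone_constraints = c(-1, rep(0, 9)))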
objective: The objective function to optimize. Current options include:
"auto" (try to determine automatically between "log_loss" and "rmse").
"log_loss" (log loss; e.g., for binary or multiclass classification).
"rmse" (root mean squared error).
"poisson_deviance" (e.g., for counts or non-negative integers).
"tweedie_deviance:variance_power=1.5" (e.g., for modeling total loss in insurance applications).
"gamma_deviance" (e.g., for positive continuous response).
"pseudo_huber:delta=1.0" (e.g., for robust regression).
"rmse_log" ("rmse" with a log link function).
Default is "auto" which assumes "log_loss" if the response is a factor or character string and "rmse" otherwise. It's a good idea to always explicitly set this argument (see the sketch below).
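Since it's best to state the objective explicitly, a couple of hedged examples (the classification response below simply reuses the binary am column from mtcars):
# Regression with an explicit objective
fit_reg <- ebm(mpg ~ ., data = mtcars, objective = "rmse")
# Classification: a factor response would trigger "log_loss" under "auto",
# but setting it explicitly avoids surprises
fit_cls <- ebm(am ~ mpg + wt + hp, data = transform(mtcars, am = factor(am)),
               objective = "log_loss")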
n_jobs: Number of jobs to run in parallel. Default is -1. Negative integers are interpreted as following joblib's formula (n_cpus + 1 + n_jobs), just like scikit-learn. For example, n_jobs = -2 means using all threads except 1.
random_state: Random state. Setting to NULL generates non-repeatable sequences. Default is 42 to remain consistent with the corresponding Python module.
...: Additional optional arguments. (Currently ignored.)
In short, EBMs have the general form
$$g\left(E\left[Y|\boldsymbol{x}\right]\right) = \theta_0 + \sum_if_i\left(x_i\right) + \sum_{ij}f_{ij}\left(x_i, x_j\right) \quad \left(i \ne j\right),$$
where,
\(g\) is a link function that allows the model to handle various response types (e.g., the logit link for logistic regression or the log link for modeling counts and rates);
\(\theta_0\) is a constant intercept (or bias term);
\(f_i\) is the term contribution (or shape function) for predictor \(x_i\) (i.e., it captures the main effect of \(x_i\) on \(E\left[Y|\boldsymbol{x}\right]\));
\(f_{ij}\) is the term contribution for the pair of predictors \(x_i\) and \(x_j\) (i.e., it captures the joint effect, or pairwise interaction effect of \(x_i\) and \(x_j\) on \(E\left[Y|\boldsymbol{x}\right]\)).
if (FALSE) {
#
# Regression example
#
# Fit a default EBM regressor
fit <- ebm(mpg ~ ., data = mtcars, objective = "rmse")
# Generate some predictions
head(predict(fit, newdata = mtcars))
head(predict(fit, newdata = mtcars, se_fit = TRUE))
# Show global summary and GAM shape functions
plot(fit) # term importance scores
plot(fit, term = "cyl")
plot(fit, term = "cyl", interactive = TRUE)
# Explain prediction for first observation
plot(fit, local = TRUE, X = subset(mtcars, select = -mpg)[1L, ])
}