This function is an R wrapper for the explainable boosting functions in the Python interpret library. It trains an Explainable Boosting Machine (EBM) model, which is a tree-based, cyclic gradient boosting generalized additive model with automatic interaction detection. EBMs are often as accurate as state-of-the-art blackbox models while remaining completely interpretable.
ebm(
formula,
data,
max_bins = 1024L,
max_interaction_bins = 64L,
interactions = 0.9,
exclude = NULL,
validation_size = 0.15,
outer_bags = 16L,
inner_bags = 0L,
learning_rate = 0.04,
greedy_ratio = 10,
cyclic_progress = FALSE,
smoothing_rounds = 500L,
interaction_smoothing_rounds = 100L,
max_rounds = 25000L,
early_stopping_rounds = 100L,
early_stopping_tolerance = 1e-05,
min_samples_leaf = 4L,
min_hessian = 0,
reg_alpha = 0,
reg_lambda = 0,
max_delta_step = 0,
gain_scale = 5,
min_cat_samples = 10L,
cat_smooth = 10,
missing = "separate",
max_leaves = 2L,
monotone_constraints = NULL,
objective = c("auto", "log_loss", "rmse", "poisson_deviance",
"tweedie_deviance:variance_power=1.5", "gamma_deviance", "pseudo_huber:delta=1.0",
"rmse_log"),
n_jobs = -1L,
random_state = 42L,
...
)
An object of class "EBM" for which there are print, predict, plot, and merge methods.
formula: A formula of the form y ~ x1 + x2 + ....
data: A data frame containing the variables in the model.
max_bins: Max number of bins per feature for the main effects stage. Default is 1024.
max_interaction_bins: Max number of bins per feature for interaction terms. Default is 64.
interactions: Interaction terms to be included in the model. Default is 0.9. Current options include:
Integer (1 <= interactions): Count of interactions to be automatically selected.
Percentage (interactions < 1.0): Determine the integer count of interactions by multiplying the number of features by this percentage.
List of numeric pairs: The pairs contain the indices of the features within each additive term. In addition to pairs, the interactions parameter accepts higher order interactions. It also accepts univariate terms, which will cause the algorithm to boost the main terms at the same time as the interactions. When boosting mains at the same time as interactions, the exclude parameter should be set to "mains" and currently max_bins needs to be equal to max_interaction_bins (see the sketch after this description).
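For illustration, a minimal sketch of the three ways to specify interactions, using the mtcars data from the examples below; the term choices are arbitrary and the list-of-pairs form assumes 1-based feature indices:
# Count: automatically select 5 interaction pairs
fit1 <- ebm(mpg ~ ., data = mtcars, objective = "rmse", interactions = 5L)
# Percentage: use 50% of the number of features as the interaction count
fit2 <- ebm(mpg ~ ., data = mtcars, objective = "rmse", interactions = 0.5)
# Explicit pairs of feature indices (assumed 1-based), boosting mains at the
# same time as the interactions
fit3 <- ebm(mpg ~ ., data = mtcars, objective = "rmse",
            interactions = list(c(1, 2), c(1, 3)), exclude = "mains",
            max_bins = 64L, max_interaction_bins = 64L)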
exclude: Features or terms to be excluded. Default is NULL.
validation_size: Validation set size. Used for early stopping during boosting, and is needed to create outer bags. Default is 0.15. Options are (see the sketch after this list):
Integer (1 <= validation_size): Count of samples to put in the validation sets.
Percentage (validation_size < 1.0): Percentage of the data to put in the validation sets.
0: Turns off early stopping. Outer bags have no utility. Error bounds will be non-informative.
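As a rough sketch of the two ways to specify the validation set size (values chosen only for illustration):
# Percentage: put 20% of the rows in each validation set
fit <- ebm(mpg ~ ., data = mtcars, objective = "rmse", validation_size = 0.2)
# Count: put exactly 5 samples in each validation set
fit <- ebm(mpg ~ ., data = mtcars, objective = "rmse", validation_size = 5L)
# validation_size = 0 turns off early stopping altogether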
outer_bags: Number of outer bags. Outer bags are used to generate error bounds and help with smoothing the graphs. Default is 16.
inner_bags: Number of inner bags. Default is 0, which turns off inner bagging.
learning_rate: Learning rate for boosting. Default is 0.04.
greedy_ratio: The proportion of greedy boosting steps relative to cyclic boosting steps. A value of 0 disables greedy boosting. Default is 10.
cyclic_progress: This parameter specifies the proportion of the boosting cycles that will actively contribute to improving the model's performance. It is expressed as a logical or numeric between 0 and 1; a value of TRUE (1.0) means 100% of the cycles are expected to make forward progress. If forward progress is not achieved during a cycle, that cycle will not be wasted; instead, it will be used to update internal gain calculations related to how effective each feature is in predicting the target variable. Setting this parameter to a value less than 1.0 can be useful for preventing overfitting. Default is FALSE.
smoothing_rounds: Number of initial highly regularized rounds to set the basic shape of the main effect feature graphs. Default is 500.
interaction_smoothing_rounds: Number of initial highly regularized rounds to set the basic shape of the interaction effect feature graphs during fitting. Default is 100.
max_rounds: Total number of boosting rounds with n_terms boosting steps per round. Default is 25000.
early_stopping_rounds: Number of rounds with no improvement to trigger early stopping. 0 turns off early stopping and boosting will occur for exactly max_rounds. Default is 100. (See the sketch below.)
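As an illustration, the boosting-schedule settings (learning_rate, greedy_ratio, smoothing_rounds, max_rounds, and early_stopping_rounds) are often adjusted together; the values below are arbitrary, not recommendations:
fit <- ebm(mpg ~ ., data = mtcars, objective = "rmse",
           learning_rate = 0.02, greedy_ratio = 5, smoothing_rounds = 200L,
           max_rounds = 5000L, early_stopping_rounds = 50L)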
early_stopping_tolerance: Tolerance that dictates the smallest delta required to be considered an improvement which prevents the algorithm from early stopping. early_stopping_tolerance is expressed as a percentage of the early stopping metric. Negative values indicate that the individual models should be overfit before stopping. EBMs are a bagged ensemble of models. Setting the early_stopping_tolerance to zero (or even negative) allows learning to overfit each of the individual models a little, which can improve the accuracy of the ensemble as a whole. Overfitting each of the individual models reduces the bias of each model at the expense of increasing the variance (due to overfitting) of the individual models. But averaging the models in the ensemble reduces variance without much change in bias. Since the goal is to find the optimum bias-variance tradeoff for the ensemble of models---not the individual models---a small amount of overfitting of the individual models can improve the accuracy of the ensemble as a whole. Default is 1e-05. (See the sketch below.)
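A hedged illustration of allowing a small amount of per-model overfitting before early stopping kicks in (the tolerance value is arbitrary):
# Negative tolerance lets each of the 16 bagged models overfit slightly
fit <- ebm(mpg ~ ., data = mtcars, objective = "rmse",
           early_stopping_tolerance = -1e-04, outer_bags = 16L)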
min_samples_leaf: Minimum number of samples allowed in the leaves. Default is 4.
min_hessian: Minimum hessian required to consider a potential split valid. Default is 0.0.
reg_alpha: L1 regularization. Default is 0.0.
reg_lambda: L2 regularization. Default is 0.0.
max_delta_step: Used to limit the max output of tree leaves; <=0.0 means no constraint. Default is 0.0.
gain_scale: Scale factor to apply to nominal categoricals. A scale factor above 1.0 will cause the algorithm to focus more on the nominal categoricals. Default is 5.0.
min_cat_samples: Minimum number of samples required to treat a category separately. If a category has fewer samples than this threshold, it is combined with other categories that have low sample counts. Default is 10.
cat_smooth: Smoothing applied to the categorical features. This can reduce the effect of noise in categorical features, especially for categories with limited data. Default is 10.0. (See the sketch below.)
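A rough sketch of adjusting the categorical-handling settings together (the values are arbitrary; cyl is converted to a factor purely for illustration):
fit <- ebm(mpg ~ ., data = transform(mtcars, cyl = factor(cyl)),
           objective = "rmse", gain_scale = 2, min_cat_samples = 5L,
           cat_smooth = 20)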
missing: Method for handling missing values during boosting. Default is "separate". The placement of the missing value bin can influence the resulting model graphs. For example, placing the bin on the "low" side may cause missing values to affect lower bins, and vice versa. This parameter does not affect the final placement of the missing bin in the model (the missing bin will remain at index 0 in the term_scores_ attribute). Possible values for missing are (see the sketch after this list):
"low": Place the missing bin on the left side of the graphs.
"high": Place the missing bin on the right side of the graphs.
"separate": Place the missing bin in its own leaf during each boosting step, effectively making it location-agnostic. This can lead to overfitting, especially when the proportion of missing values is small.
"gain": Choose the best leaf for the missing value contribution at each boosting step, based on gain.
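For example, a minimal sketch comparing two placements of the missing-value bin; a few NA values are introduced into a copy of mtcars purely for illustration:
mt <- mtcars
mt$wt[c(3, 7)] <- NA  # fabricate some missing values for the sketch
fit_low  <- ebm(mpg ~ ., data = mt, objective = "rmse", missing = "low")
fit_gain <- ebm(mpg ~ ., data = mt, objective = "rmse", missing = "gain")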
max_leaves: Maximum number of leaves allowed in each tree. Default is 2.
monotone_constraints: Default is NULL. This parameter allows you to specify monotonic constraints for each feature's relationship with the target variable during model fitting. However, it is generally recommended to apply monotonic constraints post-fit using the monotonize() function rather than setting them during the fitting process. This recommendation is based on the observation that, during fitting, the boosting algorithm may compensate for a monotone constraint on one feature by utilizing another correlated feature, potentially obscuring any monotonic violations. If you choose to define monotone constraints, monotone_constraints should be a numeric vector with a length equal to the number of features. Each element in the vector corresponds to a feature and should take one of the following values (see the sketch after this list):
0: No monotonic constraint is imposed on the corresponding feature's partial response.
+1: The partial response of the corresponding feature should be monotonically increasing with respect to the target.
-1: The partial response of the corresponding feature should be monotonically decreasing with respect to the target.
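If you do set constraints at fit time, a sketch might look like the following; the constraint pattern is arbitrary and assumes one entry per feature in the order the features enter the model:
# mpg ~ . on mtcars has 10 predictors; force a monotonically decreasing
# partial response for the first feature and leave the rest unconstrained
fit <- ebm(mpg ~ ., data = mtcars, objective = "rmse",
           monotone_constraints = c(-1, rep(0, 9)))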
objective: The objective function to optimize. Current options include:
"auto" (try to determine automatically between "log_loss" and "rmse").
"log_loss" (log loss; e.g., for binary or multiclass classification).
"rmse" (root mean squared error).
"poisson_deviance" (e.g., for counts or non-negative integers).
"tweedie_deviance:variance_power=1.5" (e.g., for modeling total loss in insurance applications).
"gamma_deviance" (e.g., for positive continuous response).
"pseudo_huber:delta=1.0" (e.g., for robust regression).
"rmse_log" ("rmse" with a log link function).
Default is "auto" which assumes "log_loss" if the response is a factor or character string and "rmse" otherwise. It's a good idea to always explicitly set this argument (see the sketch below).
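Since it's best to state the objective explicitly, a couple of hedged examples (the classification response below simply reuses the binary am column from mtcars):
# Regression with an explicit objective
fit_reg <- ebm(mpg ~ ., data = mtcars, objective = "rmse")
# Classification: a factor response would trigger "log_loss" under "auto",
# but setting it explicitly avoids surprises
fit_cls <- ebm(am ~ mpg + wt + hp, data = transform(mtcars, am = factor(am)),
               objective = "log_loss")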
n_jobs: Number of jobs to run in parallel. Default is -1. Negative integers are interpreted as following joblib's formula (n_cpus + 1 + n_jobs), just like scikit-learn. For example, n_jobs = -2 means using all threads except 1.
random_state: Random state. Setting to NULL generates non-repeatable sequences. Default is 42 to remain consistent with the corresponding Python module.
...: Additional optional arguments. (Currently ignored.)
In short, EBMs have the general form
$$g\left(E\left[Y|\boldsymbol{x}\right]\right) = \theta_0 + \sum_if_i\left(x_i\right) + \sum_{ij}f_{ij}\left(x_i, x_j\right) \quad \left(i \ne j\right),$$
where,
\(g\) is a link function that allows the model to handle various response types (e.g., the logit link for logistic regression or the log link for modeling counts and rates);
\(\theta_0\) is a constant intercept (or bias term);
\(f_i\) is the term contribution (or shape function) for predictor \(x_i\) (i.e., it captures the main effect of \(x_i\) on \(E\left[Y|\boldsymbol{x}\right]\));
\(f_{ij}\) is the term contribution for the pair of predictors \(x_i\) and \(x_j\) (i.e., it captures the joint effect, or pairwise interaction effect of \(x_i\) and \(x_j\) on \(E\left[Y|\boldsymbol{x}\right]\)).
if (FALSE) {
#
# Regression example
#
# Fit a default EBM regressor
fit <- ebm(mpg ~ ., data = mtcars, objective = "rmse")
# Generate some predictions
head(predict(fit, newdata = mtcars))
head(predict(fit, newdata = mtcars, se_fit = TRUE))
# Show global summary and GAM shape functions
plot(fit) # term importance scores
plot(fit, term = "cyl")
plot(fit, term = "cyl", interactive = TRUE)
# Explain prediction for first observation
plot(fit, local = TRUE, X = subset(mtcars, select = -mpg)[1L, ])
}