Calculating p-values is not trivial for ALE statistics because ALE is
non-parametric and model-agnostic. Because ALE is non-parametric (that is,
it does not assume any particular distribution of data), the ale
package
generates p-values by calculating ALE for many random variables; this makes the
procedure somewhat slow. For this reason, they are not calculated by default;
they must be explicitly requested. Because the ale
package is model-agnostic (that is, it
works with any kind of R model), the ale()
function cannot always automatically
manipulate the model object to create the p-values. It can only do so for
models that follow the standard R statistical modelling conventions, which
includes almost all built-in R algorithms (like stats::lm()
and stats::glm()
) and many widely
used statistics packages (like mgcv
and survival
), but which excludes most
machine learning algorithms (like tidymodels
and caret
). For non-standard
algorithms, the user needs to do a little work to help the ale function
correctly manipulate its model object:
The full model call must be passed as a character string in the argument 'random_model_call_string', with two slight modifications as follows.
In the formula that specifies the model, you must add a variable named
'random_variable'. This corresponds to the random variables that create_p_funs()
will use to estimate p-values.
The dataset on which the model is trained must be named 'rand_data'. This corresponds to the modified datasets that will be used to train the random variables.
See the example below for how this is implemented.
create_p_funs(
data,
model,
...,
parallel = parallel::detectCores(logical = FALSE) - 1,
model_packages = as.character(NA),
random_model_call_string = NULL,
random_model_call_string_vars = character(),
y_col = NULL,
pred_fun = function(object, newdata, type = pred_type) {
stats::predict(object =
object, newdata = newdata, type = type)
},
pred_type = "response",
rand_it = 1000,
silent = FALSE,
.testing_mode = FALSE
)
The return value is a list of class c('p_funs', 'ale', 'list'
) with an
ale_version
attribute whose value is the version of the ale
package used to
create the object. See examples for an illustration of how to inspect this list.
Its elements are:
value_to_p
: a list of functions named for each each available ALE statistic.
Each function signature is function(x)
where x is a numeric. The function returns
the p-value (minimum 0; maximum 1) for the respective statistic based on the random variable analysis.
For an input x that returns p, its interpretation is that p% of random variables
obtained the same or higher statistic value. For example, to get the p-value
of a NALED of 4.2, enter p_funs$value_to_p(4.2)
. A return value of 0.03 means
that only 3% of random variables obtained a NALED greater than or equal to 4.2.
p_to_random_value
: a list of functions named for each each available ALE statistic.
These are the inverse functions of value_to_p
. The signature is function(p)
where p is a numeric from 0 to 1. The function returns the numeric value of the
random variable statistic that would yield the provided p-value.
For an input p that returns x, its interpretation is that p% of random variables
obtained the same or higher statistic value. For example, to get the random
variable ALED for the 0.05 p-value, enter p_funs$p_to_random_value(0.05)
.
A return value of 102 means that only 5% of random variables obtained an ALED
greater than or equal to 102.
rand_stats
: a tibble whose rows are each of the rand_it
iterations of the
random variable analysis and whose columns are the ALE statistics obtained for
each random variable.
residuals
: the actual y_col
values from data
minus the predicted
values from the model
(without random variables) on the data
.
residual_distribution
: the closest estimated distribution for the residuals
as determined by univariateML::rml()
. This is the distribution used to generate
all the random variables.
See documentation for ale()
See documentation for ale()
not used. Inserted to require explicit naming of subsequent arguments.
See documentation for ale()
See documentation for ale()
character string. If NULL, create_p_funs()
tries to
automatically detect and construct the call for p-values. If it cannot, the
function will fail early. In that case, a character string of the full call
for the model must be provided that includes the random variable. See details.
See documentation for model_call_string_vars
in model_bootstrap()
; the operation is very similar.
See documentation for ale()
See documentation for ale()
.
non-negative integer length 1. Number of times that the model should be retrained with a new random variable. The default of 1000 should give reasonably stable p-values. It can be reduced as low as 100 for faster test runs.
See documentation for ale()
logical. Internal use only.
The ale
package takes a literal frequentist approach to the calculation of
p-values. That is, it literally retrains the model 1000 times, each time
modifying it by adding a distinct random variable to the model.
(The number of iterations is customizable
with the rand_it
argument.) The ALEs and ALE statistics are calculated for
each random variable. The percentiles of the distribution of these
random-variable ALEs are then used to determine p-values for non-random variables.
Thus, p-values are interpreted as the frequency of random variable ALE statistics
that exceed the value of ALE statistic of the actual variable in question.
The specific steps are as follows:
The residuals of the original model trained on the training data are calculated (residuals are the actual y target value minus the predicted values).
The closest distribution of the residuals is detected with
univariateML::model_select()
.
1000 new models are trained by generating a random variable each time with
univariateML::rml()
and then training a new model with that random variable
added.
The ALEs and ALE statistics are calculated for each random variable.
For each ALE statistic, the empirical cumulative distribution function
(from stats::ecdf()
) is used to create a function to determine p-values
according to the distribution of the random variables' ALE statistics.
Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. https://arxiv.org/abs/2310.09877.
# \donttest{
# Sample 1000 rows from the ggplot2::diamonds dataset (for a simple example)
set.seed(0)
diamonds_sample <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ]
# Create a GAM model with flexible curves to predict diamond price
# Smooth all numeric variables and include all other variables
gam_diamonds <- mgcv::gam(
price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) +
cut + color + clarity,
data = diamonds_sample
)
summary(gam_diamonds)
# Create p-value functions
pf_diamonds <- create_p_funs(
diamonds_sample,
gam_diamonds,
# only 100 iterations for a quick demo; but usually should remain at 1000
rand_it = 100,
)
# Examine the structure of the returned object
str(pf_diamonds)
# In RStudio: View(pf_diamonds)
# Calculate ALEs with p-values
ale_gam_diamonds <- ale(
diamonds_sample,
gam_diamonds,
p_values = pf_diamonds
)
# Plot the ALE data. The horizontal bands in the plots use the p-values.
ale_gam_diamonds$plots |>
patchwork::wrap_plots()
# For non-standard models that give errors with the default settings,
# you can use 'random_model_call_string' to specify a model for the estimation
# of p-values from random variables as in this example.
# See details above for an explanation.
pf_diamonds <- create_p_funs(
diamonds_sample,
gam_diamonds,
random_model_call_string = 'mgcv::gam(
price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) +
cut + color + clarity + random_variable,
data = rand_data
)',
# only 100 iterations for a quick demo; but usually should remain at 1000
rand_it = 100,
)
# }
Run the code above in your browser using DataLab