bayesics (version 2.0.2)

np_glm_b: Non-parametric generalized linear models

Description

np_glm_b uses general Bayesian inference with the loss-likelihood bootstrap. As implemented here, it is a Bayesian non-parametric inferential engine for generalized linear models. Applicable data types are continuous (use family = gaussian()), count (use family = poisson() or negbinom()), or binomial (use family = binomial()).

Usage

np_glm_b(
  formula,
  data,
  family,
  loss = "selfinformation",
  loss_gradient,
  trials,
  n_draws,
  ask_before_full_sampling = TRUE,
  CI_level = 0.95,
  ROPE,
  seed = 1,
  mc_error = 0.01
)

Value

np_glm_b() returns an object of class "np_glm_b", which behaves as a list with the following elements:

  • summary - a tibble giving results for regression coefficients.

Arguments

formula

A formula specifying the model.

data

A data frame in which the variables specified in the formula are to be found. If missing, the variables are searched for in the standard way (i.e., in the environment from which np_glm_b() is called). However, it is strongly recommended that you supply this argument so that other generics for bayesics objects work correctly.

family

A description of the error distribution and link function to be used in the model; see ?glm for more information. Currently implemented families are binomial(), poisson(), negbinom(), and gaussian().

loss

Either "selfinformation", or a function that takes in two arguments, the first of which should be the vector of outcomes and the second should be the expected value of y; The outcome of the function should be the loss evaluated for each observation. By default, the self-information loss is used (i.e., the negative log-likelihood). Note: I really do mean the expected value of y, even for binomial (i.e., n*p). If family = negbinom(), then a user-supplied loss function should take three arguments: y, mu, and phi, where phi is the dispersion parameter (i.e., \(\text{Var}(y) = \mu + \mu^2/\phi\)).

loss_gradient

If loss is a user-defined function (as opposed to "selfinformation"), supplying the gradient of the loss will speed up the algorithm.
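
As an illustration, a hypothetical user-supplied loss and gradient might look as follows. The absolute-error loss here is purely illustrative, and the gradient's signature (taken with respect to the expected value, matching the loss's two arguments) is an assumption rather than documented API:

# Hypothetical L1 (absolute-error) loss: returns the loss per observation.
abs_loss <- function(y, mu) abs(y - mu)

# Assumed gradient signature: derivative of the loss with respect to mu
# (sign() returns 0 at y == mu, where the L1 loss is not differentiable).
abs_loss_grad <- function(y, mu) -sign(y - mu)

# fit <- np_glm_b(y ~ x, data = dat, family = gaussian(),
#                 loss = abs_loss, loss_gradient = abs_loss_grad)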

trials

Integer vector giving the number of trials for each observation if family = binomial().
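
For instance, with aggregated binomial data the call might look like this (dat, successes, dose, and n_i are hypothetical):

# Hypothetical aggregated-binomial fit: `successes` successes out of
# `n_i` trials per row of `dat`.
fit <- np_glm_b(successes ~ dose,
                data = dat,
                family = binomial(),
                trials = dat$n_i)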

n_draws

integer. Number of posterior draws to obtain. If left missing, the large sample approximation will be used.

ask_before_full_sampling

logical. If TRUE, the user will be asked to confirm whether they wish to commit to obtaining the full number of posterior draws needed for precise credible interval bounds. Defaults to TRUE because the bootstrap is computationally intensive. Parallelization via future::plan() is highly recommended for the full sample.
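
For example, to parallelize across four local background R sessions before committing to the full sample (the worker count is illustrative):

# Set a parallel backend for the bootstrap via the future package.
future::plan(future::multisession, workers = 4)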

CI_level

numeric. Credible interval level.

ROPE

vector of positive values giving ROPE (region of practical equivalence) boundaries for each regression coefficient. Optionally, the ROPE boundary for the intercept may be omitted. If missing, defaults follow those suggested by Kruschke (2018).
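
For instance, for a model with two slope coefficients you might declare effects smaller than 0.1 in absolute value to be practically equivalent to zero (names and values are illustrative):

# One ROPE half-width per slope coefficient; the intercept's is omitted here.
fit <- np_glm_b(y ~ x1 + x2,
                data = dat,
                family = gaussian(),
                ROPE = c(0.1, 0.1))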

seed

integer. Seed for the random number generator. Always set your seed!

mc_error

numeric. If the large sample approximation is not used, the number of posterior draws is chosen so that, with 99% probability, the bounds of the credible intervals are within \(\pm\) mc_error of their limiting values.

Details

Consider a population parameter of interest defined in terms of minimizing a loss function \(\ell\) with respect to the population distribution: $$ \theta(F_y) := \underset{\theta\in\Theta}{\text{argmin}} \int \ell(\theta,y)\,dF_y $$ If we place a non-parametric Dirichlet process prior on the distribution of \(y\), \(F_y\), and let the concentration parameter go to zero, we obtain the Bayesian bootstrap applied to a general Bayesian updating framework dictated by the loss function.

By default, the loss function is the self-information loss, i.e., the negative log-likelihood. The result then resembles a typical glm_b implementation, but is more robust to model misspecification.
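
To make the mechanics concrete, the sketch below approximates the loss-likelihood bootstrap by hand for a logistic regression under the self-information loss. It is illustrative only: np_glm_b() performs this resampling (plus the n_draws/mc_error logic) internally, and quasibinomial() is used here merely to suppress warnings about non-integer weights (its point estimates match binomial()).

# Each posterior draw minimizes a Dirichlet(1, ..., 1)-weighted loss.
set.seed(1)
n <- 200
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(-1 + d$x))

one_draw <- function() {
  w <- rexp(n)      # normalized exponentials are Dirichlet(1, ..., 1)
  w <- w / sum(w)
  # The weighted MLE minimizes the weighted negative log-likelihood.
  coef(glm(y ~ x, data = d, family = quasibinomial(), weights = n * w))
}

draws <- t(replicate(500, one_draw()))
apply(draws, 2, quantile, probs = c(0.025, 0.975))  # 95% credible intervals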

References

Kruschke, J. K. (2018). Rejecting or accepting parameter values in Bayesian estimation. Advances in Methods and Practices in Psychological Science, 1(2), 270–280.

Lyddon, S. P., Holmes, C. C., & Walker, S. G. (2019). General Bayesian updating and the loss-likelihood bootstrap. Biometrika, 106(2), 465–478. https://doi.org/10.1093/biomet/asz006

Examples

# \donttest{
# Generate some data
set.seed(2025)
N <- 500
test_data <- data.frame(x1 = rnorm(N),
                        x2 = rnorm(N),
                        x3 = letters[1:5])
test_data$outcome <-
  rbinom(N, 1, plogis(-2 + test_data$x1 + 2 * (test_data$x3 %in% c("d", "e"))))

# Fit the GLM via the (non-parametric) loss-likelihood bootstrap.
fit1 <- np_glm_b(outcome ~ x1 + x2 + x3,
                 data = test_data,
                 family = binomial())

# Methods available for "np_glm_b" objects:
fit1                                     # print method
summary(fit1, CI_level = 0.99)           # summary at a different CI level
plot(fit1)
coef(fit1)
credint(fit1)
predict(fit1, newdata = fit1$data[1, ])
vcov(fit1)
# }

