# -----------------------------------------------------------------
# By default bayesreg now utilizes multiple cores if they are
# available. If you do not want to use multiple cores you can set
# n.cores=1 when calling bayesreg.
#
# In most realistic/practical settings, i.e., when even a moderate
# number of samples is being requested, parallelization will usually
# result in substantial speedups, and can dramatically reduce run-time
# for large numbers of samples.
#
# However, if your design matrix and desired number of samples
# are small (i.e., small nrow(X), ncol(X) and n.samples),
# then no parallelization, or parallelization with a small number of
# cores, can sometimes be quicker due to the overhead of setting up
# multiple threads. This is likely only relevant if you are calling bayesreg
# hundreds or thousands of times with a *small* number of samples requested
# for each call, e.g., if you are doing some sort of one-at-a-time testing
# across many predictors, such as a genome-wide association test or similar.
# In this case you may be better off parallelizing the calls to
# bayesreg across the tests and disabling parallelization within each call
# (i.e., n.cores=1); a minimal sketch of this pattern follows this note.
#
# -----------------------------------------------------------------
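# A minimal sketch of the one-at-a-time testing pattern described above,
# using a toy design matrix and response (both assumed here purely for
# illustration). Each predictor is tested separately; the calls to bayesreg
# are spread across cores with parallel::mclapply(), while each individual
# fit runs single-threaded (n.cores=1). Note that mclapply() cannot fork on
# Windows, so we fall back to one worker there.
library(parallel)
X.toy <- matrix(rnorm(200*50), 200, 50)        # toy design matrix (assumed)
y.toy <- X.toy[,1] - X.toy[,2] + rnorm(200)    # toy response (assumed)
n.workers <- if (.Platform$OS.type == "windows") 1 else 4
fits <- mclapply(1:ncol(X.toy), function(j) {
  df.j <- data.frame(x = X.toy[,j], y = y.toy)
  bayesreg(y~x, df.j, prior="ridge", n.samples=1e3, n.cores=1)
}, mc.cores = n.workers)
# e.g., extract the posterior mean coefficient of each single-predictor fit
sapply(fits, function(f) f$mu.beta)
# -----------------------------------------------------------------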
# Example 1: Gaussian regression
X = matrix(rnorm(100*20),100,20)
b = matrix(0,20,1)
b[1:5] = c(5,4,3,2,1)
y = X %*% b + rnorm(100, 0, 1)
df <- data.frame(X,y)
rv.lm <- lm(y~.,df) # Regular least-squares
summary(rv.lm)
# Horseshoe regression -- here we show how to explicitly limit the maximum
# number of cores used for sampling to 4
rv.hs <- bayesreg(y~., df, prior="hs", n.cores=4)
rv.hs$n.cores # actual number of cores used (will be <= 4, depending on machine)
rv.hs.s <- summary(rv.hs)
# Expected squared prediction error for least-squares
coef_ls = coef(rv.lm)
as.numeric(sum( (as.matrix(coef_ls[-1]) - b)^2 ) + coef_ls[1]^2)
# Expected squared prediction error for horseshoe
as.numeric(sum( (rv.hs$mu.beta - b)^2 ) + rv.hs$mu.beta0^2)
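# As a further (rough) check, we can compare out-of-sample prediction error on
# freshly simulated test data. This is a minimal sketch: the test-set size of
# 1,000 is an arbitrary choice, and the column names are copied from the
# training data so that predict() can find the predictors.
X.tst = matrix(rnorm(1000*20),1000,20)
y.tst = X.tst %*% b + rnorm(1000, 0, 1)
df.tst <- data.frame(X.tst)
colnames(df.tst) <- colnames(df)[1:20]
df.tst$y <- as.numeric(y.tst)
# Test mean squared error for least-squares
mean((y.tst - predict(rv.lm, df.tst))^2)
# Test mean squared error for horseshoe (posterior predictive mean)
yhat.hs <- predict(rv.hs, df.tst, bayes.avg=TRUE)
mean((y.tst - yhat.hs[,1])^2)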
# -----------------------------------------------------------------
# Example 2: Gaussian v Student-t robust regression
X = 1:10
y = c(-0.6867, 1.7258, 1.9117, 6.1832, 5.3636, 7.1139, 9.5668, 10.0593, 11.4044, 6.1677)
df = data.frame(X,y)
# Gaussian ridge
rv.G <- bayesreg(y~., df, model = "gaussian", prior = "ridge", n.samples = 1e3)
# Student-t ridge
rv.t <- bayesreg(y~., df, model = "t", prior = "ridge", t.dof = 5, n.samples = 1e3)
# Plot the different estimates with credible intervals
plot(df$X, df$y, xlab="x", ylab="y")
yhat_G <- predict(rv.G, df, bayes.avg=TRUE)
lines(df$X, yhat_G[,1], col="blue", lwd=2.5)
lines(df$X, yhat_G[,3], col="blue", lwd=1, lty="dashed")
lines(df$X, yhat_G[,4], col="blue", lwd=1, lty="dashed")
yhat_t <- predict(rv.t, df, bayes.avg=TRUE)
lines(df$X, yhat_t[,1], col="darkred", lwd=2.5)
lines(df$X, yhat_t[,3], col="darkred", lwd=1, lty="dashed")
lines(df$X, yhat_t[,4], col="darkred", lwd=1, lty="dashed")
legend(1,11,c("Gaussian","Student-t (dof=5)"),lty=c(1,1),col=c("blue","darkred"),
lwd=c(2.5,2.5), cex=0.7)
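# The two fits can also be compared by their WAIC scores (the same criterion
# used in Example 3 below); a minimal sketch:
cat(sprintf("Gaussian ridge WAIC=%g vs Student-t ridge WAIC=%g\n",
    rv.G$waic, rv.t$waic))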
if (FALSE) {
# -----------------------------------------------------------------
# Example 3: Poisson/geometric regression example
X = matrix(rnorm(100*5),100,5)
b = c(0.5,-1,0,0,1)
nu = X%*%b + 1
y = rpois(n=length(nu), lambda=exp(nu))
df <- data.frame(X,y)
# Fit a Poisson regression
rv.pois=bayesreg(y~.,data=df,model="poisson",prior="hs", burnin=1e4, n.samples=5e4)
summary(rv.pois)
# Fit a geometric regression
rv.geo=bayesreg(y~.,data=df,model="geometric",prior="hs", burnin=1e4, n.samples=5e4)
summary(rv.geo)
# Compare the two models in terms of their WAIC scores
cat(sprintf("Poisson regression WAIC=%g vs geometric regression WAIC=%g",
rv.pois$waic, rv.geo$waic))
# Poisson is clearly preferred to geometric, which is good, as the data were generated from a Poisson model!
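# As a quick sanity check (a minimal sketch), compare the posterior mean
# coefficients of both fits against the true coefficients used to simulate
# the data; mu.beta is the same quantity used in Example 1
cbind(true = b, poisson = as.vector(rv.pois$mu.beta),
      geometric = as.vector(rv.geo$mu.beta))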
# -----------------------------------------------------------------
# Example 4: Logistic regression on spambase
data(spambase)
# bayesreg expects binary targets to be factors
spambase$is.spam <- factor(spambase$is.spam)
# First take a subset of the data (1/10th) for training, reserving the rest for testing
spambase.tr = spambase[seq(1,nrow(spambase),10),]
spambase.tst = spambase[-seq(1,nrow(spambase),10),]
# Fit a model using logistic horseshoe for 2,000 samples
# In practice, > 10,000 samples would be a more realistic amount to draw
rv <- bayesreg(is.spam ~ ., spambase.tr, model = "logistic", prior = "horseshoe", n.samples = 2e3)
# Summarise the fit, sorting variables by their importance ranking
rv.s <- summary(rv,sort.rank=TRUE)
# Make predictions about testing data -- get class predictions and class probabilities
y_pred <- predict(rv, spambase.tst, type='class')
# Check how well our predictions did by generating a confusion matrix
table(y_pred, spambase.tst$is.spam)
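# Summarise the confusion matrix as an overall misclassification rate
# (a minimal sketch computed directly from the table above)
conf.mat <- table(y_pred, spambase.tst$is.spam)
cat('Test misclassification rate: ', 1 - sum(diag(conf.mat))/sum(conf.mat), '\n')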
# Calculate logarithmic loss on test data
y_prob <- predict(rv, spambase.tst, type='prob')
cat('Neg Log-Like for no Bayes average, posterior mean estimates: ', sum(-log(y_prob[,1])), '\n')
y_prob <- predict(rv, spambase.tst, type='prob', sum.stat="median")
cat('Neg Log-Like for no Bayes average, posterior median estimates: ', sum(-log(y_prob[,1])), '\n')
y_prob <- predict(rv, spambase.tst, type='prob', bayes.avg=TRUE)
cat('Neg Log-Like for Bayes average: ', sum(-log(y_prob[,1])), '\n')
}