
NormalBetaPrime (version 2.2)

nbp: Normal-Beta Prime Regression

Description

This function implements the Monte Carlo EM/Gibbs sampling approach for fitting the normal-beta prime (NBP) model in the standard linear regression model,

$$y = X \beta + \epsilon,$$

where \(\epsilon \sim N_n (0, \sigma^2 I_n)\). This is achieved by placing the normal-beta prime (NBP) prior of Bai and Ghosh (2019) on the coefficients \(\beta\). In the case where \(p > n\), we utilize the efficient sampler of Bhattacharya et al. (2016) to reduce the computational cost of sampling from the full conditional density of \(\beta\) to \(O(n^2 p)\). The hyperparameters can be fixed by the user, or they may be selected automatically by marginal maximum likelihood (MML).
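Concretely, the full conditional here is \(\beta \mid \sigma^2, \omega^2 \sim N((X^T X + D^{-1})^{-1} X^T y, \ \sigma^2 (X^T X + D^{-1})^{-1})\) with \(D = diag(\omega_1^2, \ldots, \omega_p^2)\), and the Bhattacharya et al. (2016) algorithm draws from it exactly while only solving an \(n \times n\) linear system. A minimal sketch of that single step (the function name and its arguments are illustrative, not part of the package API):

fast.beta.draw <- function(X, y, sigma2, omega2) {
  # One exact draw from the full conditional of beta, given the current
  # Gibbs values of sigma^2 (scalar) and omega^2 (length-p vector)
  n <- nrow(X); p <- ncol(X)
  sigma <- sqrt(sigma2)
  u     <- rnorm(p, sd = sigma * sqrt(omega2))   # u ~ N(0, sigma^2 D)
  delta <- rnorm(n)                              # delta ~ N(0, I_n)
  v     <- (X %*% u) / sigma + delta
  M     <- X %*% (omega2 * t(X)) + diag(n)       # X D X^T + I_n: n x n, O(n^2 p) to form
  w     <- solve(M, y / sigma - v)
  as.numeric(u + sigma * (omega2 * crossprod(X, w)))
}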

Note that the Gibbs sampling implementation is slower than the variational Bayes (VB) function, nbp.VB, but nbp tends to give much more accurate point estimates and posterior approximations. Use of nbp is therefore recommended for final inference.

Usage

nbp(X, y, method.hyperparameters=c("fixed","mml"), a=0.5, b=0.5, c=1e-5, d=1e-5,  
    selection=c("dss", "intervals"), max.steps = 15000, burnin=10000)

Arguments

X

\(n \times p\) design matrix. Should be centered.

y

\(n \times 1\) response vector. Should be centered.

method.hyperparameters

The method for estimating the shape parameters \((a,b)\). If "mml" is chosen, the function estimates the marginal maximum likelihood (MML) estimates of \((a,b)\) by the EM algorithm described in Bai and Ghosh (2019). If "fixed" is chosen, then \((a,b)\) are not estimated from the data.

a

Shape parameter for \(\beta'(a,b)\). The default is 0.5. The user may specify a different value for a (\(a>0\)). This is ignored if the method for estimating hyperparameters is "mml".

b

Shape parameter for \(\beta'(a,b)\). The default is 0.5. The user may specify a different value for b (\(b>0\)). This is ignored if the method for estimating hyperparameters is "mml".

c

The shape parameter for the \(IG(c,d)\) prior on the unknown variance parameter, \(\sigma^2\). The default is \(10^{-5}\).

d

The rate parameter for the \(IG(c,d)\) prior on the unknown variance parameter, \(\sigma^2\). The default is \(10^{-5}\).

selection

The method of variable selection. "dss" implements the decoupled shrinkage and selection (DSS) method of Hahn and Carvalho (2015) to select variables. "intervals" performs variable selection by examining the 95 percent posterior credible intervals for each coefficient, \(\beta_i, i = 1, \ldots, p\) (a short sketch of this rule follows the argument list).

max.steps

The total number of iterations to run in the Gibbs sampler. Defaults to 15,000.

burnin

The number of burn-in iterations for the Gibbs sampler. Defaults to 10,000.
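As a sketch of the "intervals" rule above: a coefficient is selected precisely when its marginal 95 percent credible interval excludes zero. Assuming a fitted object fit returned by nbp (the object name is illustrative, and beta.intervals is taken to store lower endpoints in row 1 and upper endpoints in row 2):

ci <- fit$beta.intervals                           # 2 x p matrix of interval endpoints
selected <- as.integer(ci[1, ] > 0 | ci[2, ] < 0)  # 1 = interval excludes zero
which(selected == 1)                               # indices of selected covariates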

Value

The function returns a list containing the following components:

beta.hat

The posterior mean estimate of \(\beta\).

beta.med

The posterior median estimate of \(\beta\).

beta.var

The posterior variance estimates for each \(\beta_i, i=1, \ldots, p.\)

beta.intervals

The 95 percent credible intervals for all \(p\) estimates in \(\beta\).

nbp.classifications

A \(p\)-dimensional binary vector with "1" if the covariate is selected and "0" if it is deemed irrelevant.

sigma2.estimate

Estimate of unknown variance component \(\sigma^2\).

a.estimate

MML estimate of shape parameter \(a\). If \(a\) was fixed a priori, returns fixed \(a\).

b.estimate

MML estimate of shape parameter \(b\). If \(b\) was fixed a priori, returns fixed \(b\).
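These components can be combined for a quick visual summary of the uncertainty quantification; a minimal sketch, again assuming a fitted object fit:

p <- length(fit$beta.med)
plot(1:p, fit$beta.med, pch = 20, xlab = "coefficient index",
     ylab = "posterior median", ylim = range(fit$beta.intervals))
segments(x0 = 1:p, y0 = fit$beta.intervals[1, ], y1 = fit$beta.intervals[2, ])
abline(h = 0, lty = 2)  # reference line at zero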

Details

The function implements the normal-beta prime (NBP) model of Bai and Ghosh (2019) using Gibbs sampling. The posterior variances and 95 percent credible intervals for each of the \(p\) covariates are also returned so that the user may quantify uncertainty. The full model is:

$$Y | (X, \beta) \sim N_n(X \beta, \sigma^2 I_n),$$
$$\beta_i | \omega_i^2 \sim N(0, \sigma^2 \omega_i^2), \quad i = 1, \ldots, p,$$
$$\omega_i^2 \sim \beta'(a,b), \quad i = 1, \ldots, p,$$
$$\sigma^2 \sim IG(c,d),$$

where \(\beta'(a,b)\) denotes the beta prime density,

$$\pi(\omega_i^2) = \frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} (\omega_i^2)^{a-1} (1+\omega_i^2)^{-a-b}. $$
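Equivalently, \(\beta'(a,b)\) is the distribution of a ratio of two independent gamma random variables with shapes \(a\) and \(b\), which gives a quick simulation check of this density (a standalone illustration, not part of the package):

a <- 0.5; b <- 0.5
set.seed(123)
w2 <- rgamma(1e5, shape = a) / rgamma(1e5, shape = b)  # w2 ~ beta'(a, b)
# If w2 ~ beta'(a, b), then w2/(1 + w2) ~ Beta(a, b)
ks.test(w2 / (1 + w2), "pbeta", a, b)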

References

Bai, R. and Ghosh, M. (2019). "On the beta prime prior for scale parameters in high-dimensional Bayesian regression models." Pre-print, arXiv:1807.06539.

Bhattacharya, A., Chakraborty, A., and Mallick, B.K. (2016). "Fast sampling with Gaussian scale mixture priors in high-dimensional regression." Biometrika, 103(4): 985-991.

Hahn, P. R. and Carvalho, C. M. (2015). "Decoupling shrinkage and selection in Bayesian linear models: A posterior summary perspective." Journal of the American Statistical Association, 110(509):435-448.

Examples

###############################
## Example on diabetes data. ## 
###############################
data(diabetes)
attach(diabetes)
X <- scale(diabetes$x) # Center and scale X
y <- scale(diabetes$y) # Center and scale y

################################
# Fit the NBP regression model #
################################
# The defaults are max.steps=15,000 and burnin=10,000; smaller values
# are used here only to shorten the run time
nbp.model <- nbp(X=X, y=y, method.hyperparameters="mml", 
                max.steps=5000, burnin=2500, selection="dss")

nbp.model$beta.med     # posterior median estimates
nbp.model$a.estimate   # MML estimate of shape parameter 'a'
nbp.model$b.estimate   # MML estimate of shape parameter 'b'

nbp.model$beta.intervals         # 95 percent posterior credible intervals
nbp.model$nbp.classifications    # Variables selected

####################################################
# TRIM32 gene expression data analysis.            #
# Running this code will allow you to reproduce    #
# the results in Section 7 of Bai and Ghosh (2019) #
####################################################

# Load the data
data(eyedata)

# Set seed
set.seed(1)

# Center design matrix X and response vector y
X <- scale(genes, center=TRUE, scale=TRUE) # gene expression data (covariates)
y <- scale(trim32, center=TRUE, scale=TRUE) # levels of TRIM32 (response)
  
###########################
# Implement the NBP model #
###########################
nbp.model <- nbp(X, y, method.hyperparameters="mml", selection="dss")

# Variables selected
active.indices <- which(nbp.model$nbp.classifications != 0) # active genes
active.estimates <- nbp.model$beta.med[active.indices]
active.CIs <- nbp.model$beta.intervals[, active.indices]


###################################
# Evaluate predictive performance #
###################################
k <- 5  # Number of folds

# Randomly select indices for the 5 different folds 
folds <- split(sample(nrow(X), nrow(X), replace=FALSE), as.factor(1:k)) 

# To store the mean square prediction error
mspe.nbp <- rep(NA, k)

for(i in 1:k){

  # Split data into training set and test set based on folds
  test_ind <- folds[[i]]

  # 80 percent training set
  y.train <- y[-test_ind]
  X.train <- X[-test_ind, ]

  # Remaining 20 percent is the test set
  y.test <- y[test_ind]
  X.test <- X[test_ind, ]

  # Run NBP model on training set
  nbp.train <- nbp(X=X.train, y=y.train, method.hyperparameters="mml", selection="dss")
  beta.train <- nbp.train$beta.med

  # Obtain predictions on test set
  nbp.pred <- X.test %*% beta.train

  # Compute the MSPE on test set
  mspe.nbp[i] <- mean((nbp.pred - y.test)^2)
}

mean.nbp.mspe <- mean(mspe.nbp)
mean.nbp.mspe

