
BayesSummaryStatLM (version 1.0-1)

bayes.regress: MCMC posterior sampling of Bayesian linear regression model parameters using only summary statistics

Description

This function generates MCMC posterior samples of the Bayesian linear regression model parameters, using only the summary statistics $X'X$, $X'Y$ and $Y'Y$ (e.g. as calculated by the function read.regress.data.ff() in this package). The samples are generated according to the user-specified choices of prior distributions, hyperprior distributions and fixed parameter values where required; the user also specifies starting values for unknown model parameters.
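
As a minimal illustration (with hypothetical simulated data, not part of the package), note that the three summary statistics are additive over row-blocks of the data; this is why they can be accumulated chunk-by-chunk for data sets too large to hold in memory:

```r
# Sketch: X'X and X'Y accumulated over row-blocks equal the full-data products.
# (Illustrative only; read.regress.data.ff() computes these statistics from files.)
set.seed(1)
X <- cbind(1, matrix(rnorm(60), nrow = 30))  # 30 rows: intercept + 2 predictors
Y <- rnorm(30)

blocks <- split(seq_len(30), rep(1:3, each = 10))  # three chunks of 10 rows
xtx <- Reduce(`+`, lapply(blocks, function(i) crossprod(X[i, , drop = FALSE])))
xty <- Reduce(`+`, lapply(blocks, function(i) crossprod(X[i, , drop = FALSE], Y[i])))

all.equal(xtx, crossprod(X))     # TRUE
all.equal(xty, crossprod(X, Y))  # TRUE
```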

Usage

bayes.regress(data.values=NULL, 
              beta.prior=list("flat"), 
              sigmasq.prior=list("inverse.gamma", 1.0, 1.0, 1.0),
              Tsamp.out=1000, zero.intercept=FALSE)

Arguments

data.values
a list with four (optionally five) components, which are created by the function read.regress.data.ff() (in this package) or supplied directly by the user:
  • xtx: a square matrix that stores the product $X'X$, where $X$ is the data from the predictor columns (including a leading column of 1's for the y-intercept, unless zero.intercept=TRUE)
  • xty: a vector that stores the product $X'Y$, where $Y$ is the data from the response column
  • yty: a scalar that stores the product $Y'Y$
  • numsamp.data: the number of data observations
  • xtx.inv (optional): the inverse of the matrix xtx
beta.prior
a list that specifies the characteristics of the prior distribution for $\beta$, the vector of coefficients of the Bayesian linear regression model. There are three possible types:
  • flat: Uniform distribution (the default).
  • mvnorm.known: multivariate Normal distribution with known mean vector mean.mu and known precision matrix prec.Cinv.
  • mvnorm.unknown: multivariate Normal distribution with unknown mean vector and unknown precision matrix, which are assigned Normal and Wishart hyperprior distributions, respectively (with hyperparameters mu.hyper.mean.eta, mu.hyper.prec.Dinv, Cinv.hyper.df.lambda and Cinv.hyper.invscale.Vinv, and starting values mu.init and Cinv.init).
sigmasq.prior
a list that specifies the characteristics of the prior distribution for $\sigma^2$, the error variance of the Bayesian linear regression model. There are two possible types:
  • inverse.gamma: Inverse Gamma distribution with shape inverse.gamma.a and scale inverse.gamma.b (the default, with a = b = 1).
  • sigmasq.inverse: the non-informative prior proportional to $1/\sigma^2$.
  Both types require sigmasq.init, the initial value of the unknown $\sigma^2$ parameter for the MCMC chain; default = 1.
Tsamp.out
the number of MCMC posterior samples to generate; default = 1000.
zero.intercept
logical; if TRUE, the y-intercept is removed from the model. Default = FALSE (i.e., the y-intercept is included).

Value

The returned value is a list containing the MCMC samples of the unknown Bayesian linear regression model parameters; the number of MCMC samples is equal to the argument Tsamp.out. Further analysis, including plotting and creating summary statistics, can be carried out using the 'coda' R package (see References). The list includes:
  • beta: a matrix whose rows are the MCMC samples of the coefficient vector $\beta$.
  • sigmasq: a vector of MCMC samples of $\sigma^2$.

Details

This function uses the following Bayesian linear regression model: $$y_i=x_i' \beta + \epsilon_i,$$ where $i = 1,...,\mathbf{numsamp.data}$; $\epsilon_i \sim N(0,\sigma^2)$; $k$ is the number of predictor variables. The function uses user-supplied prior distributions for $\beta$ and $\sigma^2$. The Gibbs sampler is used to sample from all full conditional posterior distributions, which only depend on the summary statistics $X'X$, $X'Y$ and $Y'Y$ (and $Y'X = (X'Y)'$); these summary statistics are calculated by the function read.regress.data.ff() (in this package), or can be provided by the user. Starting values are not needed for the vector $\beta$, since this vector is updated first, conditioned on all other unknown model parameters and the data.
  • The full conditional posterior distributions are the following for each prior specification of $\beta$ and $\sigma^2$; these depend on the data only through the summary statistics $X'X$, $X'Y$ and $Y'Y$:
  • beta.prior=list(type = "flat"): $$\beta | \sigma^2, X, Y \sim Normal_{k+1}\left(mean=(X'X)^{-1}X'Y,\ covariance=\sigma^2(X'X)^{-1}\right)$$
  • beta.prior=list(type = "mvnorm.known"): $$\beta | \sigma^2, X, Y \sim Normal_{k+1}\left(mean=(C^{-1}+\sigma^{-2}X'X)^{-1}(C^{-1}\mu + \sigma^{-2}X'Y),\ covariance=(C^{-1}+\sigma^{-2}X'X)^{-1}\right)$$
  • beta.prior=list(type = "mvnorm.unknown"): $$\beta | \sigma^2, \mu, C^{-1}, X, Y \sim Normal_{k+1}\left(mean=(C^{-1}+\sigma^{-2}X'X)^{-1}(C^{-1}\mu + \sigma^{-2}X'Y),\ covariance=(C^{-1}+\sigma^{-2}X'X)^{-1}\right)$$ $$\mu | \beta, \sigma^2, C^{-1}, X, Y \sim Normal_{k+1}\left(mean=(D^{-1}+C^{-1})^{-1}(C^{-1}\beta+D^{-1}\eta),\ covariance=(D^{-1}+C^{-1})^{-1}\right)$$ $$C^{-1} | \beta, \sigma^2, \mu, X, Y \sim Wishart_{k+1}\left(d.f.=1+\lambda,\ scale\ matrix=(V^{-1}+(\beta-\mu)(\beta-\mu)')^{-1}\right)$$
  • sigmasq.prior=list(type = "inverse.gamma"): $$\sigma^2 | \beta, X, Y \sim \mathrm{Inv{-}Gamma}\left(\frac{\mathrm{numsamp.data}}{2}+a,\ \left(\frac{1}{2}(Y'Y-\beta'X'Y-Y'X\beta+\beta'X'X\beta)+\frac{1}{b}\right)^{-1}\right)$$
  • sigmasq.prior=list(type = "sigmasq.inverse"): $$\sigma^2 | \beta, X, Y \sim \mathrm{Inv{-}Gamma}\left(\frac{\mathrm{numsamp.data}}{2},\ \left(\frac{1}{2}(Y'Y-\beta'X'Y-Y'X\beta+\beta'X'X\beta)\right)^{-1}\right)$$
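
One Gibbs iteration can be sketched directly in R (a minimal illustration with simulated data, assuming the flat prior for $\beta$ and the sigmasq.inverse prior for $\sigma^2$; this is not the package's internal code):

```r
# Minimal sketch of one Gibbs update; note that both draws touch the data
# only through the summary statistics xtx, xty and yty.
set.seed(1)
n <- 50; k <- 2
X <- cbind(1, matrix(rnorm(n * k), n, k))           # intercept + k predictors
Y <- as.numeric(X %*% c(1, -2, 0.5) + rnorm(n, sd = 0.3))

xtx <- crossprod(X)       # X'X
xty <- crossprod(X, Y)    # X'Y
yty <- sum(Y * Y)         # Y'Y
xtx.inv <- chol2inv(chol(xtx))

sigmasq <- 1              # current value of sigma^2 in the chain

# beta | sigma^2, X, Y ~ Normal_{k+1}((X'X)^{-1} X'Y, sigma^2 (X'X)^{-1})
beta <- xtx.inv %*% xty + t(chol(sigmasq * xtx.inv)) %*% rnorm(k + 1)

# sigma^2 | beta, X, Y ~ Inv-Gamma(n/2, rate = (Y'Y - 2 beta'X'Y + beta'X'X beta)/2)
rate <- 0.5 * (yty - 2 * sum(beta * xty) + as.numeric(t(beta) %*% xtx %*% beta))
sigmasq <- 1 / rgamma(1, shape = n / 2, rate = rate)
```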

References

Carlin, B.P. and Louis, T.A. (2009) Bayesian Methods for Data Analysis, 3rd ed., Boca Raton, FL: Chapman and Hall/CRC Press.
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A. and Rubin, D.B. (2013) Bayesian Data Analysis, 3rd ed., Boca Raton, FL: Chapman and Hall/CRC Press.
Plummer, M., Best, N., Cowles, K. and Vines, K. (2006) CODA: Convergence diagnosis and output analysis for MCMC. R News, 6(1), 7-11.
Adler, D., Glaser, C., Nenadic, O., Oehlschlagel, J. and Zucchini, W. (2013) ff: memory-efficient storage of large data on disk and fast access functions. R package: http://CRAN.R-project.org/package=ff.
Fasiolo, M. (2014) An introduction to mvnfast. R package: http://CRAN.R-project.org/package=mvnfast.

Examples

##################################################
## Simulate data 
##################################################

library(BayesSummaryStatLM)
library(mvnfast)   # provides rmvn(), used below to simulate predictors

set.seed(284698)

num.samp  <- 100 # number of data values to simulate

# The first value of the beta vector is the y-intercept:
beta <- c(-0.33, 0.78, -0.29, 0.47, -1.25)

# Calculate the number of predictor variables:
num.pred <- length(beta)-1

rho       <- 0.0  # correlation between predictors
mean.vec  <- rep(0,num.pred)
sigma.mat <- matrix(rho,num.pred,num.pred) + diag(1-rho,num.pred)
sigmasq.sim <- 0.05

# Simulate predictor variables:
x.pre       <- rmvn(num.samp, mu=mean.vec, sigma=sigma.mat)       

# Add leading column of 1's to x.pre for y-intercept:
x <- cbind(rep(1,num.samp),x.pre)

epsilon <- rnorm(num.samp, mean=0, sd=sqrt(sigmasq.sim))

y  <- as.numeric( x %*% as.matrix(beta) +  epsilon)

## Compute summary statistics. (Alternatively, the
# "read.regress.data.ff()" function (in this package) can be 
# used to calculate summary statistics; see example below.)

xtx <- t(x)%*%x 
xty <- t(x)%*%y 
yty <- t(y)%*%y 

data.values<-list(xtx=xtx, xty=xty, yty=yty,
                  numsamp.data = num.samp, 
                  xtx.inv = chol2inv(chol(xtx)))

##########################################################
## Bayesian linear regression analysis
##########################################################

Tsamp.out <- 100 # number of MCMC samples to produce

## Choose priors for beta and sigma-squared. Here,
# beta: Uniform prior; sigma-squared: Inverse Gamma prior. 

beta.prior    <- list( type = "flat")
sigmasq.prior <- list(type = "inverse.gamma", inverse.gamma.a = 1.0, 
                      inverse.gamma.b = 1.0, sigmasq.init = 1.0 )

set.seed(284698)

# Run the "bayes.regress" function using the data simulated above.

MCMC.out <- bayes.regress(data.values, 
                          beta.prior, 
                          sigmasq.prior = sigmasq.prior, 
                          Tsamp.out = Tsamp.out)

# Next, print the posterior means of the unknown model parameters.
# Alternatively, the "coda" package can be used for analysis.

print(c(colMeans(MCMC.out$beta), mean(MCMC.out$sigmasq)))

# Check that output is close to simulated values (although num.samp and
# Tsamp.out are small here); note that the output includes both beta and 
# sigmasq:
# c(-0.33,  0.78, -0.29,  0.47, -1.25,  0.05)

## Run all 6 combinations of priors for 3 "beta.prior" choices and 
#  2 "sigmasq.prior" choices:

beta.priors <- list(
  list( type = "flat"),
  
  list( type = "mvnorm.known", 
        mean.mu = rep(0.0,    (num.pred+1)), 
        prec.Cinv = diag(1.0, (num.pred+1))),
        
  list( type = "mvnorm.unknown",
        mu.hyper.mean.eta         = rep(0.0,(num.pred+1)),  
        mu.hyper.prec.Dinv        = diag(1.0, (num.pred+1)),  
        Cinv.hyper.df.lambda      = (num.pred+1), 
        Cinv.hyper.invscale.Vinv  = diag(1.0, (num.pred+1)),  
        mu.init                   = rep(1.0, (num.pred+1)), 
        Cinv.init                 = diag(1.0,(num.pred+1)))   
)

sigmasq.priors <- list(
  list(type = "inverse.gamma", 
       inverse.gamma.a = 1.0, 
       inverse.gamma.b = 1.0, 
       sigmasq.init = 0.1 ),       
  list( type="sigmasq.inverse", sigmasq.init = 0.1)
)

for (beta.prior in beta.priors)
{
  for(sigmasq.prior in sigmasq.priors)
  {
    set.seed(284698)
    MCMC.out <- bayes.regress(data.values, 
                              beta.prior, 
                              sigmasq.prior = sigmasq.prior, 
                              Tsamp.out = Tsamp.out)
    print(c(colMeans(MCMC.out$beta), mean(MCMC.out$sigmasq)))
  }
}

# Check that output is close to simulated values (although num.samp and
# Tsamp.out are small here); note that the output includes both beta and 
# sigmasq:
# c(-0.33,  0.78, -0.29,  0.47, -1.25,  0.05)


#######################################################################
## Read the data from a file, calculate the summary statistics and run 
## the Bayesian linear regression analysis
#######################################################################

Tsamp.out <- 100

## Assume non-zero y-intercept data.

# Read the files and compute summary statistics using the "read.regress.data.ff()" 
# function (in this package).


filename <- system.file('data/regressiondata.nz.all.csv.gz', package='BayesSummaryStatLM')
data.values <- read.regress.data.ff(filename)

# Calculate the number of predictors.

num.pred <- length(data.values$xty)-1

## Run all 6 combinations of priors for 3 "beta.prior" choices and 
#  2 "sigmasq.prior" choices:

beta.priors <- list(
  list( type = "flat"),
  
  list( type = "mvnorm.known", 
        mean.mu = rep(0.0,    (num.pred+1)), 
        prec.Cinv = diag(1.0, (num.pred+1))),
        
  list( type="mvnorm.unknown",
        mu.hyper.mean.eta         = rep(0.0,  (num.pred+1)),  
        mu.hyper.prec.Dinv        = diag(1.0, (num.pred+1)),  
        Cinv.hyper.df.lambda      = (num.pred+1), 
        Cinv.hyper.invscale.Vinv  = diag(1.0, (num.pred+1)),  
        mu.init                   = rep(1.0, (num.pred+1)),      
        Cinv.init                 = diag(1.0,(num.pred+1)))   
)

sigmasq.priors <- list(
  list(type = "inverse.gamma", inverse.gamma.a = 1.0, 
               inverse.gamma.b = 1.0, sigmasq.init = 0.5 ),
  list( type = "sigmasq.inverse", sigmasq.init = 0.5)
)

for (beta.prior in beta.priors)
{
  for(sigmasq.prior in sigmasq.priors)
  {

    set.seed(284698)
    MCMC.out <- bayes.regress(data.values, 
                              beta.prior, 
                              sigmasq.prior = sigmasq.prior, 
                              Tsamp.out = Tsamp.out)
                              
    print(c(colMeans(MCMC.out$beta), mean(MCMC.out$sigmasq)))
  }
}

# Check that output is close to simulated values (although num.samp and
# Tsamp.out are small here); note that the output includes both beta and                           
# sigmasq:
# c( 0.76, -0.92, 0.64, 0.57, -1.65, 0.25)
