bayes.regress: MCMC posterior sampling of Bayesian linear regression model parameters using only summary statistics

Description

This function generates MCMC posterior samples of the Bayesian linear regression model parameters, using only the summary statistics $X'X$, $X'Y$ and $Y'Y$ (e.g. calculated by the function read.regress.data.ff() in this package). The samples are generated according to the user-specified prior distributions, hyperprior distributions and fixed parameter values where required; the user also specifies starting values for the unknown model parameters.

Usage

bayes.regress(data.values=NULL, 
              beta.prior=list("flat"), 
              sigmasq.prior=list("inverse.gamma", 1.0, 1.0, 1.0),
              Tsamp.out=1000, zero.intercept=FALSE)

Arguments

data.values

a list with four (optionally five) components, which are created by the function read.regress.data.ff() (in this package):

xtx: a square matrix that stores the product $X'X$, where $X$ is the data from predictor columns with a leading column of 1's for the y-intercept term.
xty: a column vector that stores the product $X'Y$, where $X$ is the same as above and $Y$ is a column of response data values.
yty: a scalar value that stores the product $Y'Y$, where $Y$ is the same as above.
numsamp.data: an integer equal to the number of observations, i.e. the number of rows of $X$ (and of $Y$).
xtx.inv (optional): the inverse of the matrix xtx that is used for the “Uniform” prior distribution for $\beta$ to speed up computations if the function is used repeatedly with the same xtx. If omitted, this inverse will be computed automatically. This component is ignored for other prior distributions.
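
As a rough illustration of the structure described above, the following base-R sketch builds such a list by hand from an in-memory design matrix. The simulated `X` and `y` are placeholders introduced here for illustration; read.regress.data.ff() would normally produce these components from a file.

```r
set.seed(1)
n <- 50

# Placeholder data: a leading column of 1's plus two predictors.
X <- cbind(1, matrix(rnorm(n * 2), n, 2))
y <- as.numeric(X %*% c(0.5, -1, 2) + rnorm(n, sd = 0.3))

# Summary statistics in the layout expected for data.values.
data.values <- list(
  xtx          = crossprod(X),      # X'X, a 3 x 3 matrix
  xty          = crossprod(X, y),   # X'Y, a 3 x 1 column vector
  yty          = sum(y^2),          # Y'Y, a scalar
  numsamp.data = n,                 # number of observations
  xtx.inv      = chol2inv(chol(crossprod(X)))  # optional inverse of X'X
)

str(data.values)
```

Once this list exists, the raw `X` and `y` are no longer needed, which is what makes the summary-statistic approach attractive for data that do not fit in memory.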

beta.prior

a list that specifies the characteristics of the prior distribution for $\beta$, the vector of coefficients of the Bayesian linear regression model. There are three possible types:

flat: Uniform distribution.
mvnorm.known: Multivariate Normal with known mean vector $\mu$ and known covariance matrix $C$.
mvnorm.unknown: Multivariate Normal with unknown mean vector $\mu$ and unknown covariance matrix $C$. This prior also includes the hyperpriors for $\mu$ and $C$, where $\mu \sim$ Multivariate Normal($\eta, D$), and $C^{-1} \sim$ Wishart(d.f. = $\lambda$, scale matrix = $V$); $\eta, D, \lambda, V$ are assumed known.

In each of these three prior types, the list has a different structure, as follows:

beta.prior=list(type = "flat"): a Uniform prior distribution for $\beta$; no other specification is necessary. This prior distribution is used by default.
beta.prior=list(type = "mvnorm.known", mean.mu = ..., cov.C = ..., prec.Cinv = ... )
- mean.mu: the fixed known prior mean vector $\mu$ for the Multivariate Normal prior of $\beta$. The default is a vector of 0's with length equal to the length of $\beta$.
- cov.C: the fixed known prior covariance matrix $C$ for the Multivariate Normal prior of $\beta$. The default is an identity matrix with dimension equal to the length of $\beta$.
- prec.Cinv: the inverse of the covariance matrix $C$ above. If cov.C is not specified, prec.Cinv is assigned the identity matrix by default, with dimension equal to the length of $\beta$.
Supplying prec.Cinv and omitting cov.C is recommended because it speeds up the algorithm (the sampler works with $C^{-1}$). If both are supplied, prec.Cinv takes precedence.
beta.prior=list(type = "mvnorm.unknown", mu.hyper.mean.eta = ..., mu.hyper.prec.Dinv = ..., Cinv.hyper.df.lambda = ..., Cinv.hyper.invscale.Vinv = ..., mu.init = ..., Cinv.init = ...)
- mu.hyper.mean.eta: the fixed known hyperparameter mean vector $\eta$ for the Multivariate Normal hyperprior mean $\mu$. The default is a vector of 0's with length equal to the length of $\beta$.
- mu.hyper.prec.Dinv: the fixed known hyperparameter precision matrix $D^{-1}$ for the Multivariate Normal hyperprior mean $\mu$. The default is an identity matrix with dimension equal to the length of $\beta$.
- Cinv.hyper.df.lambda: the fixed known degrees of freedom $\lambda$ for the Wishart hyperprior for $C^{-1}$. The default value is the length of $\beta$.
- Cinv.hyper.invscale.Vinv: the fixed known hyperparameter inverse scale matrix $V^{-1}$ for the Wishart hyperprior for $C^{-1}$. The default is an identity matrix with dimension equal to the length of $\beta$.
- mu.init: initial value for $\mu$ for the MCMC chain. The default is a vector of 1's with length equal to the length of $\beta$.
- Cinv.init: initial value for $C^{-1}$ for the MCMC chain. The default is an identity matrix with dimension equal to the length of $\beta$.

For all three of the above beta.prior distributions, only the type is mandatory; the remaining parameters are assigned default values if omitted.

sigmasq.prior

a list that specifies the characteristics of the prior distribution for $\sigma^2$ (the variance of $\epsilon_i$, i.e. the variance of the error terms in the Bayesian linear regression model). There are two types:

inverse.gamma: Inverse Gamma distribution with known shape and scale parameters $a$ and $b$, respectively.
sigmasq.inverse: inverse sigma-squared distribution, i.e. a prior density proportional to $1/\sigma^2$.

Similar to beta.prior above, the structure of the list depends on the type of prior distribution chosen. The list must be supplied in either of the following structures:

sigmasq.prior=list(type = "inverse.gamma", inverse.gamma.a = ..., inverse.gamma.b = ..., sigmasq.init = ...)
- inverse.gamma.a: the shape parameter $a$ for the Inverse Gamma prior distribution, assumed known; default = 1.
- inverse.gamma.b: the scale parameter $b$ for the Inverse Gamma prior distribution, assumed known; default = 1.
- sigmasq.init: the initial value for the unknown $\sigma^2$ parameter for the MCMC chain; default = 1.
sigmasq.prior=list(type = "sigmasq.inverse", sigmasq.init = ...)
- sigmasq.init: the initial value for the unknown $\sigma^2$ parameter for the MCMC chain; default = 1.

Tsamp.out

an optional scalar that specifies the number of MCMC samples to generate; default = 1,000.

zero.intercept

an optional logical parameter with default = FALSE. If zero.intercept = TRUE is specified, the linear regression model sets the y-intercept term $\beta_0$ to zero; the corresponding y-intercept terms of the matrices data.values$xtx and data.values$xty are ignored, and the $\beta$ vector is revised throughout the models and output automatically by the function.
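
One plausible reading of "ignored" is that the first row and column of xtx and the first element of xty (the entries that involve the column of 1's) are dropped before sampling. The sketch below shows that reduction on a small made-up matrix; it is an assumption about the effect, not a copy of the function's internal code, and the names `xtx.noint` and `xty.noint` are hypothetical.

```r
# Made-up summary statistics for an intercept plus two predictors.
xtx <- matrix(c(4, 2,   1,
                2, 3,   0.5,
                1, 0.5, 2), nrow = 3, byrow = TRUE)
xty <- c(1.0, 2.0, 3.0)

# Drop the intercept row/column so that beta has length k instead of k + 1.
xtx.noint <- xtx[-1, -1, drop = FALSE]
xty.noint <- xty[-1]

print(xtx.noint)
print(xty.noint)
```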

Value

The returned value is a list containing the MCMC samples of the unknown Bayesian linear regression model parameters; the number of MCMC samples is equal to the argument Tsamp.out. Further analysis, including plotting and creating summary statistics, can be carried out using the 'coda' R package (see References).
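
The following sketch shows one way such output might be summarized. It assumes, based on how the Examples section accesses the result, that the list has a matrix component `beta` (one row per MCMC sample) and a vector component `sigmasq`; a random stand-in object is used here so the snippet runs without the package, and the column names are made up. The 'coda' step is skipped if that package is not installed.

```r
set.seed(42)

# Stand-in for the list returned by bayes.regress() (assumed layout).
MCMC.out <- list(beta    = matrix(rnorm(1000 * 3), 1000, 3),
                 sigmasq = rexp(1000))

# Combine all sampled parameters into one matrix with labelled columns.
draws <- cbind(MCMC.out$beta, MCMC.out$sigmasq)
colnames(draws) <- c("beta0", "beta1", "beta2", "sigmasq")

# Posterior medians and 95% equal-tailed intervals using base R only.
post.summary <- t(apply(draws, 2, quantile, probs = c(0.025, 0.5, 0.975)))
print(post.summary)

# Optional: convergence diagnostics via the 'coda' package.
if (requireNamespace("coda", quietly = TRUE)) {
  chain <- coda::mcmc(draws)
  print(summary(chain))
  print(coda::effectiveSize(chain))
}
```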

Details

This function uses the following Bayesian linear regression model: $$y_i=x_i' \beta + \epsilon_i,$$ where $i = 1,\ldots,\mathrm{numsamp.data}$; $\epsilon_i \sim N(0,\sigma^2)$; $k$ is the number of predictor variables; and $x_i$ is the vector of the $k$ predictor values for observation $i$ with a leading 1 for the y-intercept, so that $\beta$ has length $k+1$. The function uses user-supplied prior distributions for $\beta$ and $\sigma^2$.

The Gibbs sampler is used to sample from all full conditional posterior distributions, which only depend on the summary statistics $X'X$, $X'Y$ and $Y'Y$ (and $Y'X = (X'Y)'$); these summary statistics are calculated by the function read.regress.data.ff() (in this package), or can be provided by the user. Starting values are not needed for the vector $\beta$, since this vector is updated first, conditioned on all other unknown model parameters and the data.
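
The reason the summary statistics suffice can be seen from the Gaussian likelihood, in which the data enter only through the residual sum of squares. This is a standard identity, written here with $n$ as shorthand for numsamp.data:

$$p(Y | X, \beta, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}(Y - X\beta)'(Y - X\beta)\right),$$

$$(Y - X\beta)'(Y - X\beta) = Y'Y - \beta'X'Y - Y'X\beta + \beta'X'X\beta.$$

Every term on the right-hand side is a function of $\beta$ and of $X'X$, $X'Y$ and $Y'Y$ only, so the raw data are not needed once these products have been computed.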

The full conditional posterior distributions are the following for each prior specification of $\beta$; these depend on the data only through summary statistics $X'X$ and $X'Y$:
- beta.prior=list(type = "flat"): $$\beta | \sigma^2, X, Y \sim Normal_{k+1} (mean=(X'X)^{-1}(X'Y), covariance=\sigma^2(X'X)^{-1})$$
- beta.prior=list(type = "mvnorm.known"): $$\beta | \sigma^2, X, Y \sim Normal_{k+1} (mean=(C^{-1}+\sigma^{-2}(X'X))^{-1}(C^{-1}\mu + \sigma^{-2}X'Y), covariance=(C^{-1}+\sigma^{-2}(X'X))^{-1})$$
- beta.prior=list(type = "mvnorm.unknown"): $$\beta | \sigma^2, \mu, C^{-1}, X, Y \sim Normal_{k+1} (mean=(C^{-1}+\sigma^{-2}(X'X))^{-1}(C^{-1}\mu + \sigma^{-2}X'Y), covariance=(C^{-1}+\sigma^{-2}(X'X))^{-1})$$ $$\mu | \beta, \sigma^2, C^{-1}, X, Y \sim Normal_{k+1} (mean=(D^{-1}+C^{-1})^{-1}(C^{-1}\beta+D^{-1}\eta), covariance=(D^{-1}+C^{-1})^{-1})$$ $$C^{-1} | \beta, \sigma^2, \mu, X, Y \sim Wishart_{k+1} (d.f. = (1+\lambda), scale matrix = (V^{-1}+ (\beta - \mu)(\beta - \mu)')^{-1})$$
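
To make the "mvnorm.known" update concrete, the sketch below draws one $\beta$ from a Normal distribution with precision matrix $C^{-1}+\sigma^{-2}X'X$ and mean $(C^{-1}+\sigma^{-2}X'X)^{-1}(C^{-1}\mu+\sigma^{-2}X'Y)$, using a Cholesky factor of the precision matrix. The numbers are made up and the code is only an illustration in base R; it is not taken from the package, whose internal implementation may differ.

```r
set.seed(2)
p <- 3

# Made-up summary statistics and current value of sigma^2.
xtx <- matrix(c(50,  5,  2,
                 5, 40,  1,
                 2,  1, 30), nrow = p, byrow = TRUE)
xty     <- c(10, -20, 15)
sigmasq <- 0.5

# Known prior mean and prior precision (the documented defaults).
mu   <- rep(0, p)
Cinv <- diag(1, p)

# Posterior precision Q and the right-hand side of the mean equation.
Q   <- Cinv + xtx / sigmasq
rhs <- Cinv %*% mu + xty / sigmasq

# R is upper triangular with t(R) %*% R == Q.
R <- chol(Q)

# Posterior mean: solve Q m = rhs with two triangular solves.
post.mean <- backsolve(R, forwardsolve(t(R), rhs))

# If z ~ N(0, I), then solve(R, z) has covariance solve(Q).
beta.draw <- as.numeric(post.mean + backsolve(R, rnorm(p)))
print(beta.draw)
```

Working with the precision matrix directly means $C$ itself is never needed, which is consistent with the advice to supply prec.Cinv rather than cov.C.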

The full conditional posterior distributions are the following for each prior specification of $\sigma^2$; these depend on the data only through summary statistics $X'X$, $X'Y$ and $Y'Y$:
- sigmasq.prior=list(type = "inverse.gamma"): $$\sigma^2 | \beta, X, Y \sim \mathrm{Inv{-}Gamma} \left(\frac{\mathrm{numsamp.data}}{2}+a, \left(\frac{1}{2}(Y'Y-\beta'X'Y-Y'X\beta+\beta'X'X\beta)+1/b\right)^{-1}\right)$$
- sigmasq.prior=list(type = "sigmasq.inverse"): $$\sigma^2 | \beta, X, Y \sim \mathrm{Inv{-}Gamma} \left(\frac{\mathrm{numsamp.data}}{2}, \left(\frac{1}{2}(Y'Y-\beta'X'Y-Y'X\beta+\beta'X'X\beta)\right)^{-1}\right)$$
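
The two conditional updates can be combined into a short Gibbs sampler. The sketch below does this in base R for the "flat" prior on $\beta$ and the "inverse.gamma" prior on $\sigma^2$, touching the data only through xtx, xty and yty. It reads the second Inverse Gamma argument above as the scale of the Gamma distribution of $1/\sigma^2$ (so its rate is $\frac{1}{2}(\ldots)+1/b$); that reading, the simulated data and all variable names are assumptions made for illustration, and this is not the package's own code.

```r
set.seed(1)
n <- 200
X <- cbind(1, matrix(rnorm(n * 2), n, 2))
beta.true <- c(-0.5, 1.0, 2.0)
y <- as.numeric(X %*% beta.true + rnorm(n, sd = 0.5))

# Summary statistics; X and y are not used below this point.
xtx <- crossprod(X)
xty <- crossprod(X, y)
yty <- sum(y^2)

xtx.inv  <- chol2inv(chol(xtx))
beta.hat <- as.numeric(xtx.inv %*% xty)  # conditional mean under the flat prior
L        <- t(chol(xtx.inv))             # L %*% t(L) == xtx.inv

a <- 1; b <- 1        # Inverse Gamma shape and scale
Tsamp   <- 2000       # number of MCMC samples
p       <- length(beta.hat)
sigmasq <- 1          # starting value for sigma^2

beta.draws    <- matrix(NA_real_, Tsamp, p)
sigmasq.draws <- numeric(Tsamp)

for (iter in seq_len(Tsamp)) {
  # beta | sigma^2 ~ Normal(beta.hat, sigma^2 * xtx.inv)
  beta <- beta.hat + sqrt(sigmasq) * as.numeric(L %*% rnorm(p))

  # Residual sum of squares from summary statistics only.
  rss <- yty - 2 * sum(beta * xty) + as.numeric(t(beta) %*% xtx %*% beta)

  # 1 / sigma^2 | beta ~ Gamma(shape = n/2 + a, rate = rss/2 + 1/b)
  sigmasq <- 1 / rgamma(1, shape = n / 2 + a, rate = rss / 2 + 1 / b)

  beta.draws[iter, ] <- beta
  sigmasq.draws[iter] <- sigmasq
}

print(colMeans(beta.draws))
print(mean(sigmasq.draws))
```

Because $\beta$ is drawn first in each iteration, only $\sigma^2$ needs a starting value, matching the remark above about starting values.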

References

Carlin, B.P. and Louis, T.A. (2009) Bayesian Methods for Data Analysis, 3rd ed., Boca Raton, FL: Chapman and Hall/CRC Press.

Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A. and Rubin, D.B. (2013) Bayesian Data Analysis, 3rd ed., Boca Raton, FL: Chapman and Hall/CRC Press.

Plummer, M., Best, N., Cowles, K. and Vines, K. (2006) CODA: Convergence diagnosis and output analysis for MCMC. R News, 6(1), 7-11.

Adler, D., Glaser, C., Nenadic, O., Oehlschlagel, J. and Zucchini, W. (2013) ff: memory-efficient storage of large data on disk and fast access functions. R package: https://CRAN.R-project.org/package=ff.

Fasiolo, M. (2014) An introduction to mvnfast. R package: https://CRAN.R-project.org/package=mvnfast.

Examples

# NOT RUN {
##################################################
## Simulate data 
##################################################

set.seed(284698)

num.samp  <- 100 # number of data values to simulate

# The first value of the beta vector is the y-intercept:
beta <- c(-0.33, 0.78, -0.29, 0.47, -1.25)

# Calculate the number of predictor variables:
num.pred <- length(beta)-1

rho       <- 0.0  # correlation between predictors
mean.vec  <- rep(0,num.pred)
sigma.mat <- matrix(rho,num.pred,num.pred) + diag(1-rho,num.pred)
sigmasq.sim <- 0.05

# Simulate predictor variables (rmvn() is provided by the "mvnfast" package):
x.pre       <- rmvn(num.samp, mu=mean.vec, sigma=sigma.mat)       

# Add leading column of 1's to x.pre for y-intercept:
x <- cbind(rep(1,num.samp),x.pre)

epsilon <- rnorm(num.samp, mean=0, sd=sqrt(sigmasq.sim))

y  <- as.numeric( x %*% as.matrix(beta) +  epsilon)

## Compute summary statistics (alternatively, the
# "read.regress.data.ff()" function (in this package) can be 
# used to calculate summary statistics; see example below).

xtx <- t(x)%*%x 
xty <- t(x)%*%y 
yty <- t(y)%*%y 

data.values<-list(xtx=xtx, xty=xty, yty=yty,
                  numsamp.data = num.samp, 
                  xtx.inv = chol2inv(chol(xtx)))

##########################################################
## Bayesian linear regression analysis
##########################################################

Tsamp.out <- 100 # number of MCMC samples to produce

## Choose priors for beta and sigma-squared. Here,
# beta: Uniform prior; sigma-squared: Inverse Gamma prior. 

beta.prior    <- list( type = "flat")
sigmasq.prior <- list(type = "inverse.gamma", inverse.gamma.a = 1.0, 
                      inverse.gamma.b = 1.0, sigmasq.init = 1.0 )

set.seed(284698)

# Run the "bayes.regress()" function using the data simulated above.

MCMC.out <- bayes.regress(data.values, 
                          beta.prior, 
                          sigmasq.prior = sigmasq.prior, 
                          Tsamp.out = Tsamp.out)

# Next, print the posterior means of the unknown model parameters.
# Alternatively, the "coda" package can be used for analysis.

print(c(colMeans(MCMC.out$beta), mean(MCMC.out$sigmasq)))

# Check that output is close to simulated values (although num.samp and
# Tsamp.out are small here); note that the output includes both beta and 
# sigmasq:
# c(-0.33,  0.78, -0.29,  0.47, -1.25,  0.05)

## Run all 6 combinations of priors for 3 "beta.prior" choices and 
#  2 "sigmasq.prior" choices:

beta.priors <- list(
  list( type = "flat"),
  
  list( type = "mvnorm.known", 
        mean.mu = rep(0.0,    (num.pred+1)), 
        prec.Cinv = diag(1.0, (num.pred+1))),
        
  list( type = "mvnorm.unknown",
        mu.hyper.mean.eta         = rep(0.0,(num.pred+1)),  
        mu.hyper.prec.Dinv        = diag(1.0, (num.pred+1)),  
        Cinv.hyper.df.lambda      = (num.pred+1), 
        Cinv.hyper.invscale.Vinv  = diag(1.0, (num.pred+1)),  
        mu.init                   = rep(1.0, (num.pred+1)), 
        Cinv.init                 = diag(1.0,(num.pred+1)))   
)

sigmasq.priors <- list(
  list(type = "inverse.gamma", 
       inverse.gamma.a = 1.0, 
       inverse.gamma.b = 1.0, 
       sigmasq.init = 0.1 ),       
  list( type="sigmasq.inverse", sigmasq.init = 0.1)
)

for (beta.prior in beta.priors)
{
  for(sigmasq.prior in sigmasq.priors)
  {
    set.seed(284698)
    MCMC.out <- bayes.regress(data.values, 
                              beta.prior, 
                              sigmasq.prior = sigmasq.prior, 
                              Tsamp.out = Tsamp.out)
    print(c(colMeans(MCMC.out$beta), mean(MCMC.out$sigmasq)))
  }
}

# Check that output is close to simulated values (although num.samp and
# Tsamp.out are small here); note that the output includes both beta and 
# sigmasq:
# c(-0.33,  0.78, -0.29,  0.47, -1.25,  0.05)


#######################################################################
## Read the data from a file, calculate the summary statistics and run 
## the Bayesian linear regression analysis
#######################################################################

Tsamp.out <- 100

## Assume non-zero y-intercept data.

# Read the files and compute summary statistics using the "read.regress.data.ff()" 
# function (in this package).


filename <- system.file('data/regressiondata.nz.all.csv.gz', package='BayesSummaryStatLM')
data.values <- read.regress.data.ff(filename)

# Calculate the number of predictors.

num.pred <- length(data.values$xty)-1

## Run all 6 combinations of priors for 3 "beta.prior" choices and 
#  2 "sigmasq.prior" choices:

beta.priors <- list(
  list( type = "flat"),
  
  list( type = "mvnorm.known", 
        mean.mu = rep(0.0,    (num.pred+1)), 
        prec.Cinv = diag(1.0, (num.pred+1))),
        
  list( type="mvnorm.unknown",
        mu.hyper.mean.eta         = rep(0.0,  (num.pred+1)),  
        mu.hyper.prec.Dinv        = diag(1.0, (num.pred+1)),  
        Cinv.hyper.df.lambda      = (num.pred+1), 
        Cinv.hyper.invscale.Vinv  = diag(1.0, (num.pred+1)),  
        mu.init                   = rep(1.0, (num.pred+1)),      
        Cinv.init                 = diag(1.0,(num.pred+1)))   
)

sigmasq.priors <- list(
  list(type = "inverse.gamma", inverse.gamma.a = 1.0, 
               inverse.gamma.b = 1.0, sigmasq.init = 0.5 ),
  list( type = "sigmasq.inverse", sigmasq.init = 0.5)
)

for (beta.prior in beta.priors)
{
  for(sigmasq.prior in sigmasq.priors)
  {

    set.seed(284698)
    MCMC.out <- bayes.regress(data.values, 
                              beta.prior, 
                              sigmasq.prior = sigmasq.prior, 
                              Tsamp.out = Tsamp.out)
                              
    print(c(colMeans(MCMC.out$beta), mean(MCMC.out$sigmasq)))
  }
}

# Check that output is close to simulated values (although num.samp and
# Tsamp.out are small here); note that the output includes both beta and                           
# sigmasq:
# c( 0.76, -0.92, 0.64, 0.57, -1.65, 0.25)


# }
