Inference for ordinary least squares, lasso/NG, horseshoe and ridge regression models by (Gibbs) sampling from the Bayesian posterior distribution, augmented with Reversible Jump for model selection

```
bhs(X, y, T=1000, thin=NULL, RJ=TRUE, M=NULL, beta=NULL,
lambda2=1, s2=var(y-mean(y)), mprior=0, ab=NULL,
theta=0, rao.s2=TRUE, icept=TRUE, normalize=TRUE, verb=1)
bridge(X, y, T = 1000, thin = NULL, RJ = TRUE, M = NULL,
beta = NULL, lambda2 = 1, s2 = var(y-mean(y)), mprior = 0,
rd = NULL, ab = NULL, theta=0, rao.s2 = TRUE, icept = TRUE,
normalize = TRUE, verb = 1)
blasso(X, y, T = 1000, thin = NULL, RJ = TRUE, M = NULL,
beta = NULL, lambda2 = 1, s2 = var(y-mean(y)),
case = c("default", "ridge", "hs", "ng"), mprior = 0, rd = NULL,
ab = NULL, theta = 0, rao.s2 = TRUE, icept = TRUE,
normalize = TRUE, verb = 1)
```

X

`data.frame`

, `matrix`

, or vector of inputs `X`

y

vector of output responses `y`

of length equal to the
leading dimension (rows) of `X`

, i.e., `length(y) == nrow(X)`

T

total number of MCMC samples to be collected

thin

number of MCMC samples to skip before a sample is
collected (via thinning). If `NULL`

(default), then
`thin`

is determined based on the regression model implied
by `RJ`

, `lambda2`

, and `ncol(X)`

; and also
on the errors model implied by `theta`

and `nrow(X)`

RJ

if `TRUE`

then model selection on the columns of the
design matrix (and thus the parameter `beta`

in the model) is
performed by Reversible Jump (RJ) MCMC. The initial model is
specified by the `beta`

input, described below, and the maximal
number of covariates in the model is specified by `M`

M

the maximal number of allowed covariates (columns of
`X`

) in the model. If input `lambda2 > 0`

then any
`M <= ncol(X)`

is allowed. Otherwise it must be that
`M <= min(ncol(X), length(y)-1)`

, which is default value
when a `NULL`

argument is given

beta

initial setting of the regression coefficients. Any
zero-components will imply that the corresponding covariate (column
of `X`

) is not in the initial model. When input ```
RJ =
FALSE
```

(no RJ) and `lambda2 > 0`

(use lasso) then no
components are allowed to be exactly zero. The default setting is
therefore contextual; see below for details

lambda2

square of the initial lasso penalty parameter. If zero, then least squares regressions are used

s2

initial variance parameter

case

specifies if ridge regression, the
Normal-Gamma, or the horseshoe prior should be done instead
of the lasso; only meaningful when `lambda2 > 0`

mprior

prior on the number of non-zero regression coefficients
(and therefore covariates) `m`

in the model. The default
(`mprior = 0`

) encodes the uniform prior on `0 <= m <= M`

.
A scalar value `0 < mprior < 1`

implies a Binomial prior
`Bin(m|n=M,p=mprior)`

. A 2-vector `mprior=c(g,h)`

of positive values `g`

and `h`

represents
gives `Bin(m|n=M,p)`

prior where `p~Beta(g,h)`

rd

`=c(r, delta)`

, the alpha (shape) parameter and
\(\beta\) (rate) parameter to the gamma distribution prior
`G(r,delta)`

for the \(\lambda^2\) parameter under
the lasso model; or, the \(\alpha\) (shape) parameter and
\(\beta\) (scale) parameter to the
inverse-gamma distribution `IG(r/2, delta/2)`

prior for
the \(\lambda^2\)
parameter under the ridge regression model. A default of `NULL`

generates appropriate non-informative values depending on the
nature of the regression. Specifying `rd=FALSE`

causes
`lambda2`

values to be fixed at their starting value, i.e., not
sampled. See the details below for information
on the special settings for ridge regression

ab

`=c(a, b)`

, the \(\alpha\) (shape)
parameter and the \(\beta\) (scale) parameter for the
inverse-gamma distribution prior `IG(a,b)`

for the variance
parameter `s2`

. A default of `NULL`

generates appropriate
non-informative values depending on the nature of the regression

theta

the rate parameter (`> 0`

) to the exponential prior
on the degrees of freedom paramter `nu`

under a model with
Student-t errors implemented by a scale-mixture prior.
The default setting of `theta = 0`

turns off this prior,
defaulting to a normal errors prior

rao.s2

indicates whether Rao-Blackwellized samples for
\(\sigma^2\) should be used (default `TRUE`

); see
below for more details

icept

if `TRUE`

, an implicit intercept term is fit
in the model, otherwise the the intercept is zero; default is
`TRUE`

normalize

if `TRUE`

, each variable is standardized
to have unit L2-norm, otherwise it is left alone; default is
`TRUE`

verb

verbosity level; currently only `verb = 0`

and
`verb = 1`

are supported

`blasso`

returns an object of class `"blasso"`

, which is a
`list`

containing a copy of all of the input arguments as well as
of the components listed below.

a copy of the function call as used

a vector of `T`

samples of the (un-penalized)
“intercept” parameter

a `T*ncol(X)`

`matrix`

of `T`

samples from
the (penalized) regression coefficients

the number of non-zero entries in each vector of `T`

samples of `beta`

a vector of `T`

samples of the variance parameter

a vector of `T`

samples of the penalty
parameter

a vector of `T`

with the gamma parameter
when `case = "ng"`

a `T*ncol(X)`

`matrix`

of `T`

samples from
the (latent) inverse diagonal of the prior covariance matrix for
`beta`

, obtained for Lasso regressions

a `T*nrow(X)`

`matrix`

of `T`

samples
from the (latent) diagonal of the covariance matrix of the response
providing a scale-mixture implementation of Student-t errors with
degrees of freedom `nu`

when active (input `theta > 0`

)

a vector of `T`

samples of the degrees of freedom
parameter to the Student-t errors mode when active
(input `theta > 0`

)

a vector of `T`

samples of the Binomial proportion
`p`

that was given a Beta prior, as described above for the
2-vector version of the `mprior`

input

the log posterior probability of each (saved) sample of the joint parameters

the log likelihood of each (saved) sample of the parameters

the log likelihood of each (saved) sample of the
parameters under the Normal errors model when sampling under the
Student-t model; i.e., it is not present
unless `theta > 0`

The Bayesian lasso model and Gibbs Sampling algorithm is described
in detail in Park & Casella (2008). The algorithm implemented
by this function is identical to that described therein, with
the exception of an added “option” to use a Rao-Blackwellized
sample of \(\sigma^2\) (with \(\beta\) integrated out)
for improved mixing, and the model selections by RJ described below.
When input argument `lambda2 = 0`

is
supplied, the model is a simple hierarchical linear model where
\((\beta,\sigma^2)\) is given a Jeffrey's prior

Specifying `RJ = TRUE`

causes Bayesian model selection and
averaging to commence for choosing which of the columns of the
design matrix `X`

(and thus parameters `beta`

) should be
included in the model. The zero-components of the `beta`

input
specify which columns are in the initial model, and
`M`

specifies the maximal number of columns.

The RJ mechanism implemented here for the Bayesian lasso model selection differs from the one described by Hans (2009), which is based on an idea from Geweke (1996). Those methods require departing from the Park & Casella (2008) latent-variable model and requires sampling from each conditional \(\beta_i | \beta_{(-i)}, \dots\) for all \(i\), since a mixture prior with a point-mass at zero is placed on each \(\beta_i\). Out implementation here requires no such special prior and retains the joint sampling from the full \(\beta\) vector of non-zero entries, which we believe yields better mixing in the Markov chain. RJ proposals to increase/decrease the number of non-zero entries does proceed component-wise, but the acceptance rates are high due due to marginalized between-model moves (Troughton & Godsill, 1997).

When the lasso prior or RJ is used, the automatic thinning level
(unless `thin != NULL`

) is determined by the number of columns
of `X`

since this many latent variables are introduced

Bayesian ridge regression is implemented as a special case via the
`bridge`

function. This essentially calls `blasso`

with `case = "ridge"`

. A default setting of `rd = c(0,0)`

is
implied by `rd = NULL`

, giving the Jeffery's prior for the
penalty parameter \(\lambda^2\) unless ```
ncol(X) >=
length(y)
```

in which case the proper specification of ```
rd =
c(5,10)
```

is used instead.

The Normal--Gamma prior (Griffin & Brown, 2009) is implemented as
an extension to the Bayesian lasso with `case = "ng"`

. Many
thanks to James Scott for providing the code needed to extend the
method(s) to use the horseshoe prior (Carvalho, Polson, Scott, 2010).

When `theta > 0`

then the Student-t errors via scale mixtures
(and thereby extra latent variables `omega2`

) of Geweke (1993)
is applied as an extension to the Bayesian lasso/ridge model.
If Student-t errors are used the automatic thinning level
is augmented (unless `thin != NULL`

) by the number of rows
in `X`

since this many latent variables are introduced

Park, T., Casella, G. (2008). *The Bayesian Lasso.*
Journal of the American Statistical Association, 103(482),
June 2008, pp. 681-686
http://www.stat.ufl.edu/~casella/Papers/Lasso.pdf

Griffin, J.E. and Brown, P.J. (2009).
*Inference with Normal-Gamma prior distributions in
regression problems.* Bayesian Analysis, 5, pp. 171-188.
http://projecteuclid.org/euclid.ba/1340369797

Hans, C. (2009). *Bayesian Lasso regression.*
Biometrika 96, pp. 835-845.
http://biomet.oxfordjournals.org/content/96/4/835.abstract

Carvalho, C.M., Polson, N.G., and Scott, J.G. (2010) *The
horseshoe estimator for sparse signals.* Biometrika 97(2):
pp. 465-480.
http://ftp.stat.duke.edu/WorkingPapers/08-31.pdf

Geweke, J. (1996). *Variable selection and model comparison
in regression.* In Bayesian Statistics 5. Editors: J.M. Bernardo,
J.O. Berger, A.P. Dawid and A.F.M. Smith, 609-620. Oxford Press.

Paul T. Troughton and Simon J. Godsill (1997).
*A reversible jump sampler for autoregressive time series,
employing full conditionals to achieve efficient model space moves.*
Technical Report CUED/F-INFENG/TR.304, Cambridge University
Engineering Department.

Geweke, J. (1993) *Bayesian treatment of the independent
Student-t linear model.* Journal of Applied Econometrics, Vol. 8,
S19-S40

`lm`

,
`lars`

in the lars package,
`regress`

,
`lm.ridge`

in the MASS package

# NOT RUN { ## following the lars diabetes example data(diabetes) attach(diabetes) ## Ordinary Least Squares regression reg.ols <- regress(x, y) ## Lasso regression reg.las <- regress(x, y, method="lasso") ## Bayesian Lasso regression reg.blas <- blasso(x, y) ## summarize the beta (regression coefficients) estimates plot(reg.blas, burnin=200) points(drop(reg.las$b), col=2, pch=20) points(drop(reg.ols$b), col=3, pch=18) legend("topleft", c("blasso-map", "lasso", "lsr"), col=c(2,2,3), pch=c(21,20,18)) ## plot the size of different models visited plot(reg.blas, burnin=200, which="m") ## get the summary s <- summary(reg.blas, burnin=200) ## calculate the probability that each beta coef != zero s$bn0 ## summarize s2 plot(reg.blas, burnin=200, which="s2") s$s2 ## summarize lambda2 plot(reg.blas, burnin=200, which="lambda2") s$lambda2 # } # NOT RUN { ## fit with Student-t errors ## (~400-times slower due to automatic thinning level) regt.blas <- blasso(x, y, theta=0.1) ## plotting some information about nu, and quantiles plot(regt.blas, "nu", burnin=200) quantile(regt.blas$nu[-(1:200)], c(0.05, 0.95)) ## Bayes Factor shows strong evidence for Student-t model mean(exp(regt.blas$llik[-(1:200)] - regt.blas$llik.norm[-(1:200)])) # } # NOT RUN { ## clean up detach(diabetes) # }