
geommc (version 0.1.1)

geomc.vs: Markov chain Monte Carlo for Bayesian variable selection using a geometric MH algorithm.

Description

geomc.vs uses a geometric approach to MCMC for performing Bayesian variable selection. It produces MCMC samples from the posterior density of a Bayesian hierarchical feature selection model.

Usage

geomc.vs(
  X,
  y,
  initial = NULL,
  n.iter = 50,
  burnin = 1,
  eps = 0.5,
  symm = TRUE,
  move.prob = c(0.4, 0.4, 0.2),
  lam0 = 0,
  a0 = 0,
  b0 = 0,
  lam = nrow(X)/ncol(X)^2,
  w = sqrt(nrow(X))/ncol(X),
  model.summary = FALSE,
  model.threshold = 0.5
)

Value

A list with components

samples

MCMC samples from \(P(\gamma|y)\) returned as a \(p \times\) n.iter sparse logical matrix (class lgCMatrix).

acceptance.rate

The acceptance rate based on all samples.

mip

The \(p\)-vector of marginal inclusion probabilities of all variables, based on post-burnin samples.

median.model

The median probability model, based on post-burnin samples.

beta.mean

The Monte Carlo estimate of the posterior mean of \(\beta\) (the \((p+1)\)-vector c(intercept, regression coefficients)), based on post-burnin samples.

wmip

The \(p\)-vector of weighted marginal inclusion probabilities of all variables, based on post-burnin samples.

wam

The weighted average model, based on post-burnin samples.

beta.wam

The model-probability-weighted estimate of the posterior mean of \(\beta\) (the \((p+1)\)-vector c(intercept, regression coefficients)), based on post-burnin samples.

log.post

The n.iter-vector of the log of the unnormalized marginal posterior pmf \(P(\gamma|y)\) evaluated at the sampled models.

Arguments

X

The \(n\times p\) covariate matrix without intercept. The following classes are supported: matrix and dgCMatrix. No need to center or scale this matrix manually. Scaling is performed implicitly and regression coefficients are returned on the original scale.
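
Because X may be supplied as a dgCMatrix, a large sparse design can be passed without densifying it. A minimal sketch using the Matrix package (the toy design below is invented for illustration and is not part of the package):

```r
library(Matrix)

n <- 20; p <- 50
# Toy design with many structural zeros (illustrative only)
X.dense <- matrix(rbinom(n * p, 1, 0.05) * rnorm(n * p), n, p)
# Convert to the supported sparse class; no manual centering/scaling is needed
X.sparse <- Matrix(X.dense, sparse = TRUE)
class(X.sparse)[1]  # "dgCMatrix"
# X.sparse can then be passed to geomc.vs() in place of a dense matrix
```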

y

The response vector of length \(n\). No need to center or scale.

initial

is the initial model, i.e., the initial set of active variables. Default: the null model.

n.iter

is the number of MCMC samples to be drawn. Default: 50.

burnin

is the number of initial samples discarded as burn-in before computing the model summaries. Default: 1.

eps

is the value of the \(\epsilon\) perturbation used in the geometric MH algorithm. Default: 0.5.

symm

indicates whether the base density is that of a symmetric random walk MH. Default: TRUE.

move.prob

is the vector of ('addition', 'deletion', 'swap') move probabilities. Default: c(0.4, 0.4, 0.2). move.prob is used only when symm is set to FALSE.

lam0

The precision parameter for \(\beta_0\). Default: 0 (corresponding to an improper uniform prior).

a0

The shape parameter for prior on \(\sigma^2\). Default: 0.

b0

The scale parameter for prior on \(\sigma^2\). Default: 0.

lam

The slab precision parameter. Default: \(n/p^2\) as suggested by the theoretical results of Li, Dutta, Roy (2023).

w

The prior inclusion probability of each variable. Default: \(\sqrt{n}/p\).

model.summary

If TRUE, additional model summaries are returned. Default: FALSE.

model.threshold

The threshold probability to select the covariates for the median model (median.model) and the weighted average model (wam). A covariate will be included in median.model (wam) if its marginal inclusion probability (weighted marginal inclusion probability) is greater than the threshold. Default: 0.5.
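
The thresholding rule can be illustrated on a toy vector of marginal inclusion probabilities (the values and variable names below are made up for illustration, not output of geomc.vs):

```r
# Hypothetical marginal inclusion probabilities for four covariates
mip <- c(x1 = 0.92, x2 = 0.64, x3 = 0.11, x4 = 0.48)
model.threshold <- 0.5

# Median probability model: variables whose mip exceeds the threshold
median.model <- which(mip > model.threshold)
median.model  # x1 and x2, i.e., indices 1 and 2
```

The same rule, applied to wmip instead of mip, yields the weighted average model (wam).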

Author

Vivekananda Roy

Details

geomc.vs provides MCMC samples using the geometric MH algorithm of Roy (2024) for variable selection based on a hierarchical Gaussian linear model, with priors placed on the regression coefficients as well as on the model space, as follows: $$y | X, \beta_0,\beta,\gamma,\sigma^2,w,\lambda \sim N(\beta_01 + X_\gamma\beta_\gamma,\sigma^2I_n)$$ $$\beta_i|\beta_0,\gamma,\sigma^2,w,\lambda \stackrel{indep.}{\sim} N(0, \gamma_i\sigma^2/\lambda),~i=1,\ldots,p,$$ $$\beta_0|\gamma,\sigma^2,w,\lambda \sim N(0, \sigma^2/\lambda_0)$$ $$\sigma^2|\gamma,w,\lambda \sim Inv-Gamma (a_0, b_0)$$ $$\gamma_i|w,\lambda \stackrel{iid}{\sim} Bernoulli(w)$$ where \(X_\gamma\) is the \(n \times |\gamma|\) submatrix of \(X\) consisting of those columns of \(X\) for which \(\gamma_i=1\), and similarly, \(\beta_\gamma\) is the \(|\gamma|\) subvector of \(\beta\) corresponding to \(\gamma\). The density \(\pi(\sigma^2)\) of \(\sigma^2 \sim Inv-Gamma (a_0, b_0)\) has the form \(\pi(\sigma^2) \propto (\sigma^2)^{-a_0-1} \exp(-b_0/\sigma^2)\). The functions in the package also allow the non-informative prior \(\pi(\beta_{0}, \sigma^2|\gamma, w) \propto 1/\sigma^{2}\), which is obtained by setting \(\lambda_0=a_0=b_0=0\).

geomc.vs provides the empirical MH acceptance rate and MCMC samples from the posterior pmf of the models, \(P(\gamma|y)\), which is available up to a normalizing constant. If model.summary is set to TRUE, geomc.vs also returns other model summaries. In particular, it returns the marginal inclusion probabilities (mip), computed as a Monte Carlo average, as well as the weighted marginal inclusion probabilities (wmip), computed with weights $$w_i = P(\gamma^{(i)}|y)/\sum_{k=1}^K P(\gamma^{(k)}|y),~i=1,2,...,K,$$ where \(\gamma^{(k)}, k=1,2,...,K,\) are the distinct models sampled. Thus, if \(N_k\) is the number of times the \(k\)th distinct model \(\gamma^{(k)}\) is repeated in the MCMC samples, the mip for the \(j\)th variable is $$mip_j = \sum_{k=1}^{K} N_k I(\gamma^{(k)}_j = 1)/n.iter$$ and the wmip for the \(j\)th variable is $$wmip_j = \sum_{k=1}^K w_k I(\gamma^{(k)}_j = 1).$$

The median.model is the model containing the variables \(j\) with \(mip_j >\) model.threshold, and the wam is the model containing the variables \(j\) with \(wmip_j >\) model.threshold. Note that \(E(\beta|\gamma, y)\), the conditional posterior mean of \(\beta\) given a model \(\gamma\), is available in closed form (see Li, Dutta, Roy (2023) for details). geomc.vs returns two estimates (beta.mean, beta.wam) of the posterior mean of \(\beta\), computed as $$beta.mean = \sum_{k=1}^{K} N_k E(\beta|\gamma^{(k)},y)/n.iter$$ and $$beta.wam = \sum_{k=1}^K w_k E(\beta|\gamma^{(k)},y),$$ respectively.
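
The mip and wmip formulas can be checked on a toy collection of sampled models (the gamma matrix and log-posterior values below are invented for illustration; in practice they would come from the samples and log.post components returned by geomc.vs):

```r
# Five MCMC draws over p = 3 variables; each column is a sampled model gamma
gamma <- matrix(c(1, 0, 0,
                  1, 1, 0,
                  1, 1, 0,
                  1, 0, 0,
                  0, 1, 0), nrow = 3)
# Hypothetical log P(gamma|y) (unnormalized) at the five draws
log.post <- c(-10.2, -9.1, -9.1, -10.2, -12.5)

# mip: Monte Carlo average of the inclusion indicators over all draws
mip <- rowMeans(gamma)  # c(0.8, 0.6, 0)

# wmip: weight each *distinct* model by its normalized posterior probability;
# subtracting the max before exponentiating avoids numerical underflow
keys <- apply(gamma, 2, paste, collapse = "")
distinct <- !duplicated(keys)
w <- exp(log.post[distinct] - max(log.post[distinct]))
w <- w / sum(w)
wmip <- as.vector(gamma[, distinct] %*% w)
```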

References

Roy, V. (2024). A geometric approach to informative MCMC sampling. https://arxiv.org/abs/2406.09010

Li, D., Dutta, S., and Roy, V. (2023). Model Based Screening Embedded Bayesian Variable Selection for Ultra-high Dimensional Settings. Journal of Computational and Graphical Statistics, 32, 61-73.

Examples

library(geommc)
set.seed(1)  # for reproducibility (not in the original example)
n <- 50; p <- 100; nonzero <- 3
trueidx <- 1:nonzero
nonzero.value <- 4
TrueBeta <- numeric(p)
TrueBeta[trueidx] <- nonzero.value
rho <- 0.5  # common pairwise correlation between covariates
xone <- matrix(rnorm(n*p), n, p)
X <- sqrt(1-rho)*xone + sqrt(rho)*rnorm(n)  # equicorrelated design
y <- 0.5 + X %*% TrueBeta + rnorm(n)
result <- geomc.vs(X = X, y = y, model.summary = TRUE)
result$samples  # the MCMC samples
result$acceptance.rate  # the empirical MH acceptance rate
result$mip  # the marginal inclusion probabilities
result$wmip  # the weighted marginal inclusion probabilities
result$median.model  # the median probability model
result$wam  # the weighted average model
result$beta.mean  # the posterior mean of the regression coefficients
result$beta.wam  # another estimate of the posterior mean of the regression coefficients
result$log.post  # the log (unnormalized) posterior probabilities of the MCMC samples
