This function implements the variable selection approach proposed in Linero (2018). Linero (2018) proposes DART, a variant of BART, which replaces the discrete uniform distribution for selecting a split variable with a categorical distribution whose event probabilities follow a Dirichlet distribution. DART estimates the marginal posterior variable inclusion probability (MPVIP) for a predictor as the proportion of posterior samples of the tree structures in which the predictor is used as a split variable at least once, and selects the predictors with MPVIP at least \(0.5\), yielding a median probability model.
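As a concrete illustration, the MPVIP rule can be computed directly from the posterior split counts of a fitted DART model. The sketch below is illustrative rather than this function's internals; it assumes varcount is an ndpost-by-p matrix of posterior split counts, with one row per posterior draw of the tree ensemble and one column per predictor.

## illustrative sketch of the MPVIP rule, assuming varcount is given
mpvip <- colMeans(varcount > 0)   # proportion of draws using each predictor
selected <- which(mpvip >= 0.5)   # median probability model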
medianInclusion.vs(
x.train,
y.train,
probit = FALSE,
vip.selection = TRUE,
true.idx = NULL,
plot = FALSE,
num.var.plot = Inf,
theta = 0,
omega = 1,
a = 0.5,
b = 1,
augment = FALSE,
rho = NULL,
xinfo = matrix(0, 0, 0),
numcut = 100L,
usequants = FALSE,
cont = FALSE,
rm.const = TRUE,
power = 2,
base = 0.95,
split.prob = "polynomial",
k = 2,
ntree = 20L,
ndpost = 1000L,
nskip = 1000L,
keepevery = 1L,
printevery = 100L,
verbose = FALSE
)
A matrix or a data frame of predictor values with each row corresponding to an observation and each column corresponding to a predictor. If a predictor is a factor with \(q\) levels in a data frame, it is replaced with \(q\) dummy variables.
A vector of response (continuous or binary) values.
A Boolean argument indicating whether the response variable is binary or continuous; probit=FALSE (by default) means that the response variable is continuous.
A Boolean argument indicating whether to select predictors using BART VIPs.
(Optional) A vector of indices of the true relevant predictors; if provided, metrics including precision, recall and F1 score are returned.
A Boolean argument indicating whether plots are returned.
The number of variables to be plotted.
Set the theta parameter; zero means random.
Set the omega parameter; zero means random.
A sparsity parameter of the \(Beta(a, b)\) hyper-prior, where \(0.5 \le a \le 1\); a lower value induces more sparsity.
A sparsity parameter of the \(Beta(a, b)\) hyper-prior; typically, \(b = 1\).
A Boolean argument indicating whether data augmentation is performed in the variable selection procedure of Linero (2018).
A sparsity parameter; typically \(\rho = p\), where \(p\) is the number of predictors.
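To see how these hyper-parameters induce sparsity, the sketch below draws one set of split probabilities from the Dirichlet prior of Linero (2018); the values of p and theta are illustrative, and the Dirichlet draw is generated via normalized gamma variates.

## one draw from Dirichlet(theta/p, ..., theta/p) via normalized gammas
set.seed(1)
p <- 10; theta <- 1
s <- rgamma(p, shape = theta / p)
s <- s / sum(s)
round(s, 3)   # mass concentrates on a few predictors; the rest are near zero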
A matrix of cut-points with each row corresponding to a predictor and each column corresponding to a cut-point; xinfo=matrix(0.0,0,0) (the default) indicates that the cut-points are specified by BART.
The number of possible cut-points; if a single number is given, it is used for all predictors; otherwise, a vector of length ncol(x.train) is required, where the \(i\)-th element gives the number of cut-points for the \(i\)-th predictor in x.train. If usequants=FALSE, numcut equally spaced cut-points are used to cover the range of values in the corresponding column of x.train. If usequants=TRUE, then min(numcut, the number of unique values in the corresponding column of x.train - 1) cut-point values are used.
A Boolean argument indicating how the cut-points in xinfo are generated; if usequants=TRUE, uniform quantiles are used for the cut-points; otherwise, the cut-points are generated uniformly.
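The two schemes can be mimicked for a single predictor as follows; this is an illustrative approximation, not the package's exact internals.

x <- c(0.1, 0.4, 0.4, 2.5, 3.0, 7.2)
numcut <- 4
## usequants=FALSE: numcut equally spaced cut-points over the range of x
cuts.spaced <- seq(min(x), max(x), length.out = numcut + 2)[-c(1, numcut + 2)]
## usequants=TRUE: at most min(numcut, number of unique values - 1) quantile cut-points
nc <- min(numcut, length(unique(x)) - 1)
cuts.quant <- quantile(x, probs = (1:nc) / (nc + 1))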
A Boolean argument indicating whether to assume all predictors are continuous.
A Boolean argument indicating whether to remove constant predictors.
The power parameter of the polynomial splitting probability for the tree prior; only used if split.prob="polynomial".
The base parameter of the polynomial splitting probability for the tree prior if split.prob="polynomial"; if split.prob="exponential", the probability of splitting a node at depth \(d\) is base\(^d\).
A string indicating which splitting probability is used for the tree prior. If split.prob="polynomial", the splitting probability in Chipman et al. (2010) is used; if split.prob="exponential", the splitting probability in Rockova and Saha (2019) is used.
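The two priors differ in how quickly the splitting probability decays with node depth \(d\). A small sketch using the documented formulas (the function name is illustrative):

## polynomial (Chipman et al., 2010): base * (1 + d)^(-power)
## exponential (Rockova and Saha, 2019): base^d
split.prob.fun <- function(d, base = 0.95, power = 2,
                           type = c("polynomial", "exponential")) {
  type <- match.arg(type)
  if (type == "polynomial") base * (1 + d)^(-power) else base^d
}
split.prob.fun(0:3)                                     # 0.950 0.237 0.106 0.059
split.prob.fun(0:3, base = 0.5, type = "exponential")   # 1.000 0.500 0.250 0.125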
The number of prior standard deviations that \(E(Y|x) = f(x)\) is away from \(\pm 0.5\). The response (y.train) is internally scaled to the range from \(-0.5\) to \(0.5\). The bigger k is, the more conservative the fitting will be.
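Under the standard BART calibration in Chipman et al. (2010), each leaf value gets a \(N(0, \sigma_\mu^2)\) prior with \(\sigma_\mu = 0.5/(k\sqrt{ntree})\), so that \(k\) prior standard deviations of \(f(x)\) reach the \(\pm 0.5\) bounds from the prior mean of zero; a quick check (illustrative):

## leaf-prior standard deviation under the standard calibration
sigma.mu <- function(k = 2, ntree = 20) 0.5 / (k * sqrt(ntree))
sigma.mu(k = 2)   # ~0.056; doubling k halves sigma.mu (stronger shrinkage)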
The number of trees in the ensemble.
The number of posterior samples returned.
The number of posterior samples discarded as burn-in.
Every keepevery posterior sample is kept to be returned to the user.
As the MCMC runs, a message is printed every printevery iterations.
A Boolean argument indicating whether any messages are printed out.
The function medianInclusion.vs() returns two (or one if vip.selection=FALSE) plots if plot=TRUE, and a list with the following components.
The vector of DART MPVIPs.
The vector of column names of the predictors with DART MPVIP at least \(0.5\).
The vector of column indices of the predictors with DART MPVIP at least \(0.5\).
The precision score for the DART approach; only returned if true.idx is provided.
The recall score for the DART approach; only returned if true.idx is provided.
The F1 score for the DART approach; only returned if true.idx is provided.
The vector of BART VIPs; only returned if vip.selection=TRUE.
The vector of column names of the predictors with BART VIP exceeding 1/ncol(x.train); only returned if vip.selection=TRUE.
The vector of column indices of the predictors with BART VIP exceeding 1/ncol(x.train); only returned if vip.selection=TRUE.
The precision score for the BART approach; only returned if vip.selection=TRUE and true.idx is provided.
The recall score for the BART approach; only returned if vip.selection=TRUE and true.idx is provided.
The F1 score for the BART approach; only returned if vip.selection=TRUE and true.idx is provided.
See Linero (2018) or Section 2.2.3 in Luo and Daniels (2021) for details.
If vip.selection=TRUE, this function also does variable selection by selecting the variables whose BART VIP exceeds 1/ncol(x.train).
If true.idx is provided, the precision, recall and F1 scores are returned.
If plot=TRUE, plots showing which predictors are selected are generated.
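For reference, the reported metrics follow the usual definitions; a minimal sketch, assuming sel.idx holds the selected column indices and true.idx the truly relevant ones (both names illustrative here):

precision <- length(intersect(sel.idx, true.idx)) / length(sel.idx)
recall    <- length(intersect(sel.idx, true.idx)) / length(true.idx)
f1        <- 2 * precision * recall / (precision + recall)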
Chipman, H. A., George, E. I. and McCulloch, R. E. (2010). "BART: Bayesian additive regression trees." Ann. Appl. Stat. 4 266--298.
Linero, A. R. (2018). "Bayesian regression trees for high-dimensional prediction and variable selection." J. Amer. Statist. Assoc. 113 626--636.
Luo, C. and Daniels, M. J. (2021). "Variable Selection Using Bayesian Additive Regression Trees." arXiv preprint arXiv:2112.13998.
Rockova, V. and Saha, E. (2019). "On theory for BART." In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 2839--2848). PMLR.
permute.vs, mc.backward.vs and abc.vs.
## simulate data (Scenario C.M.1. in Luo and Daniels (2021))
set.seed(123)
data = mixone(100, 10, 1, FALSE)
## test medianInclusion.vs() function
res = medianInclusion.vs(data$X, data$Y, probit = FALSE, vip.selection = TRUE,
                         true.idx = c(1, 2, 6:8), plot = FALSE, ntree = 10,
                         ndpost = 100, nskip = 100, verbose = FALSE)
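The component names in the returned list depend on the options used; they can be inspected with, e.g.:

## inspect which components were returned for these settings
str(res)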