mc.permute.vs: Permutation-based variable selection approach with parallel computation

Description

This function implements the permutation-based variable selection approach for BART (see Algorithm 1 in Luo and Daniels (2021) for details), using parallel computation to obtain the null variable importance scores. Three variable importance measures are considered: BART variable inclusion proportions (VIP), BART within-type variable inclusion proportions (within-type VIP), and BART Metropolis Importance (MI). The permutation-based approach using BART VIP as the importance measure was proposed by Bleich et al. (2014). BART within-type VIP and BART MI were proposed by Luo and Daniels (2021) to accommodate mixed-type predictors and to allow more relevant predictors into the model.

Usage

mc.permute.vs(
  x.train,
  y.train,
  probit = FALSE,
  npermute = 100L,
  nreps = 10L,
  alpha = 0.05,
  true.idx = NULL,
  plot = TRUE,
  n.var.plot = Inf,
  xinfo = matrix(0, 0, 0),
  numcut = 100L,
  usequants = FALSE,
  cont = FALSE,
  rm.const = TRUE,
  k = 2,
  power = 2,
  base = 0.95,
  split.prob = "polynomial",
  ntree = 20L,
  ndpost = 1000,
  nskip = 1000,
  keepevery = 1L,
  verbose = FALSE,
  mc.cores = 2L,
  nice = 19L,
  seed = 99L
)

Arguments

x.train

A matrix or a data frame of predictor values with each row corresponding to an observation and each column corresponding to a predictor. If a predictor is a factor with \(q\) levels in a data frame, it is replaced with \(q\) dummy variables.
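The dummy-variable expansion described above can be pictured with a minimal base-R sketch. It uses model.matrix() purely for illustration and is not necessarily how the function performs the conversion internally; the data frame df and its columns are made up.

```r
# Hypothetical data frame with one continuous and one 3-level factor predictor
df <- data.frame(
  x1 = c(0.2, 1.5, -0.7, 0.9),
  x2 = factor(c("a", "b", "c", "a"))
)

# Expand the factor into one indicator column per level (q = 3 dummies);
# removing the intercept (- 1) keeps all q levels rather than q - 1 contrasts
dummies <- model.matrix(~ x2 - 1, data = df)

# Combine the continuous column with the indicator columns
x.expanded <- cbind(x1 = df$x1, dummies)
colnames(x.expanded)
```

Each row of the indicator block contains exactly one 1, so a factor with \(q\) levels contributes \(q\) columns to the predictor matrix.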

y.train

A vector of response (continuous or binary) values.

probit

A Boolean argument indicating whether the response variable is binary or continuous; probit=FALSE (by default) means that the response variable is continuous.

npermute

The number of permutations for estimating the null distributions of the variable importance scores.

nreps

The number of replications for obtaining the averaged (or median) variable importance scores based on the original data set.

alpha

A number between \(0\) and \(1\); a predictor is selected if its averaged (or median) variable importance score exceeds the \(1-\alpha\) quantile of the corresponding null distribution.
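The selection rule can be sketched in base R as follows. This is only an illustration under stated assumptions: the score matrices are simulated stand-ins (not BART output), and the names null.scores, obs.scores and cutoff are hypothetical.

```r
set.seed(1)
p <- 5          # number of predictors (illustrative)
nreps <- 10     # replications on the original data
npermute <- 100 # permutations (null data sets)
alpha <- 0.05

# Simulated stand-ins for importance scores:
# rows are permutations/replications, columns are predictors
null.scores <- matrix(runif(npermute * p, 0, 0.2), nrow = npermute)
obs.scores <- matrix(runif(nreps * p, 0, 0.2), nrow = nreps)
obs.scores[, 1:2] <- obs.scores[, 1:2] + 0.5  # make predictors 1 and 2 stand out

# Averaged score per predictor on the original data
avg.score <- colMeans(obs.scores)

# Per-predictor (1 - alpha) quantile of the null distribution
cutoff <- apply(null.scores, 2, quantile, probs = 1 - alpha)

# A predictor is selected if its averaged score exceeds its own cutoff
selected <- which(avg.score > cutoff)
selected
```

A smaller alpha raises the cutoff and therefore makes the selection more conservative.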

true.idx

(Optional) A vector of indices of the true relevant predictors; if provided, metrics including precision, recall and F1 score will be returned.

plot

(Optional) A Boolean argument indicating whether plots are returned or not.

n.var.plot

The number of variables to be plotted.

xinfo

A matrix of cut-points with each row corresponding to a predictor and each column corresponding to a cut-point. xinfo=matrix(0.0,0,0) indicates the cut-points are specified by BART.

numcut

The number of possible cut-points. If a single number is given, it is used for all predictors; otherwise, a vector with length equal to ncol(x.train) is required, where the \(i\)-th element gives the number of cut-points for the \(i\)-th predictor in x.train. If usequants=FALSE, numcut equally spaced cut-points are used to cover the range of values in the corresponding column of x.train. If usequants=TRUE, then min(numcut, the number of unique values in the corresponding column of x.train - 1) cut-point values are used.

usequants

A Boolean argument indicating how the cut-points in xinfo are generated; if usequants=TRUE, uniform quantiles are used for the cut-points; otherwise, the cut-points are equally spaced.
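The two cut-point schemes can be compared with a rough base-R sketch. The exact placement used internally may differ; the vector x and the variable names here are hypothetical.

```r
x <- c(1, 2, 2, 3, 5, 8, 13, 21)  # one hypothetical predictor column
numcut <- 4

# usequants = FALSE: numcut equally spaced interior points over the range of x
grid <- seq(min(x), max(x), length.out = numcut + 2)
equal.cuts <- grid[-c(1, numcut + 2)]  # drop the two end points

# usequants = TRUE: at most (number of unique values - 1) cut-points,
# placed at uniformly spaced quantile levels
n.cuts <- min(numcut, length(unique(x)) - 1)
quant.cuts <- quantile(x, probs = seq_len(n.cuts) / (n.cuts + 1), names = FALSE)

equal.cuts
quant.cuts
```

For a skewed predictor such as this one, the quantile-based cut-points concentrate where the data are dense, whereas the equally spaced ones cover the range evenly.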

cont

A Boolean argument indicating whether to assume all predictors are continuous.

rm.const

A Boolean argument indicating whether to remove constant predictors.

k

The number of prior standard deviations that \(E(Y|x) = f(x)\) is away from \(\pm 0.5\). The response (y.train) is internally scaled to the range from \(-0.5\) to \(0.5\). The bigger k is, the more conservative the fitting will be.

power

The power parameter of the polynomial splitting probability for the tree prior. Only used if split.prob="polynomial".

base

The base parameter of the polynomial splitting probability for the tree prior if split.prob="polynomial"; if split.prob="exponential", the probability of splitting a node at depth \(d\) is base\(^d\).

split.prob

A string indicating which splitting probability is used for the tree prior. If split.prob="polynomial", the splitting probability in Chipman et al. (2010) is used; if split.prob="exponential", the splitting probability in Rockova and Saha (2019) is used.
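To see how base, power and split.prob interact, the following sketch evaluates both splitting probabilities by node depth. The polynomial form base\((1+d)^{-\)power\()}\) is the one given in Chipman et al. (2010); the helper split_prob() is hypothetical and not part of the package.

```r
# Hypothetical helper: prior probability that a node at a given depth splits
split_prob <- function(depth, base = 0.95, power = 2,
                       split.prob = c("polynomial", "exponential")) {
  split.prob <- match.arg(split.prob)
  if (split.prob == "polynomial") {
    base * (1 + depth)^(-power)  # polynomial decay in depth
  } else {
    base^depth                   # exponential decay in depth
  }
}

depth <- 0:3
poly.probs <- split_prob(depth, split.prob = "polynomial")
expo.probs <- split_prob(depth, base = 0.5, split.prob = "exponential")

poly.probs
expo.probs
```

In both cases the probability of splitting shrinks with depth, which keeps individual trees shallow.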

ntree

The number of trees in the ensemble.

ndpost

The number of posterior samples returned.

nskip

The number of initial posterior samples discarded as burn-in.

keepevery

Every keepevery-th posterior sample is kept and returned to the user.

verbose

A Boolean argument indicating whether any messages are printed out.

mc.cores

The number of cores to employ in parallel.

nice

Set the job niceness. The default niceness is \(19\); niceness ranges from \(0\) (highest priority) to \(19\) (lowest priority).

seed

Seed required for reproducible MCMC.

Value

The function mc.permute.vs() returns three (or two if the predictors are of the same type) plots if plot=TRUE and a list with the following components.

vip.imp.cols

The vector of column indices of the predictors selected by the approach using VIP as the variable importance score.

vip.imp.names

The vector of column names of the predictors selected by the approach using VIP as the variable importance score.

avg.vip

The vector (length=ncol(x.train)) of the averaged VIPs based on the original data set; avg.vip=colMeans(avg.vip.mtx).

avg.vip.mtx

A matrix of VIPs based on the original data set, with each row corresponding to a repetition and each column corresponding to a predictor.

permute.vips

A matrix of VIPs based on the null data sets, with each row corresponding to a permutation (null data set) and each column corresponding to a predictor.

within.type.vip.imp.cols

The vector of column indices of the predictors selected by the approach using within-type VIP as the variable importance score.

within.type.vip.imp.names

The vector of column names of the predictors selected by the approach using within-type VIP as the variable importance score.

avg.within.type.vip

The vector (length=ncol(x.train)) of the averaged within-type VIPs based on the original data set; avg.within.type.vip=colMeans(avg.within.type.vip.mtx).

avg.within.type.vip.mtx

A matrix of within-type VIPs based on the original data set, with each row corresponding to a repetition and each column corresponding to a predictor.

permute.within.type.vips

A matrix of within-type VIPs based on the null data sets, with each row corresponding to a permutation (null data set) and each column corresponding to a predictor.

mi.imp.cols

The vector of column indices of the predictors selected by the approach using MI as the variable importance score.

mi.imp.names

The vector of column names of the predictors selected by the approach using MI as the variable importance score.

median.mi

The vector (length=ncol(x.train)) of the median MIs based on the original data set; median.mi=colMeans(median.mi.mtx).

median.mi.mtx

A matrix of MIs based on the original data set, with each row corresponding to a repetition and each column corresponding to a predictor.

permute.mis

A matrix of MIs based on the null data sets, with each row corresponding to a permutation (null data set) and each column corresponding to a predictor.

true.idx

A vector of indices of the true relevant predictors; only returned if true.idx is provided as input.

vip.precision

The precision score for the approach using VIP as the variable importance score; only returned if true.idx is provided.

vip.recall

The recall score for the approach using VIP as the variable importance score; only returned if true.idx is provided.

vip.f1

The F1 score for the approach using VIP as the variable importance score; only returned if true.idx is provided.

wt.vip.precision

The precision score for the approach using within-type VIP as the variable importance score; only returned when the predictors are of mixed types and true.idx is provided.

wt.vip.recall

The recall score for the approach using within-type VIP as the variable importance score; only returned when the predictors are of mixed types and true.idx is provided.

wt.vip.f1

The F1 score for the approach using within-type VIP as the variable importance score; only returned when the predictors are of mixed types and true.idx is provided.

mi.precision

The precision score for the approach using MI as the variable importance score; only returned if true.idx is provided.

mi.recall

The recall score for the approach using MI as the variable importance score; only returned if true.idx is provided.

mi.f1

The F1 score for the approach using MI as the variable importance score; only returned if true.idx is provided.
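For reference, precision, recall and F1 follow their standard definitions and can be computed from a set of selected column indices and true.idx as in the sketch below. The helper selection_metrics() is hypothetical, and its handling of an empty selection (returning 0) is an assumption rather than documented behaviour of the function.

```r
# Hypothetical helper: standard precision/recall/F1 for a selected index set
selection_metrics <- function(selected, true.idx) {
  tp <- length(intersect(selected, true.idx))  # correctly selected predictors
  # Assumption: precision is defined as 0 when nothing is selected
  precision <- if (length(selected) > 0) tp / length(selected) else 0
  recall <- tp / length(true.idx)
  f1 <- if (precision + recall > 0) {
    2 * precision * recall / (precision + recall)
  } else {
    0
  }
  c(precision = precision, recall = recall, f1 = f1)
}

# Four predictors selected, three of which are among the five truly relevant ones
m <- selection_metrics(selected = c(1, 2, 6, 9), true.idx = c(1, 2, 6:8))
m
```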

Details

The detailed algorithm can be found in Algorithm 1 in Luo and Daniels (2021). The permutation-based variable selection approach using within-type VIP as the variable importance measure is only used when the predictors are of mixed types; otherwise, it is the same as the one using VIP as the variable importance measure. If true.idx is provided, the precision, recall and F1 scores are returned for the three (or two if the predictors are of the same type) methods. If plot=TRUE, three (or two if the predictors are of the same type) plots showing which predictors are selected are generated.

References

Bleich, Justin et al. (2014). "Variable selection for BART: an application to gene regulation." Ann. Appl. Stat. 8.3, pp 1750--1781.

Chipman, H. A., George, E. I. and McCulloch, R. E. (2010). "BART: Bayesian additive regression trees." Ann. Appl. Stat. 4 266--298.

Luo, C. and Daniels, M. J. (2021). "Variable Selection Using Bayesian Additive Regression Trees." arXiv preprint arXiv:2112.13998.

Rockova, V. and Saha, E. (2019). "On theory for BART." In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2839--2848. PMLR.

Examples


# NOT RUN {
## simulate data (Scenario C.M.1. in Luo and Daniels (2021))
set.seed(123)
data = mixone(100, 10, 1, FALSE)
## parallel::mcparallel/mccollect do not exist on Windows
if (.Platform$OS.type == 'unix') {
  ## test mc.permute.vs() function
  res = mc.permute.vs(data$X, data$Y, probit=FALSE, npermute=100, nreps=10, alpha=0.05,
                      true.idx=c(1, 2, 6:8), plot=FALSE, ntree=10, ndpost=100, nskip=100, mc.cores=2)
}
# }
