BayesTree (version 0.3-1.4)

pdbart: Partial Dependence Plots for BART

Description

Run bart at test observations constructed so that a plot can be created displaying the effect of a single variable (pdbart) or a pair of variables (pd2bart). Note that if y is binary with \(P(Y=1 | x) = F(f(x))\), \(F\) the standard normal cdf, then the plots are all on the \(f\) scale.

Usage

pdbart(
   x.train, y.train,
   xind = 1:ncol(x.train), levs = NULL, levquants = c(.05, (1:9)/10, .95),
   pl = TRUE, plquants = c(.05, .95), ...)

# S3 method for pdbart
plot(
   x,
   xind = 1:length(x$fd),
   plquants = c(.05, .95), cols = c('black', 'blue'), ...)

pd2bart(
   x.train, y.train,
   xind = 1:2, levs = NULL, levquants = c(.05, (1:9)/10, .95),
   pl = TRUE, plquants = c(.05, .95), ...)

# S3 method for pd2bart
plot(
   x,
   plquants = c(.05, .95), contour.color = 'white',
   justmedian = TRUE, ...)

Value

The plot methods produce the plots and don't return anything.

pdbart and pd2bart return lists with the components given below. The list returned by pdbart is assigned class ‘pdbart’ and the list returned by pd2bart is assigned class ‘pd2bart’.

fd

A matrix whose \((i,j)\) value is the \(i^{th}\) draw of \(f_s(x_s)\) for the \(j^{th}\) value of \(x_s\). “fd” is for “function draws”.

For pdbart, fd is a list whose \(k^{th}\) component is the matrix described above corresponding to the \(k^{th}\) variable chosen by argument xind.
The number of columns in each matrix equals the number of values given in the corresponding component of argument levs (or the number of values in levquants).

For pd2bart, fd is a single matrix. The columns correspond to all possible pairs of values for the pair of variables indicated by xind: all \((x_i,x_j)\) where \(x_i\) is a value in the levs component for the first variable and \(x_j\) is a value in the levs component for the second. The first variable changes fastest (see the sketch at the end of this section).

levs

The list of levels used, each component corresponding to a variable.
If argument levs was supplied it is unchanged.
Otherwise, the levels in levs are as constructed using argument levquants.

xlbs

Vector of character strings giving the plotting labels used for the variables.

The remaining components returned in the list are the same as in the value of bart. They are simply passed on from the BART run used to create the partial dependence plot. The function plot.bart can be applied to the object returned by pdbart or pd2bart to examine the BART run.
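As a concrete illustration of the fd layout described above, the following sketch (assuming the pdb1 and pdb2 objects created in the Examples below) summarizes the draws in fd by their posterior medians; for pd2bart the column medians are reshaped into a matrix over the grid of levels, with the first variable indexing the rows.

## assumes pdb1 and pdb2 from the Examples below
## pdbart: fd is a list, one matrix (draws x levels) per variable in xind
med1 <- apply(pdb1$fd[[1]], 2, median)  # posterior median of f_s at each level of variable 1
names(med1) <- pdb1$levs[[1]]

## pd2bart: fd is a single matrix; columns run over all (x_i, x_j) pairs,
## with the first variable changing fastest
med2 <- apply(pdb2$fd, 2, median)
medgrid <- matrix(med2,
                  nrow = length(pdb2$levs[[1]]),  # first variable indexes the rows
                  ncol = length(pdb2$levs[[2]]),
                  dimnames = list(as.character(pdb2$levs[[1]]),
                                  as.character(pdb2$levs[[2]])))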

Arguments

x.train

Explanatory variables for training (in sample) data.
Must be a matrix (typeof double) with (as usual) rows corresponding to observations and columns to variables.
Note that for a categorical variable you need to use dummies, and if there are more than two categories, you need to include all of the dummies (unlike in linear regression); see the sketch following this argument list.

y.train

Dependent variable for training (in sample) data.
Must be a vector (typeof double) with length equal to the number of observations (equal to the number of rows of x.train).

xind

Integer vector indicating which variables are to be plotted.
In pdbart, the variables (columns of x.train) for which plots are to be constructed.
In plot.pdbart, the indices into the list returned by pdbart for which plots are to be constructed.
In pd2bart, an integer vector of length 2 indicating the pair of variables (columns of x.train) to plot.

levs

Gives the values of the variables at which the plot is to be constructed: a list whose \(i^{th}\) component gives the values for the \(i^{th}\) variable.
In pdbart, should have same length as xind.
In pd2bart, should have length 2.
See also argument levquants.

levquants

If levs is NULL, the values of each variable used in the plot are set to the quantiles (in x.train) indicated by levquants (see the sketch following this argument list).
Double vector.

pl

For pdbart and pd2bart: if TRUE, a plot is made (by calling plot.*).

plquants

In the plots, beliefs about \(f(x)\) are indicated by plotting the posterior median and a lower and upper quantile. plquants is a double vector of length two giving the lower and upper quantiles.

...

Additional arguments.
In pdbart and pd2bart, passed on to bart.
In plot.pdbart, passed on to plot.
In plot.pd2bart, passed on to image.

x

For plot.*, object returned from pdbart or pd2bart.

cols

Vector of two colors.
First color is for median of \(f\), second color is for the upper and lower quantiles.

contour.color

Color for contours plotted on top of the image.

justmedian

Boolean. If TRUE, just one plot is created, for the median of the \(f(x)\) draws. If FALSE, three plots are created: one for the median and two additional ones for the lower and upper quantiles. In that case, mfrow is set to c(1,3). Usage sketches for these arguments follow this list.
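As noted under x.train and levquants, here is a small, purely illustrative sketch of building a dummy-coded design matrix and of (roughly) how default levels are obtained from levquants. The names region and xcont are hypothetical, and the quantile call is an approximation of the internal construction, not the exact code.

## illustrative only: a 3-level categorical must enter x.train as a full set of dummies
region <- factor(sample(c('north','south','west'), 100, replace=TRUE))  # hypothetical factor
xdum  <- model.matrix(~ region - 1)   # one 0/1 column per level (no intercept)
xcont <- runif(100)                   # hypothetical continuous predictor
xtrain <- cbind(xcont, xdum)          # matrix of doubles, as required

## roughly how default levels are chosen when levs = NULL:
## the sample quantiles of each column at the levquants probabilities
levquants <- c(.05, (1:9)/10, .95)
quantile(xtrain[, 'xcont'], probs = levquants)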
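A usage sketch for the plotting arguments above, assuming the pdb1 and pdb2 objects from the Examples below (the particular colors and quantiles are arbitrary choices):

## one-variable plots: red posterior median, grey .10/.90 quantile curves,
## restricted to the second component stored in pdb1$fd
plot(pdb1, xind = 2, cols = c('red','grey'), plquants = c(.10,.90))

## two-variable plot: median plus both quantile surfaces (mfrow is set to c(1,3)),
## with black contour lines drawn over each image
plot(pdb2, justmedian = FALSE, plquants = c(.10,.90), contour.color = 'black')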

Author

Hugh Chipman: hugh.chipman@gmail.com.
Robert McCulloch: robert.e.mcculloch@gmail.com.

Details

We divide the predictor vector \(x\) into a subgroup of interest, \(x_s\) and the complement \(x_c=x\setminus x_s\). A prediction \(f(x)\) can then be written as \(f(x_s,x_c)\). To estimate the effect of \(x_s\) on the prediction, Friedman suggests the partial dependence function $$ f_s(x_s) = \frac{1}{n}\sum_{i=1}^n f(x_s,x_{ic}) $$ where \(x_{ic}\) is the \(i^{th}\) observation of \(x_c\) in the data. Note that \((x_s,x_{ic})\) will generally not be one of the observed data points. Using BART it is straightforward to then estimate and even obtain uncertainty bounds for \(f_s(x_s)\). A draw of \(f^*_s(x_s)\) from the induced BART posterior on \(f_s(x_s)\) is obtained by simply computing \(f^*_s(x_s)\) as a byproduct of each MCMC draw \(f^*\). The median (or average) of these MCMC draws \(f^*_s(x_s)\) then yields an estimate of \(f_s(x_s)\), and lower and upper quantiles can be used to obtain intervals for \(f_s(x_s)\).

In pdbart \(x_s\) consists of a single variable in \(x\) and in pd2bart it is a pair of variables.

This is a computationally intensive procedure. For example, in pdbart, to compute the partial dependence plot for 5 values of \(x_s\), we need to compute \(f(x_s,x_c)\) for all possible \((x_s,x_{ic})\), and there are \(5n\) of these where \(n\) is the sample size. All of that computation is done for each kept BART draw. For this reason, running BART with keepevery larger than 1 (e.g., 10) makes the procedure much faster.
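The following sketch spells out this construction by hand for a single variable, calling bart directly with a stacked test matrix; it is what pdbart automates. It assumes the simulated x and y from the Examples below, and the object names are illustrative.

library(BayesTree)
xs <- seq(-1, 1, by = 0.5)                    # 5 values of x_s (here x_s = column 1 of x)
xtest <- do.call(rbind, lapply(xs, function(v) {
  xt <- x                                     # x from the Examples below
  xt[, 1] <- v                                # replace column 1 by the value v everywhere
  xt
}))                                           # 5*n test rows, as described above
bf <- bart(x, y, x.test = xtest, ntree = 100, keepevery = 10,
           nskip = 100, ndpost = 200, verbose = FALSE)
## yhat.test has one row per kept draw and one column per test observation;
## averaging each block of n columns gives draws of f_s at the corresponding x_s value
fs.draws <- sapply(seq_along(xs), function(j) {
  cols <- ((j - 1) * nrow(x) + 1):(j * nrow(x))
  rowMeans(bf$yhat.test[, cols])
})
plot(xs, apply(fs.draws, 2, median), type = 'b',
     xlab = 'x_s', ylab = 'estimated f_s(x_s)')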

References

Chipman, H., George, E., and McCulloch, R. (2010). BART: Bayesian Additive Regression Trees. The Annals of Applied Statistics, 4(1), 266-298.

Examples

##simulate data 
f = function(x) { return(.5*x[,1] + 2*x[,2]*x[,3]) }
sigma=.2 # y = f(x) + sigma*z
n=100 #number of observations
set.seed(27)
x = matrix(2*runif(n*3)-1,ncol=3) ; colnames(x) = c('rob','hugh','ed')
Ey = f(x)
y = Ey +  sigma*rnorm(n)
lmFit = lm(y~.,data.frame(x,y)) #compare lm fit to BART later
par(mfrow=c(1,3)) #first two for pdbart, third for pd2bart
##pdbart: one dimensional partial dependence plot
set.seed(99)
pdb1 = pdbart(x,y,xind=c(1,2),
   levs=list(seq(-1,1,.2),seq(-1,1,.2)),pl=FALSE,
   keepevery=10,ntree=100,nskip=100,ndpost=200) #should run longer!
plot(pdb1,ylim=c(-.6,.6))
##pd2bart: two dimensional partial dependence plot
set.seed(99)
pdb2 = pd2bart(x,y,xind=c(2,3),
   levquants=c(.05,.1,.25,.5,.75,.9,.95),pl=FALSE,
   ntree=100,keepevery=10,verbose=FALSE,nskip=100,ndpost=200) #should run longer!
plot(pdb2)
##compare BART fit to linear model and truth = Ey
fitmat = cbind(y,Ey,lmFit$fitted,pdb1$yhat.train.mean)
colnames(fitmat) = c('y','Ey','lm','bart')
print(cor(fitmat))
## plot.bart(pdb1) displays the BART run used to get the plot.
