pgbart (version 0.6.16)

pgbart_predict: Make Predictions Using Bayesian Additive Regression Trees

Description

Make predictions on a new test data set after a model has been built from training data by the function pgbart_train.

Usage

pgbart_predict(x.test, model)

Arguments

x.test

Explanatory variables for test (out-of-sample) data. Should have the same structure (same columns, in the same order) as x.train in pgbart_train.
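
For instance, a quick structural check before predicting (x.train and x.test here are hypothetical user objects, not part of the package API):

## x.test must have the same columns, in the same order, as x.train
stopifnot(ncol(x.test) == ncol(x.train))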

model

The path to the model file written by pgbart_train (the model argument passed to pgbart_train).

Value

pgbart_predict returns a list assigned class ‘pgbart’. In the numeric \(y\) case, the list has components:

yhat.test

A matrix with (ndpost/keepevery) rows and nrow(x.test) columns. Each row corresponds to a draw \(f^*\) from the posterior of \(f\) and each column corresponds to a row of x.test. The \((i,j)\) value is \(f^*(x_j)\) for the \(i^{th}\) kept draw of \(f\) and the \(j^{th}\) row \(x_j\) of x.test. Burn-in draws are dropped.
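
Since each row of yhat.test is one posterior draw, per-observation posterior summaries come from operating on its columns. A minimal sketch, assuming a prediction object named pgbartPredict as in the Examples below:

## posterior mean and 95% credible interval for each test observation
post.mean <- colMeans(pgbartPredict$yhat.test)
post.ci   <- apply(pgbartPredict$yhat.test, 2, quantile, probs = c(0.025, 0.975))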

yhat.test.mean

Test data fits, computed as the mean of the yhat.test columns. Only exists when \(y\) is not binary.

In the binary \(y\) case, the returned list has the components yhat.test and binaryOffset.

Note that in the binary \(y\) case, yhat.test is \(f(x)\) + binaryOffset. If you want draws of the probability \(P(Y = 1 \mid x)\), you need to apply the normal cdf (pnorm) to these values.
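
For example, a minimal sketch of converting the returned draws to probabilities, again assuming a prediction object named pgbartPredict as in the Examples below:

## convert draws of f(x) + binaryOffset to draws of P(Y = 1 | x), then average
prob.draws <- pnorm(pgbartPredict$yhat.test)   # one probability draw per row and test observation
prob.mean  <- colMeans(prob.draws)             # posterior mean probability per test observation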

Details

PGBART is a Bayesian MCMC method. At each MCMC iteration, we produce a draw from the joint posterior \((f,\sigma) \mid (x,y)\) in the numeric \(y\) case, and just from the posterior of \(f\) in the binary \(y\) case.

Thus, unlike many other modelling methods in R, pgbart does not produce a single model object from which fits and summaries may be extracted. The output consists of values \(f^*(x)\) (and \(\sigma^*\) in the numeric case), where \(*\) denotes a particular draw and \(x\) is a row of the test data (x.test).
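
One practical consequence is that posterior uncertainty at any test point is available directly from the draws. A minimal sketch, assuming the pgbartPredict object from the Examples below:

## posterior draws of f at the first test observation
draws.1 <- pgbartPredict$yhat.test[, 1]
c(mean = mean(draws.1), sd = sd(draws.1))   # point estimate and posterior spread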

References

Chipman, H., George, E., and McCulloch, R. (2010) Bayesian Additive Regression Trees. The Annals of Applied Statistics, 4(1), 266-298.

Lakshminarayanan, B., Roy, D., and Teh, Y. W. (2015) Particle Gibbs for Bayesian Additive Regression Trees. Artificial Intelligence and Statistics, 553-561.

Chipman, H., George, E., and McCulloch, R. (2006) Bayesian Ensemble Learning. Advances in Neural Information Processing Systems 19, Scholkopf, Platt and Hoffman, Eds., MIT Press, Cambridge, MA, 265-272.

Friedman, J. H. (1991) Multivariate Adaptive Regression Splines. The Annals of Statistics, 19, 1-67.

Breiman, L. (1996) Bias, Variance, and Arcing Classifiers. Tech. Rep. 460, Statistics Department, University of California, Berkeley, CA, USA.

See Also

pgbart_train, pdpgbart

Examples

# NOT RUN {
## Example 1: simulated continuous outcome data (example from section 4.3 of Friedman's MARS paper)
f = function(x) {
    10*sin(pi*x[,1]*x[,2]) + 20*(x[,3]-.5)^2 + 10*x[,4] + 5*x[,5]
}
sigma = 1.0  #y = f(x) + sigma*z , z~N(0,1)
n = 100      #number of observations
set.seed(99)
x = matrix(runif(n*10), n, 10) 
Ey = f(x)
y = Ey+sigma*rnorm(n)
model_path = file.path(tempdir(),'pgbart.model')
pgbartFit = pgbart_train(x[1:(n*.75),], y[1:(n*.75)], 
                         model=model_path,
                         ndpost=200, ntree=5, usepg=TRUE)

pgbartPredict = pgbart_predict(x[(n*.75+1):n,], model=model_path)

cor(pgbartPredict$yhat.test.mean, y[(n*.75+1):n])

## Example 2: simulated binary outcome data (the two-normal example from Breiman)
f <- function (n, d = 20) 
{
  x <- matrix(0, nrow = n, ncol = d)
  c1 <- sample.int(n, n/2)
  c2 <- (1:n)[-c1]
  a <- 2/sqrt(d)
  x[c1, ] <- matrix(rnorm(n = d * length(c1), mean = -a), ncol = d)
  x[c2, ] <- matrix(rnorm(n = d * length(c2), mean = a), ncol = d)
  
  x.train <- x
  y.train <- rep(0, n)
  y.train[c2] <- 1
  list(x.train=x.train, y.train=as.factor(y.train))
}


set.seed(99)
n <- 200
train <- f(n)
model_path = file.path(tempdir(),'pgbart.model')
pgbartFit = pgbart_train(train$x.train[1:(n*.75),], train$y.train[1:(n*.75)], 
                         model=model_path, ndpost=200, ntree=5, usepg=TRUE)
pgbartPredict = pgbart_predict(train$x.train[(n*.75+1):n,], model=model_path)
## average P(Y = 1 | x) over posterior draws, then classify by thresholding at 0.5
class.pred = ifelse(colMeans(pnorm(pgbartPredict$yhat.test)) <= 0.5, 0, 1)
table(class.pred, train$y.train[(n*.75+1):n])
# }
