quantForestError: Quantify random forest prediction error

Description

Estimates the conditional mean squared prediction errors, conditional biases, conditional prediction intervals, and conditional error distributions of random forest predictions.

Usage

quantForestError(forest, X.train, X.test, Y.train = NULL,
  what = c("mspe", "bias", "interval", "p.error", "q.error"),
  alpha = 0.05, n.cores = 1)

Arguments

forest

The random forest object being used for prediction.

X.train

A matrix or data.frame with the observations that were used to train forest; each row should be an observation, and each column should be a predictor variable.

X.test

A matrix or data.frame with the observations to be predicted; each row should be an observation, and each column should be a predictor variable.

Y.train

A vector of the responses of the observations that were used to train forest. Required if forest was created using ranger, but not if forest was created using randomForest, randomForestSRC, or quantregForest.

what

A character vector indicating which estimates to compute. Possible options are conditional mean squared prediction errors ("mspe"), conditional biases ("bias"), conditional prediction intervals ("interval"), conditional error distribution functions ("p.error"), and conditional error quantile functions ("q.error"). By default, all five are computed.

alpha

The type-I error rate desired for the conditional prediction intervals, so that the intervals have nominal coverage 1 - alpha (the default of 0.05 gives 95% intervals); required if "interval" is included in what.

n.cores

Number of cores to use (for parallel computation).

Value

A data.frame with one or more of the following columns, as described in the details section:

pred

The random forest predictions of the test observations

mspe

The estimated conditional mean squared prediction errors of the random forest predictions

bias

The estimated conditional biases of the random forest predictions

lower

The estimated lower bounds of the conditional prediction intervals for the test observations

upper

The estimated upper bounds of the conditional prediction intervals for the test observations

In addition, one or both of the following functions, as described in the details section:

perror

The estimated cumulative distribution functions of the conditional error distributions associated with the test predictions

qerror

The estimated quantile functions of the conditional error distributions associated with the test predictions

Details

When training the random forest using randomForest, ranger, or quantregForest, keep.inbag must be set to TRUE. When training the random forest using randomForestSRC, membership must be set to TRUE.
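
As an illustration of these requirements, the following is a minimal sketch of training a forest with ranger and passing the training responses through Y.train. It assumes the ranger and forestError packages are installed; the object names (aq, train, test, rf.ranger, est) and the seed are illustrative choices, not part of the package.

```r
# Sketch only: assumes the ranger and forestError packages are installed.
library(forestError)

data(airquality)
aq <- airquality[complete.cases(airquality), ]

set.seed(1)  # arbitrary seed, only to make the split reproducible
train.ind <- sample(nrow(aq), floor(0.9 * nrow(aq)))
train <- aq[train.ind, ]
test <- aq[-train.ind, ]

# keep.inbag = TRUE must be set when training with ranger
rf.ranger <- ranger::ranger(Ozone ~ ., data = train, num.trees = 500,
                            min.node.size = 5, keep.inbag = TRUE)

# Y.train is required for ranger forests (see the Y.train argument)
est <- quantForestError(rf.ranger,
                        X.train = train[, -1],
                        X.test = test[, -1],
                        Y.train = train$Ozone,
                        what = c("mspe", "interval"),
                        alpha = 0.05)
head(est)
```

Because neither "p.error" nor "q.error" is requested here, the result is expected to be a single data.frame with one row per test observation.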

The computation can be parallelized by setting the value of n.cores to be greater than 1.
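
One possible way to choose a value for n.cores is sketched below. It uses detectCores() from the parallel package that ships with R and leaves one core free; that policy is an assumption made for illustration, not a recommendation from the package.

```r
# detectCores() can return NA on some platforms, so fall back to 1
cores <- parallel::detectCores()
n.cores <- if (is.na(cores)) 1L else max(1L, cores - 1L)
n.cores
```

The resulting value could then be supplied as the n.cores argument of quantForestError.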

The random forest predictions are always returned in a data.frame. Further columns are added according to the argument what: "mspe" adds a column with the conditional mean squared prediction error of each test prediction; "bias" adds a column with the conditional bias of each test prediction; and "interval" adds two columns with the lower and upper bounds of a conditional prediction interval for each test prediction.

If "p.error" or "q.error" is included in what, then a list will be returned as output. The first element of the list, named "estimates", is the data.frame described in the above paragraph. The other one or two elements of the list are the estimated cumulative distribution functions (perror) and/or the estimated quantile functions (qerror) of the conditional error distributions associated with the test predictions.
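
Since the return type depends on what, downstream code may want to handle both cases. The helper below is a hypothetical sketch (get_estimates is not part of the package) that returns the estimates data.frame from either form of output. The mock objects only stand in for real output, and the placeholder functions do not reflect the actual signatures of perror and qerror.

```r
# Hypothetical helper (not part of the package): return the estimates
# data.frame whether the output is a data.frame or a list.
get_estimates <- function(output) {
  # a data.frame is itself a list, so test for data.frame first
  if (is.data.frame(output)) {
    return(output)
  }
  output$estimates
}

# mock objects standing in for real quantForestError output
mock.df <- data.frame(pred = c(10, 20), mspe = c(4, 9))
mock.list <- list(estimates = mock.df,
                  perror = function(...) NULL,  # placeholder only
                  qerror = function(...) NULL)  # placeholder only

identical(get_estimates(mock.df), get_estimates(mock.list))
```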

Examples

# NOT RUN {
# load data
data(airquality)

# remove observations with missing values (response or predictors)
airquality <- airquality[complete.cases(airquality), ]

# get number of observations and the response column index
n <- nrow(airquality)
response.col <- 1

# split data into training and test sets
train.ind <- sample(seq_len(n), floor(n * 0.9), replace = FALSE)
Xtrain <- airquality[train.ind, -response.col]
Ytrain <- airquality[train.ind, response.col]
Xtest <- airquality[-train.ind, -response.col]
Ytest <- airquality[-train.ind, response.col]

# fit random forest to the training data
rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5,
                                 ntree = 500,
                                 keep.inbag = TRUE)

# estimate conditional mean squared prediction errors,
# biases, prediction intervals, and error distribution
# functions for the test observations
output <- quantForestError(rf, Xtrain, Xtest,
                           alpha = 0.05)

# do the same as above but in parallel
output <- quantForestError(rf, Xtrain, Xtest, alpha = 0.05,
                           n.cores = 2)

# estimate just the conditional mean squared prediction errors
# and prediction intervals for the test observations
output <- quantForestError(rf, Xtrain, Xtest,
                           what = c("mspe", "interval"),
                           alpha = 0.05)

# estimate just the conditional error distribution and
# quantile functions for the test observations
output <- quantForestError(rf, Xtrain, Xtest,
                           what = c("p.error", "q.error"))
# }
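
When test responses are available, as Ytest is in the example above, one simple sanity check is the empirical coverage of the prediction intervals. The helper below is a hypothetical sketch (interval_coverage is not part of the package) and the numbers are made up for illustration.

```r
# Hypothetical helper: fraction of observed responses that fall
# inside their [lower, upper] prediction intervals.
interval_coverage <- function(y, lower, upper) {
  stopifnot(length(y) == length(lower), length(y) == length(upper))
  mean(y >= lower & y <= upper)
}

# made-up values, for illustration only
y.obs <- c(12, 30, 45, 80)
lower <- c(5, 20, 50, 60)
upper <- c(25, 40, 70, 100)

interval_coverage(y.obs, lower, upper)  # 3 of 4 covered: 0.75
```

With output from a call that includes "interval" in what, the same helper could be applied to Ytest and the lower and upper columns described under Value. With alpha = 0.05 the coverage should be near 0.95, although on a test set this small the estimate will be noisy.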