quantForestError: Quantify random forest prediction error

Description

Estimates the conditional misclassification rates, conditional mean squared prediction errors, conditional biases, conditional prediction intervals, and conditional error distributions of random forest predictions.

Usage

quantForestError(
  forest,
  X.train,
  X.test,
  Y.train = NULL,
  what = if (grepl("class", c(forest$type, forest$family, forest$treetype), TRUE))
    "mcr" else c("mspe", "bias", "interval", "p.error", "q.error"),
  alpha = 0.05,
  train_nodes = NULL,
  return_train_nodes = FALSE,
  n.cores = 1
)

Arguments

forest

The random forest object being used for prediction.

X.train

A matrix or data.frame with the observations that were used to train forest. Each row should be an observation, and each column should be a predictor variable.

X.test

A matrix or data.frame with the observations to be predicted; each row should be an observation, and each column should be a predictor variable.

Y.train

A vector of the responses of the observations that were used to train forest. Required if forest was created using ranger, but not if forest was created using randomForest, randomForestSRC, or quantregForest.

what

A character vector indicating which estimates are desired. Possible options are conditional mean squared prediction errors ("mspe"), conditional biases ("bias"), conditional prediction intervals ("interval"), conditional error distribution functions ("p.error"), conditional error quantile functions ("q.error"), and conditional misclassification rates ("mcr"). Note that the conditional misclassification rate is available only for categorical outcomes, while the other estimates are available only for real-valued outcomes.

alpha

A vector of type-I error rates desired for the conditional prediction intervals; required if "interval" is included in what.

train_nodes

A data.table indicating what out-of-bag prediction errors each terminal node of each tree in forest contains. It should be formatted like the output of findOOBErrors. If not provided, it will be computed internally.

return_train_nodes

A logical value indicating whether to return the train_nodes that were computed internally or supplied by the user.

n.cores

Number of cores to use. This affects only forests built with ranger, whose predictions can be computed in parallel.
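The default for what shown in the Usage section depends on whether forest is a classification forest. The following is a minimal base-R sketch of that test, not the package's internal code. The mock lists stand in for fitted forests, and the field values they carry (type for randomForest, treetype for ranger, family for randomForestSRC) are assumptions about how those packages label their objects. default_what is a hypothetical helper written for illustration.

```r
# Hypothetical stand-ins for fitted forests. Only the field that
# records the forest type is mocked; the values are assumptions.
mock_randomForest <- list(type = "classification")
mock_ranger <- list(treetype = "Regression")
mock_rfsrc <- list(family = "class")

# Illustrative helper (not part of the package) that mirrors the
# expression used for the default value of `what` in the Usage section.
default_what <- function(forest) {
  # Fields that are absent are NULL and drop out of c(), so exactly
  # one label is tested for each mock object above.
  label <- c(forest$type, forest$family, forest$treetype)
  if (grepl("class", label, ignore.case = TRUE)) {
    "mcr"
  } else {
    c("mspe", "bias", "interval", "p.error", "q.error")
  }
}

default_what(mock_randomForest)
default_what(mock_ranger)
default_what(mock_rfsrc)
```

The match ignores case, so a label such as "Classification" or "class" selects "mcr", while any label that does not contain "class" selects the regression estimates.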

Value

A data.frame with one or more of the following columns, as described in the details section:

pred

The random forest predictions of the test observations

mspe

The estimated conditional mean squared prediction errors of the random forest predictions

bias

The estimated conditional biases of the random forest predictions

lower_alpha

The estimated lower bounds of the conditional alpha-level prediction intervals for the test observations

upper_alpha

The estimated upper bounds of the conditional alpha-level prediction intervals for the test observations

mcr

The estimated conditional misclassification rate of the random forest predictions

In addition, one or both of the following functions, as described in the details section:

perror

The estimated cumulative distribution functions of the conditional error distributions associated with the test predictions

qerror

The estimated quantile functions of the conditional error distributions associated with the test predictions

In addition, if return_train_nodes is TRUE, then a data.table called train_nodes indicating what out-of-bag prediction errors each terminal node of each tree in forest contains.

Details

This function accepts classification or regression random forests built using the randomForest, ranger, randomForestSRC, and quantregForest packages. When training the random forest using randomForest, ranger, or quantregForest, keep.inbag must be set to TRUE. When training the random forest using randomForestSRC, membership must be set to TRUE.

The predictions computed by ranger can be parallelized by setting the value of n.cores to be greater than 1.
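The requirements above for ranger forests can be sketched as follows: the forest is trained with keep.inbag = TRUE, Y.train is supplied, and n.cores is set above 1. This is an illustrative sketch that assumes the forestError and ranger packages are installed. The seed, the 90-row training split, and the object names are arbitrary choices made for the example.

```r
library(forestError)

# Illustrative split of the complete airquality cases
set.seed(1)
aq <- airquality[complete.cases(airquality), ]
train <- sample(nrow(aq), 90)
Xtrain <- aq[train, -1]   # predictors (Ozone is column 1)
Ytrain <- aq[train, 1]    # response
Xtest <- aq[-train, -1]

# ranger must keep the in-bag counts at training time
rf_ranger <- ranger::ranger(Ozone ~ ., data = aq[train, ],
                            num.trees = 500, keep.inbag = TRUE)

# Y.train is required for ranger forests; n.cores > 1 requests
# parallel computation of the ranger predictions
est <- quantForestError(rf_ranger, Xtrain, Xtest,
                        Y.train = Ytrain,
                        what = c("mspe", "bias"),
                        n.cores = 2)
head(est)
```

Because neither "p.error" nor "q.error" is requested and return_train_nodes is left at FALSE, est should be a plain data.frame with one row per test observation.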

The random forest predictions are always returned, in a data.frame. Additional columns are included depending on the entries of what: "mspe" adds a column with the conditional mean squared prediction error of each test prediction; "bias" adds a column with the conditional bias of each test prediction; "interval" adds columns with the lower and upper bounds of the conditional prediction intervals for each test prediction; and "mcr" adds a column with the conditional misclassification rate of each test prediction. The conditional misclassification rate can be estimated only for classification random forests, while the other quantities can be estimated only for regression random forests.
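As a small illustration of requesting prediction intervals at more than one level, the sketch below passes a vector to alpha and asks only for "interval". It assumes the forestError and randomForest packages are installed; the seed and split are arbitrary. The Value section lists the bound columns generically as lower_alpha and upper_alpha, so the sketch prints the column names rather than assuming their exact form.

```r
library(forestError)

# Illustrative split of the complete airquality cases
set.seed(1)
aq <- airquality[complete.cases(airquality), ]
train <- sample(nrow(aq), 90)
Xtrain <- aq[train, -1]   # predictors (Ozone is column 1)
Ytrain <- aq[train, 1]    # response
Xtest <- aq[-train, -1]

# randomForest must be trained with keep.inbag = TRUE
rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5,
                                 ntree = 500, keep.inbag = TRUE)

# Request intervals for two type-I error rates at once
ints <- quantForestError(rf, Xtrain, Xtest,
                         what = "interval",
                         alpha = c(0.05, 0.10))

names(ints)   # inspect how the bound columns are labeled
head(ints)
```

Smaller values of alpha correspond to higher nominal coverage, so the bounds for alpha = 0.05 would be expected to be at least as wide as those for alpha = 0.10.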

If "p.error" or "q.error" is included in what, or if return_train_nodes is set to TRUE, then a list is returned instead. The first element of the list, named "estimates", is the data.frame of predictions and estimates described above. The other elements are, as applicable, the estimated cumulative distribution functions (perror) of the conditional error distributions, the estimated quantile functions (qerror) of the conditional error distributions, and the data.table (train_nodes) indicating which out-of-bag prediction errors each terminal node of each tree in the random forest contains.
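The list form of the output can be inspected as in the sketch below, which assumes the forestError and randomForest packages are installed and uses an arbitrary seed and split. How perror and qerror are called (their argument names and order) is not specified in this section, so the sketch only confirms that they are functions and does not call them.

```r
library(forestError)

# Illustrative split of the complete airquality cases
set.seed(1)
aq <- airquality[complete.cases(airquality), ]
train <- sample(nrow(aq), 90)
Xtrain <- aq[train, -1]   # predictors (Ozone is column 1)
Ytrain <- aq[train, 1]    # response
Xtest <- aq[-train, -1]

rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5,
                                 ntree = 500, keep.inbag = TRUE)

# Including "p.error" or "q.error" switches the return value to a list
out <- quantForestError(rf, Xtrain, Xtest,
                        what = c("mspe", "p.error", "q.error"))

names(out)              # element names of the returned list
head(out$estimates)     # data.frame of predictions and estimates
is.function(out$perror) # conditional error CDFs
is.function(out$qerror) # conditional error quantile functions
```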

Examples


# NOT RUN {
# load the package and the data
library(forestError)
data(airquality)

# remove observations with missing predictor variable values
airquality <- airquality[complete.cases(airquality), ]

# get number of observations and the response column index
n <- nrow(airquality)
response.col <- 1

# split data into training and test sets
train.ind <- sample(c("A", "B", "C"), n,
                    replace = TRUE, prob = c(0.8, 0.1, 0.1))
Xtrain <- airquality[train.ind == "A", -response.col]
Ytrain <- airquality[train.ind == "A", response.col]
Xtest1 <- airquality[train.ind == "B", -response.col]
Xtest2 <- airquality[train.ind == "C", -response.col]

# fit regression random forest to the training data
rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5,
                                 ntree = 500,
                                 keep.inbag = TRUE)

# estimate conditional mean squared prediction errors,
# biases, prediction intervals, and error distribution
# functions for the observations in Xtest1. return
# train_nodes to avoid recomputation in the next
# line of code.
output1 <- quantForestError(rf, Xtrain, Xtest1,
                            return_train_nodes = TRUE)

# estimate just the conditional mean squared prediction errors
# and prediction intervals for the observations in Xtest2.
# avoid recomputation by providing train_nodes from the
# previous line of code.
output2 <- quantForestError(rf, Xtrain, Xtest2,
                            what = c("mspe", "interval"),
                            train_nodes = output1$train_nodes)

# for illustrative purposes, convert response to categorical
Ytrain <- as.factor(Ytrain > 31.5)

# fit classification random forest to the training data
rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 3,
                                 ntree = 500,
                                 keep.inbag = TRUE)

# estimate conditional misclassification rate of the
# predictions of Xtest1
output <- quantForestError(rf, Xtrain, Xtest1)

# }
