
trimTrees (version 1.0)

trimTrees: Trimmed Opinion Pools of Trees in Random Forest

Description

This function creates point and probability forecasts from the trees in a random forest using Jose et al.'s trimmed opinion pool, a trimmed average of the trees' empirical cumulative distribution functions (cdf). For tuning purposes, the user can input the trimming level used in this trimmed average and then compare the scores of the trimmed and untrimmed opinion pools, or ensembles.
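For example, a minimal tuning sketch (assuming xtrain, ytrain, xtest, and ytest are already prepared, as in the Examples section below) might compare the mean scores of both pools across a small grid of trimming levels:

# Sketch only: compare trimmed vs. untrimmed ensemble scores over several trim levels
for (tr in c(0, 0.1, 0.2)) {
  fit <- trimTrees(xtrain, ytrain, xtest, ytest, ntree = 500, trim = tr)
  cat("trim =", tr, "\n")
  print(colMeans(fit$trimmedEnsembleScores))
  print(colMeans(fit$untrimmedEnsembleScores))
}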

Usage

trimTrees(xtrain, ytrain, xtest, ytest, ntree = 500, 
          mtry = if (!is.null(ytrain) && !is.factor(ytrain)) 
          max(floor(ncol(xtrain)/3), 1) else floor(sqrt(ncol(xtrain))), 
          nodesize = if (!is.null(ytrain) && !is.factor(ytrain)) 5 else 1, 
          trim = 0, trimIsExterior = TRUE, 
          uQuantiles = seq(0.05, 0.95, 0.05), methodIsCDF = TRUE)

Arguments

xtrain
A data frame or a matrix of predictors for the training set.
ytrain
A response vector for the training set. If a factor, classification is assumed, otherwise regression is assumed.
xtest
A data frame or a matrix of predictors for the testing set.
ytest
A response vector for the testing set.
ntree
Number of trees to grow.
mtry
Number of variables randomly sampled as candidates at each split.
nodesize
Minimum size of terminal nodes.
trim
The trimming level used in the trimmed average of the trees' empirical cdfs. For the cdf approach, the trimming level is the fraction of cdf values to be trimmed from each end of the ordered vector of cdf values (for each support point) before the average is taken (see the illustration following this argument list).
trimIsExterior
If TRUE, the trimming is done exteriorly, or from the ends of the ordered vector. If FALSE, the trimming is done interiorly, or from the middle of the ordered vector.
uQuantiles
A vector of probabilities in a strictly increasing order and between 0 and 1. For instance, if uQuantiles=c(0.25,0.75), then the 0.25-quantile and the 0.75-quantile of the trimmed and untrimmed ensembles are scored.
methodIsCDF
If TRUE, the method for forming the trimmed opinion pool is according to the cdf approach in Jose et al (2014). If FALSE, the moment approach is used.
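
The following sketch (an illustration only, not the package's internal code) shows the exterior trimming described under trim and trimIsExterior, applied to a hypothetical ordered vector of cdf values at a single support point:

# Illustration only: exterior trimming of hypothetical cdf values with trim = 0.2
cdfVals <- sort(c(0.10, 0.35, 0.40, 0.55, 0.90))  # cdf values from 5 trees at one support point
k <- floor(0.2 * length(cdfVals))                 # number of values trimmed from each end
mean(cdfVals[(k + 1):(length(cdfVals) - k)])      # trimmed average; compare with mean(cdfVals)

With trimIsExterior = FALSE, values would instead be removed from the middle of the ordered vector before averaging.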

Value

An object of class trimTrees, which is a list with the following components:
  • forestSupport: Possible points of support for the trees and ensembles.
  • treeValues: For the last testing set row, each tree's ytrain values (not necessarily unique) that are both inbag and in the terminal node containing the xtest row. This component is an ntrain-by-ntree matrix, where ntrain is the number of rows in the training set.
  • treeCounts: For the last testing set row, each tree's counts of treeValues, tallied by their unique values. This component is an nSupport-by-ntree matrix, where nSupport is the number of unique ytrain values, or support points of the forest.
  • treeCumCounts: Cumulative tally of treeCounts, of dimension (nSupport+1)-by-ntree.
  • treeCDFs: Each tree's empirical cdf based on treeCumCounts, for the last testing set row only. This component is an (nSupport+1)-by-ntree matrix. Note that the first row of this matrix is all zeros.
  • treePMFs: Each tree's empirical probability mass function (pmf) for the last testing set row. This component is an nSupport-by-ntree matrix.
  • treeMeans: For each testing set row, each tree's mean according to its empirical pmf. This component is an ntest-by-ntree matrix, where ntest is the number of rows in the testing set.
  • treeVars: For each testing set row, each tree's variance according to its empirical pmf. This component is an ntest-by-ntree matrix.
  • treePITs: For each testing set row, each tree's probability integral transform (PIT), the empirical cdf evaluated at the realized ytest value. This component is an ntest-by-ntree matrix.
  • treeQuantiles: For the last testing set row, each tree's quantiles -- one for each element in uQuantiles. This component is an ntree-by-nQuantile matrix, where nQuantile is the number of elements in uQuantiles.
  • treeFirstPMFValues: For each testing set row, the pmf value on the minimum (or first) support point in the forest. For binary classification, this is the probability that the minimum (or first) support point occurs. This component's dimension is ntest-by-ntree. It is useful for generating calibration curves (stated probabilities in bins vs. their observed frequencies) for binary classification.
  • bracketingRate: For each testing set row, the bracketing rate from Larrick et al. (2011), computed as 2*p*(1-p), where p is the fraction of trees' means above the ytest value (see the worked illustration following this list).
  • bracketingRateAllPairs: The average bracketing rate across all testing set rows for each pair of trees. This component is a symmetric ntree-by-ntree matrix.
  • rfClassEnsembleCDFs: For classification, the cdf based on a pmf that is the renormalized vector of 'vote' counts from the trees in the forest. The pmf vector comes from a call to predict.randomForest with type="prob".
  • rfClassEnsembleQuantiles: The quantiles of rfClassEnsembleCDFs, one for each element in uQuantiles.
  • rfClassEnsembleComponentScores: For the last testing set row, the components of the linear and log quantile scores of rfClassEnsembleCDFs. See Jose and Winkler (2009) for a description of the linear and log quantile scores.
  • rfClassEnsembleScores: For each testing set row, the linear and log quantile scores and the ranked probability score of rfClassEnsembleCDFs. See Gneiting and Raftery (2007) for a description of the ranked probability score.
  • trimmedEnsembleCDFs: For each testing set row, the trimmed ensemble's forecast of ytest in the form of a cdf.
  • trimmedEnsemblePMFs: For each testing set row, the trimmed ensemble's pmf.
  • trimmedEnsembleMeans: For each testing set row, the trimmed ensemble's mean.
  • trimmedEnsembleVars: For each testing set row, the trimmed ensemble's variance.
  • trimmedEnsembleQuantiles: For the last testing set row, the trimmed ensemble's quantiles -- one for each element in uQuantiles.
  • trimmedEnsembleComponentScores: For the last testing set row, the components of the trimmed ensemble's linear and log quantile scores.
  • trimmedEnsembleScores: For each testing set row, the trimmed ensemble's linear and log quantile scores, ranked probability score, and two-moment score. The two-moment score here is the score in Equation 27 of Gneiting and Raftery (2007).
  • untrimmedEnsembleCDFs: For each testing set row, the linear opinion pool's, or untrimmed ensemble's, forecast of ytest in the form of a cdf.
  • untrimmedEnsemblePMFs: For each testing set row, the untrimmed ensemble's pmf.
  • untrimmedEnsembleMeans: For each testing set row, the untrimmed ensemble's mean.
  • untrimmedEnsembleVars: For each testing set row, the untrimmed ensemble's variance.
  • untrimmedEnsembleQuantiles: For the last testing set row, the untrimmed ensemble's quantiles -- one for each element in uQuantiles.
  • untrimmedEnsembleComponentScores: For the last testing set row, the components of the untrimmed ensemble's linear and log quantile scores.
  • untrimmedEnsembleScores: For each testing set row, the untrimmed ensemble's linear and log quantile scores, ranked probability score, and two-moment score.
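
As a worked illustration of the bracketingRate component above (hypothetical values, not output of the function):

# Illustration only: bracketing rate 2*p*(1-p) for one testing set row
treeMeansRow <- c(9.8, 10.4, 11.1, 12.0, 12.6)  # hypothetical tree means for one row
yObs <- 11.5                                    # hypothetical realized ytest value
p <- mean(treeMeansRow > yObs)                  # fraction of tree means above ytest
2 * p * (1 - p)                                 # bracketing rate, here 0.48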

References

Gneiting T, Raftery AE (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102, 359-378.

Jose VRR, Grushka-Cockayne Y, Lichtendahl KC Jr (2014). Trimmed opinion pools and the crowd's calibration problem. Management Science 60, 463-475.

Jose VRR, Winkler RL (2009). Evaluating quantile assessments. Operations Research 57, 1287-1297.

Grushka-Cockayne Y, Jose VRR, Lichtendahl KC Jr (2014). Ensembling overfit and overconfident forecasts. Working paper.

Larrick RP, Mannes AE, Soll JB (2011). The social psychology of the wisdom of crowds. In J.I. Krueger, ed., Frontiers in Social Psychology: Social Judgment and Decision Making. New York: Psychology Press, 227-242.

See Also

hitRate, cinbag

Examples

# Load the packages and data
library(trimTrees)
library(mlbench)   # provides mlbench.friedman1
set.seed(201) # Can be removed; useful for replication
data <- as.data.frame(mlbench.friedman1(500, sd=1))
summary(data)

# Prepare data for trimming
train <- data[1:400, ]
test <- data[401:500, ]
xtrain <- train[,-11]  
ytrain <- train[,11]
xtest <- test[,-11]
ytest <- test[,11]
      
# Run trimTrees
set.seed(201) # Can be removed; useful for replication
trimming <- trimTrees(xtrain, ytrain, xtest, ytest, trim=0.15)

# Outputs from trimTrees
colMeans(trimming$trimmedEnsembleScores)
colMeans(trimming$untrimmedEnsembleScores)
mean(hitRate(trimming$treePITs))
hitRate(trimming$trimmedEnsemblePITs)
hitRate(trimming$untrimmedEnsemblePITs)
hist(trimming$trimmedEnsemblePITs, prob=TRUE)
hist(trimming$untrimmedEnsemblePITs, prob=TRUE)
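
# A further sketch (assuming the ensemble cdf components are matrices with one
# row per testing set row, as described under Value): compare the two ensembles'
# cdfs for the last testing set row.
lastRow <- nrow(xtest)
plot(trimming$trimmedEnsembleCDFs[lastRow, ], type = "s",
     xlab = "support point index", ylab = "cdf")
lines(trimming$untrimmedEnsembleCDFs[lastRow, ], type = "s", lty = 2)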
