trimTrees: Trimmed Opinion Pools of Trees in Random Forest

Description

This function creates point and probability forecasts from the trees in a random forest using the trimmed opinion pool of Jose et al. (2014), a trimmed average of the trees' empirical cumulative distribution functions (cdfs). For tuning purposes, the user can set the trimming level used in this trimmed average and then compare the scores of the trimmed and untrimmed opinion pools, or ensembles.

Usage

trimTrees(xtrain, ytrain, xtest, ytest = NULL, ntree = 500,
          mtry = if (!is.null(ytrain) && !is.factor(ytrain))
                   max(floor(ncol(xtrain)/3), 1) else floor(sqrt(ncol(xtrain))),
          nodesize = if (!is.null(ytrain) && !is.factor(ytrain)) 5 else 1,
          trim = 0, trimIsExterior = TRUE,
          uQuantiles = seq(0.05, 0.95, 0.05), methodIsCDF = TRUE)

Arguments

xtrain

A data frame or a matrix of predictors for the training set.

ytrain

A response vector for the training set. If a factor, classification is assumed, otherwise regression is assumed.

xtest

A data frame or a matrix of predictors for the testing set.

ytest

A response vector for the testing set. If no testing response is passed (ytest = NULL), probability integral transform (PIT) values and scores will be returned as NAs.

ntree

Number of trees to grow.

mtry

Number of variables randomly sampled as candidates at each split.

nodesize

Minimum size of terminal nodes.

trim

The trimming level used in the trimmed average of the trees' empirical cdfs. For the cdf approach, the trimming level is the fraction of cdf values to be trimmed from each end of the ordered vector of cdf values (for each support point) before the average is computed. For the moment approach, the trees' means are computed, ordered, and trimmed. The trimmed opinion pool using the moment approach is an average of the remaining trees.
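The cdf approach at a single support point can be sketched in base R as follows. The cdf values are made-up illustrative data, and the rounding rule (trimming floor(trim * n) values from each end) is an assumption; the package's internal rounding may differ. The helper name exteriorTrimmedMean is hypothetical and not part of trimTrees.

```r
# Hypothetical cdf values from 10 trees at one support point (illustrative data).
cdfValues <- c(0.10, 0.35, 0.40, 0.42, 0.45, 0.50, 0.55, 0.60, 0.80, 1.00)
trim <- 0.2

# Assumption: floor(trim * n) values are dropped from each end of the
# ordered vector before averaging.
exteriorTrimmedMean <- function(values, trim) {
  stopifnot(trim >= 0, trim < 0.5)
  n <- length(values)
  k <- floor(trim * n)
  sorted <- sort(values)
  mean(sorted[(k + 1):(n - k)])
}

trimmedValue <- exteriorTrimmedMean(cdfValues, trim)
untrimmedValue <- mean(cdfValues)

# Base R's mean() applies the same rule through its own trim argument.
baseTrimmedValue <- mean(cdfValues, trim = trim)
```

Repeating this calculation at every support point would give one trimmed cdf value per support point for a given test row.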

trimIsExterior

If TRUE, the trimming is done exteriorly, or from the ends of the ordered vector. If FALSE, the trimming is done interiorly, or from the middle of the ordered vector.
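To make the contrast concrete, the sketch below compares the two trimming directions on one ordered vector. How many values the package removes from the middle under interior trimming is not stated here, so the rule used (dropping floor(trim * n) values on each side of the middle) is an assumption, and interiorTrimmedMean is a hypothetical helper rather than a trimTrees function.

```r
# Hypothetical cdf values from 10 trees at one support point (illustrative data).
cdfValues <- c(0.10, 0.35, 0.40, 0.42, 0.45, 0.50, 0.55, 0.60, 0.80, 1.00)
trim <- 0.2

# Exterior trimming: base R drops floor(trim * n) values from each end.
exteriorValue <- mean(cdfValues, trim = trim)

# Interior trimming (assumed rule): drop floor(trim * n) values on each side
# of the middle and average what is left at the two ends. When n - 2k is odd,
# this sketch also drops the middle value.
interiorTrimmedMean <- function(values, trim) {
  n <- length(values)
  k <- floor(trim * n)
  nKeepEachEnd <- floor((n - 2 * k) / 2)
  stopifnot(nKeepEachEnd >= 1)
  sorted <- sort(values)
  keep <- c(seq_len(nKeepEachEnd), n + 1 - seq_len(nKeepEachEnd))
  mean(sorted[keep])
}

interiorValue <- interiorTrimmedMean(cdfValues, trim)
```

Exterior trimming discards the most extreme values, while interior trimming keeps only them, so the two results can differ noticeably when the values are skewed.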

uQuantiles

A vector of probabilities in a strictly increasing order and between 0 and 1. For instance, if uQuantiles=c(0.25,0.75), then the 0.25-quantile and the 0.75-quantile of the trimmed and untrimmed ensembles are scored.
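As an illustration of what a quantile of a discrete ensemble forecast looks like, the sketch below reads quantiles off a cdf defined on a few support points. The support points and pmf are made-up data, and the quantile convention (the smallest support point whose cdf is at least u) is an assumption, not necessarily the rule trimTrees applies. The helper discreteQuantile is hypothetical.

```r
# Hypothetical support points and ensemble pmf (illustrative data).
support <- c(1, 2, 4, 7, 9)
pmf <- c(0.10, 0.20, 0.40, 0.20, 0.10)
cdf <- cumsum(pmf)

uQuantiles <- c(0.25, 0.75)

# Check the documented requirements: strictly increasing, between 0 and 1.
stopifnot(!is.unsorted(uQuantiles, strictly = TRUE),
          all(uQuantiles > 0 & uQuantiles < 1))

# Assumed convention: the u-quantile is the smallest support point whose
# cdf value is at least u.
discreteQuantile <- function(u, support, cdf) {
  support[which(cdf >= u)[1]]
}

ensembleQuantiles <- sapply(uQuantiles, discreteQuantile,
                            support = support, cdf = cdf)
```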

methodIsCDF

If TRUE, the trimmed opinion pool is formed according to the cdf approach in Jose et al. (2014). If FALSE, the moment approach is used.
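The moment approach described under trim can be sketched as follows: compute each tree's mean, order the trees by mean, drop trees from each end, and average the forecasts of the trees that remain. The pmfs are made-up data for five hypothetical trees, and trimming floor(trim * ntree) trees from each end is an assumed rounding rule.

```r
# Hypothetical support points and tree pmfs, one column per tree
# (an nSupport-by-ntree matrix, illustrative data).
support <- c(1, 2, 3, 4)
treePMFs <- cbind(
  c(0.7, 0.3, 0.0, 0.0),
  c(0.2, 0.5, 0.3, 0.0),
  c(0.1, 0.4, 0.4, 0.1),
  c(0.0, 0.3, 0.5, 0.2),
  c(0.0, 0.0, 0.2, 0.8)
)
trim <- 0.2
ntree <- ncol(treePMFs)

# Each tree's mean according to its pmf.
treeMeans <- as.vector(support %*% treePMFs)

# Assumption: floor(trim * ntree) trees are dropped from each end of the
# ordering by mean (exterior trimming).
k <- floor(trim * ntree)
keep <- order(treeMeans)[(k + 1):(ntree - k)]

# The trimmed pool is the average of the remaining trees' forecasts.
trimmedPMF <- rowMeans(treePMFs[, keep, drop = FALSE])
trimmedCDF <- cumsum(trimmedPMF)
```

Unlike the cdf approach, which trims separately at each support point, this sketch removes whole trees, so the same set of trees contributes at every support point.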

Value

forestSupport: Possible points of support for the trees and ensembles.
treeValues: For the last testing set row, each tree's ytrain values (not necessarily unique) that are both inbag and in the xtest row's terminal node. This component is an ntrain-by-ntree matrix, where ntrain is the number of rows in the training set.
treeCounts: For the last testing set row, each tree's counts of treeValues, listed by their unique values. This component is an nSupport-by-ntree matrix, where nSupport is the number of unique ytrain values, or support points of the forest.
treeCumCounts: Cumulative tally of treeCounts. This component is an (nSupport+1)-by-ntree matrix.
treeCDFs: Each tree's empirical cdf based on treeCumCounts for the last testing set row only. This component is an (nSupport+1)-by-ntree matrix. Note that the first row in this matrix is all zeros.
treePMFs: Each tree's empirical probability mass function (pmf) for the last testing set row. This component is an nSupport-by-ntree matrix.
treeMeans: For each testing set row, each tree's mean according to its empirical pmf. This component is an ntest-by-ntree matrix where ntest is the number of rows in the testing set.
treeVars: For each testing set row, each tree's variance according to its empirical pmf. This component is an ntest-by-ntree matrix.
treePITs: For each testing set row, each tree's probability integral transform (PIT), the empirical cdf evaluated at the realized ytest value. This component is an ntest-by-ntree matrix. If ytest is NULL, NAs are returned.
treeQuantiles: For the last testing set row, each tree's quantiles, one for each element in uQuantiles. This component is an ntree-by-nQuantile matrix, where nQuantile is the number of elements in uQuantiles.
treeFirstPMFValues: For each testing set row, this component outputs the pmf value on the minimum (or first) support point in the forest. For binary classification, this corresponds to the probability that the minimum (or first) support point will occur. This component's dimension is ntest-by-ntree. It is useful for generating calibration curves (stated probabilities in bins vs. their observed frequencies) for binary classification.
bracketingRate: For each testing set row, the bracketing rate from Larrick et al. (2012) is computed as 2*p*(1-p) where p is the fraction of trees' means above the ytest value. If ytest is NULL, NAs are returned.
bracketingRateAllPairs: The average bracketing rate across all testing set rows for each pair of trees. This component is a symmetric ntree-by-ntree matrix. If ytest is NULL, NAs are returned.
trimmedEnsembleCDFs: For each testing set row, the trimmed ensemble's forecast of ytest in the form of a cdf. This component is an ntest-by-(nSupport+1) matrix, where nSupport is the number of unique ytrain values, or support points of the forest.
trimmedEnsemblePMFs: For each testing set row, the trimmed ensemble's pmf. This component is an ntest-by-nSupport matrix.
trimmedEnsembleMeans: For each testing set row, the trimmed ensemble's mean. This component is an ntest vector.
trimmedEnsembleVars: For each testing set row, the trimmed ensemble's variance.
trimmedEnsemblePITs: For each testing set row, the trimmed ensemble's probability integral transform (PIT), the empirical cdf evaluated at the realized ytest value. If ytest is NULL, NAs are returned.
trimmedEnsembleQuantiles: For the last testing set row, the trimmed ensemble's quantiles -- one for each element in uQuantiles.
trimmedEnsembleComponentScores: For the last testing set row, the components of the trimmed ensemble's linear and log quantile scores. If ytest is NULL, NAs are returned.
trimmedEnsembleScores: For each testing set row, the trimmed ensemble's linear and log quantile scores, ranked probability score, and two-moment score. See Jose and Winkler (2009) for a description of the linear and log quantile scores. See Gneiting and Raftery (2007) for a description of the ranked probability score. The two-moment score is the score in Equation 27 of Gneiting and Raftery (2007). If ytest is NULL, NAs are returned.
untrimmedEnsembleCDFs: For each testing set row, the linear opinion pool's, or untrimmed ensemble's, forecast of ytest in the form of a cdf.
untrimmedEnsemblePMFs: For each testing set row, the untrimmed ensemble's pmf.
untrimmedEnsembleMeans: For each testing set row, the untrimmed ensemble's mean.
untrimmedEnsembleVars: For each testing set row, the untrimmed ensemble's variance.
untrimmedEnsemblePITs: For each testing set row, the untrimmed ensemble's probability integral transform (PIT), the empirical cdf evaluated at the realized ytest value. If ytest is NULL, NAs are returned.
untrimmedEnsembleQuantiles: For the last testing set row, the untrimmed ensemble's quantiles -- one for each element in uQuantiles.
untrimmedEnsembleComponentScores: For the last testing set row, the components of the untrimmed ensemble's linear and log quantile scores. If ytest is NULL, NAs are returned.
untrimmedEnsembleScores: For each testing set row, the untrimmed ensemble's linear and log quantile scores, ranked probability score, and two-moment score. If ytest is NULL, NAs are returned.
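Several of the components above follow from simple calculations on a discrete forecast. The sketch below works through them for one hypothetical test row: the pmf, mean, and variance implied by a cdf, the PIT, and the bracketing rate 2*p*(1-p). All numbers are made-up illustrative data, and the leading zero in the cdf mirrors the layout described for treeCDFs, which is assumed here to carry over to the ensemble cdfs.

```r
# Hypothetical support points, ensemble cdf, and realized response.
support <- c(8, 10, 11, 13)
ensembleCDF <- c(0, 0.2, 0.5, 0.8, 1.0)  # leading zero, as in treeCDFs
ytest <- 11.0

# pmf, mean, and variance implied by the cdf.
ensemblePMF <- diff(ensembleCDF)
ensembleMean <- sum(support * ensemblePMF)
ensembleVar <- sum((support - ensembleMean)^2 * ensemblePMF)

# PIT: the cdf evaluated at the realized ytest value.
pit <- sum(ensemblePMF[support <= ytest])

# Bracketing rate: 2*p*(1-p), where p is the fraction of the trees' means
# above the ytest value (hypothetical tree means for this row).
treeMeans <- c(9.2, 10.5, 11.1, 12.4, 13.0, 8.7, 10.9, 12.2)
p <- mean(treeMeans > ytest)
bracketingRate <- 2 * p * (1 - p)
```

The bracketing rate reaches its maximum of 0.5 when half of the trees' means fall on each side of the realized value.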

References

Gneiting T, Raftery AE (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102 359-378.

Jose VRR, Grushka-Cockayne Y, Lichtendahl KC Jr. (2014). Trimmed opinion pools and the crowd's calibration problem. Management Science 60 463-475.

Jose VRR, Winkler RL (2009). Evaluating quantile assessments. Operations Research 57 1287-1297.

Grushka-Cockayne Y, Jose VRR, Lichtendahl KC Jr. (2014). Ensembles of overfit and overconfident forecasts, working paper.

Larrick RP, Mannes AE, Soll JB (2012). The social psychology of the wisdom of crowds. In J.I. Krueger, ed., Frontiers of Social Psychology: Social Judgment and Decision Making. New York: Psychology Press, 227-242.

Examples

# Load the packages and the data
library(mlbench)   # Provides mlbench.friedman1
library(trimTrees)
set.seed(201) # Can be removed; useful for replication
data <- as.data.frame(mlbench.friedman1(500, sd=1))
summary(data)

# Prepare data for trimming
train <- data[1:400, ]
test <- data[401:500, ]
xtrain <- train[,-11]  
ytrain <- train[,11]
xtest <- test[,-11]
ytest <- test[,11]
      
# Option 1. Run trimTrees with responses in testing set.
set.seed(201) # Can be removed; useful for replication
tt1 <- trimTrees(xtrain, ytrain, xtest, ytest, trim=0.15)

# Some outputs from trimTrees: scores, hit rates, PIT densities.
colMeans(tt1$trimmedEnsembleScores)
colMeans(tt1$untrimmedEnsembleScores)
mean(hitRate(tt1$treePITs))
hitRate(tt1$trimmedEnsemblePITs)
hitRate(tt1$untrimmedEnsemblePITs)
hist(tt1$trimmedEnsemblePITs, prob=TRUE)
hist(tt1$untrimmedEnsemblePITs, prob=TRUE)

# Option 2. Run trimTrees without responses in testing set. 
# In this case, scores, PITs, or hit rates will not be available.
set.seed(201) # Can be removed; useful for replication
tt2 <- trimTrees(xtrain, ytrain, xtest, trim=0.15)

# Some outputs from trimTrees: cdfs for last test value.
plot(tt2$trimmedEnsembleCDFs[100,],type="l",col="red",ylab="cdf",xlab="y") 
lines(tt2$untrimmedEnsembleCDFs[100,])
legend(275,0.2,c("trimmed", "untrimmed"),col=c("red","black"),lty = c(1, 1))
title("CDFs of Trimmed and Untrimmed Ensembles")

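# Compare the CDF and moment approaches to trimming the trees.
# The seed is reset before each call so that both runs start from the
# same random state and differ only in the trimming approach.
set.seed(201)
ttCDF <- trimTrees(xtrain, ytrain, xtest, trim=0.15, methodIsCDF=TRUE)
set.seed(201)
ttMA <- trimTrees(xtrain, ytrain, xtest, trim=0.15, methodIsCDF=FALSE)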
plot(ttCDF$trimmedEnsembleCDFs[100,], type="l", col="red", ylab="cdf", xlab="y") 
lines(ttMA$trimmedEnsembleCDFs[100,])
legend(275,0.2,c("CDF Approach", "Moment Approach"), col=c("red","black"),lty = c(1, 1))
title("CDFs of Trimmed Ensembles")