trimTrees (version 1.2)

trimTrees: Trimmed Opinion Pools of Trees in Random Forest

Description

This function creates point and probability forecasts from the trees in a random forest using Jose et al.'s trimmed opinion pool, a trimmed average of the trees' empirical cumulative distribution functions (cdf). For tuning purposes, the user can input the trimming level used in this trimmed average and then compare the scores of the trimmed and untrimmed opinion pools, or ensembles.

Usage

trimTrees(xtrain, ytrain, xtest, ytest = NULL, ntree = 500,
          mtry = if (!is.null(ytrain) && !is.factor(ytrain))
                   max(floor(ncol(xtrain)/3), 1) else floor(sqrt(ncol(xtrain))),
          nodesize = if (!is.null(ytrain) && !is.factor(ytrain)) 5 else 1,
          trim = 0, trimIsExterior = TRUE,
          uQuantiles = seq(0.05, 0.95, 0.05), methodIsCDF = TRUE)

Arguments

xtrain
A data frame or a matrix of predictors for the training set.
ytrain
A response vector for the training set. If a factor, classification is assumed, otherwise regression is assumed.
xtest
A data frame or a matrix of predictors for the testing set.
ytest
A response vector for the testing set. If no testing set responses are passed (ytest = NULL), probability integral transform (PIT) values and scores are returned as NAs.
ntree
Number of trees to grow.
mtry
Number of variables randomly sampled as candidates at each split.
nodesize
Minimum size of terminal nodes.
trim
The trimming level used in the trimmed average of the trees' empirical cdfs. For the cdf approach, the trimming level is the fraction of cdf values to be trimmed from each end of the ordered vector of cdf values (for each support point) before the average is computed. For the moment approach, the trees' means are computed, ordered, and trimmed. The trimmed opinion pool using the moment approach is an average of the remaining trees.
trimIsExterior
If TRUE, the trimming is done exteriorly, or from the ends of the ordered vector. If FALSE, the trimming is done interiorly, or from the middle of the ordered vector.
uQuantiles
A vector of probabilities, strictly increasing and between 0 and 1. For instance, if uQuantiles=c(0.25,0.75), then the 0.25-quantile and the 0.75-quantile of the trimmed and untrimmed ensembles are scored.
methodIsCDF
If TRUE, the trimmed opinion pool is formed according to the cdf approach in Jose et al. (2014). If FALSE, the moment approach is used.
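
The trim and trimIsExterior arguments can be illustrated on a single vector of cdf values (one value per tree at one support point). The following is a minimal base-R sketch, not the package's internal code: the helper name trimmed_mean_sketch, the floor() rounding rule, and the choice of which middle values to drop for interior trimming are assumptions made for illustration.

```r
# Hypothetical helper, not part of trimTrees: trimmed average of one vector
# of cdf values (one value per tree at a single support point).
trimmed_mean_sketch <- function(values, trim = 0, exterior = TRUE) {
  stopifnot(trim >= 0, trim < 0.5)
  n <- length(values)
  k <- floor(trim * n)            # assumed rounding: values dropped per end
  ordered <- sort(values)
  if (k == 0) return(mean(ordered))
  if (exterior) {
    # Exterior trimming: drop the k smallest and the k largest values.
    keep <- ordered[(k + 1):(n - k)]
  } else {
    # Interior trimming (assumed form): drop 2*k values from the middle.
    lower <- floor((n - 2 * k) / 2)
    keep <- ordered[-((lower + 1):(lower + 2 * k))]
  }
  mean(keep)
}

# Ten trees' cdf values at one support point (made-up numbers).
cdf_values <- c(0.00, 0.05, 0.10, 0.20, 0.50, 0.60, 0.90, 1.00, 1.00, 1.00)
trimmed_mean_sketch(cdf_values, trim = 0.2, exterior = TRUE)
trimmed_mean_sketch(cdf_values, trim = 0.2, exterior = FALSE)
```

Under these assumptions the exterior case coincides with base R's mean(x, trim = 0.2), which also removes a fraction of observations from each end before averaging.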

Value

An object of class trimTrees, which is a list with the following components:
forestSupport
Possible points of support for the trees and ensembles.
treeValues
For the last testing set row, this component outputs each tree's ytrain values (not necessarily unique) that are both inbag and in the xtest's terminal node. This component is an ntrain-by-ntree matrix, where ntrain is the number of rows in the training set.
treeCounts
For the last testing set row, each tree's counts of treeValues, tallied by unique value. This component is an nSupport-by-ntree matrix, where nSupport is the number of unique ytrain values, or support points of the forest.
treeCumCounts
Cumulative tally of treeCounts. This component is an (nSupport+1)-by-ntree matrix.
treeCDFs
Each tree's empirical cdf based on treeCumCounts for the last testing set row only. This component is an (nSupport+1)-by-ntree matrix. Note that the first row in this matrix is all zeros.
treePMFs
Each tree's empirical probability mass function (pmf) for the last testing set row. This component is an nSupport-by-ntree matrix.
treeMeans
For each testing set row, each tree's mean according to its empirical pmf. This component is an ntest-by-ntree matrix where ntest is the number of rows in the testing set.
treeVars
For each testing set row, each tree's variance according to its empirical pmf. This component is an ntest-by-ntree matrix.
treePITs
For each testing set row, each tree's probability integral transform (PIT), the empirical cdf evaluated at the realized ytest value. This component is an ntest-by-ntree matrix. If ytest is NULL, NAs are returned.
treeQuantiles
For the last testing set row, each tree's quantiles -- one for each element in uQuantiles. This component is an ntree-by-nQuantile matrix, where nQuantile is the number of elements in uQuantiles.
treeFirstPMFValues
For each testing set row, this component outputs the pmf value on the minimum (or first) support point in the forest. For binary classification, this corresponds to the probability that the minimum (or first) support point will occur. This component's dimension is ntest-by-ntree. It is useful for generating calibration curves (stated probabilities in bins vs. their observed frequencies) for binary classification.
bracketingRate
For each testing set row, the bracketing rate from Larrick et al. (2012) is computed as 2*p*(1-p) where p is the fraction of trees' means above the ytest value. If ytest is NULL, NAs are returned.
bracketingRateAllPairs
The average bracketing rate across all testing set rows for each pair of trees. This component is a symmetric ntree-by-ntree matrix. If ytest is NULL, NAs are returned.
trimmedEnsembleCDFs
For each testing set row, the trimmed ensemble's forecast of ytest in the form of a cdf. This component is an ntest-by-(nSupport+1) matrix, where nSupport is the number of unique ytrain values, or support points of the forest.
trimmedEnsemblePMFs
For each testing set row, the trimmed ensemble's pmf. This component is an ntest-by-nSupport matrix.
trimmedEnsembleMeans
For each testing set row, the trimmed ensemble's mean. This component is an ntest vector.
trimmedEnsembleVars
For each testing set row, the trimmed ensemble's variance.
trimmedEnsemblePITs
For each testing set row, the trimmed ensemble's probability integral transform (PIT), the empirical cdf evaluated at the realized ytest value. If ytest is NULL, NAs are returned.
trimmedEnsembleQuantiles
For the last testing set row, the trimmed ensemble's quantiles -- one for each element in uQuantiles.
trimmedEnsembleComponentScores
For the last testing set row, the components of the trimmed ensemble's linear and log quantile scores. If ytest is NULL, NAs are returned.
trimmedEnsembleScores
For each testing set row, the trimmed ensemble's linear and log quantile scores, ranked probability score, and two-moment score. See Jose and Winkler (2009) for a description of the linear and log quantile scores. See Gneiting and Raftery (2007) for a description of the ranked probability score. The two-moment score is the score in Equation 27 of Gneiting and Raftery (2007). If ytest is NULL, NAs are returned.
untrimmedEnsembleCDFs
For each testing set row, the linear opinion pool's, or untrimmed ensemble's, forecast of ytest in the form of a cdf.
untrimmedEnsemblePMFs
For each testing set row, the untrimmed ensemble's pmf.
untrimmedEnsembleMeans
For each testing set row, the untrimmed ensemble's mean.
untrimmedEnsembleVars
For each testing set row, the untrimmed ensemble's variance.
untrimmedEnsemblePITs
For each testing set row, the untrimmed ensemble's probability integral transform (PIT), the empirical cdf evaluated at the realized ytest value. If ytest is NULL, NAs are returned.
untrimmedEnsembleQuantiles
For the last testing set row, the untrimmed ensemble's quantiles -- one for each element in uQuantiles.
untrimmedEnsembleComponentScores
For the last testing set row, the components of the untrimmed ensemble's linear and log quantile scores. If ytest is NULL, NAs are returned.
untrimmedEnsembleScores
For each testing set row, the untrimmed ensemble's linear and log quantile scores, ranked probability score, and two-moment score. If ytest is NULL, NAs are returned.
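
To make the relationship between the per-tree components concrete, here is a small base-R sketch that goes from one tree's values to counts, pmf, cdf, mean, variance, and PIT, and then computes the bracketing rate 2*p*(1-p). All inputs are made-up toy numbers and the variable names are hypothetical. The PIT line assumes the cdf is read as P(Y <= y), and the code is not taken from the package.

```r
# Toy inputs (hypothetical): the forest's support points and, for one tree,
# the training responses found in the test row's terminal node.
support <- c(1, 2, 3, 4)
tree_values <- c(2, 2, 3, 4)

# Counts per support point, then the pmf and the cdf.
# The cdf carries a leading zero, so it has length nSupport + 1.
counts <- as.vector(table(factor(tree_values, levels = support)))
pmf <- counts / sum(counts)
cdf <- c(0, cumsum(pmf))

# Mean and variance according to the empirical pmf.
tree_mean <- sum(support * pmf)
tree_var <- sum((support - tree_mean)^2 * pmf)

# PIT: the empirical cdf evaluated at a realized response (made-up value).
y_realized <- 3
pit <- sum(pmf[support <= y_realized])

# Bracketing rate 2*p*(1-p), where p is the fraction of trees' means
# above the realized response (five made-up tree means).
tree_means <- c(2.1, 2.6, 3.4, 3.9, 2.8)
p <- mean(tree_means > y_realized)
bracketing_rate <- 2 * p * (1 - p)

print(list(pmf = pmf, cdf = cdf, mean = tree_mean, var = tree_var,
           pit = pit, bracketing_rate = bracketing_rate))
```

The bracketing rate is largest (0.5) when half of the trees' means fall on each side of the realized value, and it is 0 when all of them fall on the same side.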

References

Gneiting T, Raftery AE (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102 359-378.

Jose VRR, Grushka-Cockayne Y, Lichtendahl KC Jr. (2014). Trimmed opinion pools and the crowd's calibration problem. Management Science 60 463-475.

Jose VRR, Winkler RL (2009). Evaluating quantile assessments. Operations Research 57 1287-1297.

Grushka-Cockayne Y, Jose VRR, Lichtendahl KC Jr. (2014). Ensembles of overfit and overconfident forecasts, working paper.

Larrick RP, Mannes AE, Soll JB (2012). The social psychology of the wisdom of crowds. In J.I. Krueger, ed., Frontiers in Social Psychology: Social Judgment and Decision Making. New York: Psychology Press, 227-242.

See Also

hitRate, cinbag

Examples

# Load the packages; mlbench provides mlbench.friedman1
library(trimTrees)
library(mlbench)

# Load the data
set.seed(201) # Can be removed; useful for replication
data <- as.data.frame(mlbench.friedman1(500, sd=1))
summary(data)

# Prepare data for trimming
train <- data[1:400, ]
test <- data[401:500, ]
xtrain <- train[,-11]  
ytrain <- train[,11]
xtest <- test[,-11]
ytest <- test[,11]
      
# Option 1. Run trimTrees with responses in testing set.
set.seed(201) # Can be removed; useful for replication
tt1 <- trimTrees(xtrain, ytrain, xtest, ytest, trim=0.15)

# Some outputs from trimTrees: scores, hit rates, PIT densities.
colMeans(tt1$trimmedEnsembleScores)
colMeans(tt1$untrimmedEnsembleScores)
mean(hitRate(tt1$treePITs))
hitRate(tt1$trimmedEnsemblePITs)
hitRate(tt1$untrimmedEnsemblePITs)
hist(tt1$trimmedEnsemblePITs, prob=TRUE)
hist(tt1$untrimmedEnsemblePITs, prob=TRUE)

# Option 2. Run trimTrees without responses in testing set. 
# In this case, scores, PITs, or hit rates will not be available.
set.seed(201) # Can be removed; useful for replication
tt2 <- trimTrees(xtrain, ytrain, xtest, trim=0.15)

# Some outputs from trimTrees: cdfs for last test value.
plot(tt2$trimmedEnsembleCDFs[100,],type="l",col="red",ylab="cdf",xlab="y") 
lines(tt2$untrimmedEnsembleCDFs[100,])
legend(275,0.2,c("trimmed", "untrimmed"),col=c("red","black"),lty = c(1, 1))
title("CDFs of Trimmed and Untrimmed Ensembles")

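# Compare the CDF and moment approaches to trimming the trees.
# Reset the seed before each call so both runs start from the same random state.
set.seed(201)
ttCDF <- trimTrees(xtrain, ytrain, xtest, trim=0.15, methodIsCDF=TRUE)
set.seed(201)
ttMA <- trimTrees(xtrain, ytrain, xtest, trim=0.15, methodIsCDF=FALSE)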
plot(ttCDF$trimmedEnsembleCDFs[100,], type="l", col="red", ylab="cdf", xlab="y") 
lines(ttMA$trimmedEnsembleCDFs[100,])
legend(275,0.2,c("CDF Approach", "Moment Approach"), col=c("red","black"),lty = c(1, 1))
title("CDFs of Trimmed Ensembles")
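
The Description mentions comparing scores across trimming levels for tuning. One possible way to organize that comparison is sketched below. The helper compare_trim_levels and the candidate levels 0, 0.15, and 0.30 are hypothetical. Only trimTrees(), its trim argument, and the trimmedEnsembleScores component come from this page. The trimTrees call runs only if the package is installed and the xtrain, ytrain, xtest, and ytest objects from the examples above exist.

```r
# Hypothetical helper, not part of trimTrees: average each score column for
# every trimming level and stack the results, one row per level.
# score_fun takes one trimming level and returns a matrix of scores
# with one row per testing set row.
compare_trim_levels <- function(trims, score_fun) {
  rows <- lapply(trims, function(tr) colMeans(score_fun(tr)))
  out <- do.call(rbind, rows)
  rownames(out) <- paste0("trim=", trims)
  out
}

# Guard: run only when trimTrees is installed and the example data exist.
have_data <- all(sapply(c("xtrain", "ytrain", "xtest", "ytest"), exists))
if (requireNamespace("trimTrees", quietly = TRUE) && have_data) {
  fit_scores <- function(tr) {
    set.seed(201)  # same seed for every level, for replication
    fit <- trimTrees::trimTrees(xtrain, ytrain, xtest, ytest, trim = tr)
    fit$trimmedEnsembleScores
  }
  print(compare_trim_levels(c(0, 0.15, 0.30), fit_scores))
}
```

Each row of the result holds the mean of every score column at one trimming level, so the levels can be read side by side. See the references listed for trimmedEnsembleScores for how each score is defined and oriented.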