Bagging Classification, Regression and Survival Trees
Bagging for classification, regression and survival trees.
## S3 method for class 'factor'
ipredbagg(y, X=NULL, nbagg=25,
          control=rpart.control(minsplit=2, cp=0, xval=0),
          comb=NULL, coob=FALSE, ns=length(y), keepX=TRUE, ...)
## S3 method for class 'numeric'
ipredbagg(y, X=NULL, nbagg=25, control=rpart.control(xval=0),
          comb=NULL, coob=FALSE, ns=length(y), keepX=TRUE, ...)
## S3 method for class 'Surv'
ipredbagg(y, X=NULL, nbagg=25, control=rpart.control(xval=0),
          comb=NULL, coob=FALSE, ns=dim(y)[1], keepX=TRUE, ...)
bagging(formula, data, subset, na.action=na.rpart, ...)
- y: the response variable: either a factor vector of class labels
(bagging classification trees), a vector of numerical values
(bagging regression trees) or an object of class
Surv (bagging survival trees).
- X: a data frame of predictor variables.
- nbagg: an integer giving the number of bootstrap replications.
- coob: a logical indicating whether an out-of-bag estimate of the
error rate (misclassification error, root mean squared error
or Brier score) should be computed.
- control: options that control details of the rpart algorithm, see
rpart.control. It is wise to set
xval = 0 in order to save computing time. Note that the default values depend on the class of y.
- comb: a list of additional models for model combination, see below
for some examples. Note that the argument
method for double-bagging is no longer there;
comb is much more flexible.
- ns: number of samples to draw from the learning sample. By default,
the usual bootstrap n out of n with replacement is performed. If
ns is smaller than
length(y), subagging (Buehlmann and Yu, 2002), i.e. sampling
ns out of length(y) without replacement, is performed.
- keepX: a logical indicating whether the data frame of predictors
should be returned. Note that the computation of the
out-of-bag estimator requires keepX=TRUE.
- formula: a formula of the form
lhs ~ rhs where
lhs is the response variable and
rhs a set of predictors.
- data: optional data frame containing the variables in the model formula.
- subset: optional vector specifying a subset of observations to be used.
- na.action: a function which indicates what should happen when
the data contain
NAs. Defaults to na.rpart.
- ...: additional parameters passed to ipredbagg or rpart, respectively.
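As an illustration of the ns argument, the following sketch performs subagging on the iris data (the data set and the ns value are chosen for illustration only, not taken from this help page):

```r
library("ipred")

# Subagging (Buehlmann and Yu, 2002): draw ns < length(y) observations
# without replacement for each of the nbagg trees.
data("iris")
y <- iris$Species   # factor response -> classification trees
X <- iris[, 1:4]    # data frame of predictors

# half-sampling without replacement; coob requires keepX=TRUE (the default)
mod <- ipredbagg(y, X, nbagg = 25, ns = 75, coob = TRUE)
print(mod)
```

Since ns = 75 is smaller than length(y) = 150, each tree is grown on a subsample drawn without replacement rather than on a bootstrap sample.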
Bagging for classification and regression trees was suggested by Breiman (1996a, 1998) in order to stabilise trees.
The trees in this function are computed using the implementation in the
rpart package. The generic function ipredbagg
implements methods for different responses. If
y is a factor,
classification trees are constructed. For numerical vectors
y, regression trees are aggregated and if
y is a survival
object, bagging survival trees (Hothorn et al, 2004) is performed.
bagging offers a formula based interface to ipredbagg.
nbagg bootstrap samples are drawn and a tree is constructed
for each of them. There is no general rule when to stop the tree
growing. The size of the
trees can be controlled by the control argument or prune.classbagg. By
default, classification trees are as large as possible whereas regression
trees and survival trees are built with the standard options of
rpart.control. If nbagg=1, one single tree is
computed for the whole learning sample without bootstrapping.
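The control argument can be sketched as follows (the minsplit and maxdepth values are arbitrary choices for illustration, not defaults of this function):

```r
library("ipred")
library("rpart")
data("BostonHousing", package = "mlbench")

# Grow deliberately small regression trees: splitting stops early
# because of the larger minsplit and the depth limit.
small <- bagging(medv ~ ., data = BostonHousing, nbagg = 25,
                 control = rpart.control(minsplit = 40, maxdepth = 3,
                                         cp = 0.01, xval = 0),
                 coob = TRUE)
print(small)

# nbagg = 1: a single tree on the full learning sample, no bootstrapping
single <- bagging(medv ~ ., data = BostonHousing, nbagg = 1)
```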
If coob is TRUE, the out-of-bag sample (Breiman,
1996b) is used to estimate the prediction error
corresponding to class(y). Alternatively, the out-of-bag sample can
be used for model combination; an out-of-bag error rate estimator is not
available in this case. Double-bagging (Hothorn and Lausen,
2003) computes a LDA on the out-of-bag sample and uses the discriminant
variables as additional predictors for the classification trees.
comb is an optional list of lists with the two elements
model and predict.
model is a function with arguments formula and data.
predict is a function with arguments
object, newdata only. If
the estimation of the covariance matrix in
lda fails due to a
limited out-of-bag sample size, one can use slda instead.
See the example section for an example of double-bagging. The methodology is
not limited to a combination with LDA: bundling (Hothorn and Lausen, 2005)
can be used with arbitrary classifiers.
NOTE: Up to ipred version 0.9-0, bagging was performed using a modified version of the original rpart function. Due to interface changes in rpart 3.1-55, the bagging function had to be rewritten. Results of previous versions are not exactly reproducible.
The class of the object returned depends on
class(y): classbagg, regbagg or
survbagg. Each is a list with the elements
- y: the vector of responses.
- X: the data frame of predictors.
- mtrees: multiple trees: a list of length
nbagg containing the trees (and possibly additional objects) for each bootstrap sample.
- OOB: logical whether the out-of-bag estimate should be computed.
- err: if OOB=TRUE, the out-of-bag estimate of misclassification or root mean squared error or the Brier score for censored data.
- comb: logical whether a combination of models was requested.
For each class, methods for the generics prune, print,
summary and predict are available for inspection of the results and prediction, for example:
print.classbagg, summary.classbagg,
predict.classbagg and prune.classbagg for classification problems.
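Assuming the element names above (y, X, mtrees, OOB, err, comb), a fitted object can be inspected like this (a small sketch on the iris data, which is not part of this help page):

```r
library("ipred")
data("iris")

mod <- bagging(Species ~ ., data = iris, nbagg = 10, coob = TRUE)

class(mod)          # "classbagg" for a factor response
length(mod$mtrees)  # one entry per bootstrap sample, here 10
mod$err             # out-of-bag misclassification error (since coob=TRUE)

# the generic methods dispatch on the class
print(mod)
predict(mod, newdata = iris[1:5, ])
```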
Leo Breiman (1996a), Bagging Predictors. Machine Learning 24(2), 123--140.
Leo Breiman (1996b), Out-Of-Bag Estimation. Technical Report http://www.stat.berkeley.edu/~breiman/OOBestimation.pdf.
Leo Breiman (1998), Arcing Classifiers. The Annals of Statistics 26(3), 801--824.
Peter Buehlmann and Bin Yu (2002), Analyzing Bagging. The Annals of Statistics 30(4), 927--961.
Torsten Hothorn and Berthold Lausen (2003), Double-Bagging: Combining classifiers by bootstrap aggregation. Pattern Recognition, 36(6), 1303--1309.
Torsten Hothorn and Berthold Lausen (2005), Bundling Classifiers by Bagging Trees. Computational Statistics & Data Analysis, 49, 1068--1078.
Torsten Hothorn, Berthold Lausen, Axel Benner and Martin Radespiel-Troeger (2004), Bagging Survival Trees. Statistics in Medicine, 23(1), 77--91.
library("MASS")
library("survival")

# Classification: Breast Cancer data
data("BreastCancer", package = "mlbench")

# Test set error bagging (nbagg = 50): 3.7% (Breiman, 1998, Table 5)
mod <- bagging(Class ~ Cl.thickness + Cell.size
                + Cell.shape + Marg.adhesion
                + Epith.c.size + Bare.nuclei
                + Bl.cromatin + Normal.nucleoli
                + Mitoses, data=BreastCancer, coob=TRUE)
print(mod)

# Test set error bagging (nbagg=50): 7.9% (Breiman, 1996a, Table 2)
data("Ionosphere", package = "mlbench")
Ionosphere$V2 <- NULL # constant within groups
bagging(Class ~ ., data=Ionosphere, coob=TRUE)

# Double-Bagging: combine LDA and classification trees
# predict returns the linear discriminant values, i.e. linear combinations
# of the original predictors
comb.lda <- list(list(model=lda, predict=function(obj, newdata)
                                 predict(obj, newdata)$x))

# Note: out-of-bag estimator is not available in this situation, use
# errorest
mod <- bagging(Class ~ ., data=Ionosphere, comb=comb.lda)
predict(mod, Ionosphere[1:10,])

# Regression:
data("BostonHousing", package = "mlbench")

# Test set error (nbagg=25, trees pruned): 3.41 (Breiman, 1996a, Table 8)
mod <- bagging(medv ~ ., data=BostonHousing, coob=TRUE)
print(mod)

library("mlbench")
learn <- as.data.frame(mlbench.friedman1(200))

# Test set error (nbagg=25, trees pruned): 2.47 (Breiman, 1996a, Table 8)
mod <- bagging(y ~ ., data=learn, coob=TRUE)
print(mod)

# Survival data
# Brier score for censored data estimated by
# 10 times 10-fold cross-validation: 0.2 (Hothorn et al, 2004)
data("DLBCL", package = "ipred")
mod <- bagging(Surv(time,cens) ~ MGEc.1 + MGEc.2 + MGEc.3 +
               MGEc.4 + MGEc.5 + MGEc.6 + MGEc.7 + MGEc.8 +
               MGEc.9 + MGEc.10 + IPI, data=DLBCL, coob=TRUE)
print(mod)