ipred (version 0.4-0)

bagging: Bagging Classification and Regression Trees

Description

Bootstrap aggregated classification and regression trees.

Usage

## S3 method for class 'default':
bagging(y, X=NULL, nbagg=25, method=c("standard", "double"),
        coob=TRUE, control=rpart.control(minsplit=2, cp=0), ...)
## S3 method for class 'formula':
bagging(formula, data, subset, na.action=na.rpart, ...)

Arguments

y
vector of responses: either numerical (regression) or factors (classification).
X
data frame of predictors.
nbagg
number of bootstrap replications.
method
"standard" for Bagging and "double" for Double-Bagging (classification only).
coob
logical. If TRUE, compute an out-of-bag estimate of the misclassification error (classification) or the mean squared error (regression).
control
options that control details of the rpart algorithm, see rpart.control.
formula
formula describing the model: y ~ x + w + z, where y is the response and x, w, z are predictors; see lm for details.
data
optional data frame containing the variables in the model formula.
subset
optional vector specifying a subset of observations to be used.
na.action
function which indicates what should happen when the data contain NAs. Defaults to na.rpart.
...
additional parameters to methods (e.g. rpart).

Value

An object of class bagging: a list containing the following components.

mt
list of length nbagg containing the rpart trees.
oob
out-of-bag predictions for each observation.
err
out-of-bag error estimate.
nbagg
number of bootstrap samples and trees used.
method
the method used.
ldasc
discriminant functions of the LDA (for Double-Bagging only).
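
For illustration, the components can be accessed directly. A minimal sketch, assuming a fitted classification object mt with coob=TRUE as produced in the Examples below:

length(mt$mt)    # the individual rpart trees (nbagg of them)
head(mt$oob)     # out-of-bag predictions
mt$err           # out-of-bag error estimate
mt$nbagg         # number of bootstrap samples and trees
mt$method        # "standard" or "double"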

Details

Bootstrap aggregated classification and regression trees were suggested by Breiman (1996, 1998) in order to stabilise trees. This function is based on trees computed by rpart. If y is a factor, classification trees are constructed; otherwise, regression trees. nbagg bootstrap samples are drawn and a tree is grown for each of them. If coob is TRUE, the out-of-bag sample, i.e. the observations not contained in a given bootstrap sample, is used to estimate the prediction error. Double-Bagging (Hothorn and Lausen, 2002) computes an LDA on the out-of-bag sample and uses the discriminant variables as additional predictors for the classification trees. Consequently, an out-of-bag estimate of the misclassification error is not available for method="double".
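
To make the procedure concrete, here is a minimal hand-rolled sketch of bagging with rpart, for illustration only (not the implementation used by this function): one tree is grown per bootstrap sample and predictions are aggregated by majority vote.

library("rpart")

X <- as.data.frame(matrix(rnorm(1000), ncol=10))
y <- factor(ifelse(apply(X, 1, mean) > 0, 1, 0))
learn <- cbind(y, X)

nbagg <- 25
trees <- lapply(1:nbagg, function(i) {
  indx <- sample(nrow(learn), replace=TRUE)   # bootstrap sample
  rpart(y ~ ., data=learn[indx, ],
        control=rpart.control(minsplit=2, cp=0))
})

# aggregate by majority vote over the nbagg trees
votes <- sapply(trees, function(tr)
  as.character(predict(tr, newdata=learn, type="class")))
agg <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(agg != y)   # resubstitution error of the ensemble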

print.bagging and summary.bagging are available for the inspection of the results, and predict.bagging can be used for prediction; see the sketch below. Additionally, the function prune.bagging can be used to prune each of the nbagg trees. By default, the trees are not pruned and tree growing is not stopped until the nodes are pure.
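
A minimal sketch of these helpers, assuming a fitted classification object mt and a test data frame X as in the Examples (and that the prune generic dispatches to prune.bagging):

print(mt)                  # short description of the ensemble
summary(mt)                # per-tree details
pmt <- prune(mt)           # prune each of the nbagg trees
predict(pmt, newdata=X)    # predictions from the pruned ensemble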

References

Leo Breiman (1996), Bagging Predictors. Machine Learning 24(2), 123--140.

Leo Breiman (1998), Arcing Classifiers. The Annals of Statistics 26(3), 801--824.

Torsten Hothorn and Berthold Lausen (2002), Double-Bagging: Combining classifiers by bootstrap aggregation. Submitted; preprint available at http://www.mathpreprints.com/math/Preprint/hothorn/20020227.2/1.

Examples

library("ipred")

# simulated two-class problem: the class depends on the row mean of X
X <- as.data.frame(matrix(rnorm(1000), ncol=10))
y <- factor(ifelse(apply(X, 1, mean) > 0, 1, 0))
learn <- cbind(y, X)

mt <- bagging(y ~ ., data=learn, coob=TRUE)
mt

# independent test sample from the same model
X <- as.data.frame(matrix(rnorm(1000), ncol=10))
y <- factor(ifelse(apply(X, 1, mean) > 0, 1, 0))

cls <- predict(mt, newdata=X)

cat("Misclass error est: ", mean(y != cls), "")
cat("Misclass error oob: ", mt$err, "")

# regression: y is the row mean of X plus Gaussian noise
X <- as.data.frame(matrix(rnorm(1000), ncol=10))
y <- apply(X, 1, mean) + rnorm(nrow(X))

learn <- cbind(y, X)

mt <- bagging(y ~., data=learn, coob=TRUE)
mt
 
# independent test sample from the same model
X <- as.data.frame(matrix(rnorm(1000), ncol=10))
y <- apply(X, 1, mean) + rnorm(nrow(X))

haty <- predict(mt, newdata=X)

cat("MSE error: ", mean((haty - y)^2) , "")

# BreastCancer data ships with the mlbench package
library("mlbench")
data("BreastCancer")
BreastCancer$Id <- NULL

# Test set error bagging (nbagg = 50): 3.7% (Breiman, 1998, Table 5)

bagging(Class ~ Cl.thickness + Cell.size 
                + Cell.shape + Marg.adhesion
                + Epith.c.size + Bare.nuclei
                + Bl.cromatin + Normal.nucleoli
                + Mitoses, data=BreastCancer, coob=TRUE)
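
# The examples above all use the default method="standard". A minimal
# sketch of a Double-Bagging fit on simulated data (classification only,
# see Details; no out-of-bag error estimate is computed for
# method="double", so a separate test sample would be needed):
X <- as.data.frame(matrix(rnorm(1000), ncol=10))
y <- factor(ifelse(apply(X, 1, mean) > 0, 1, 0))
learn <- cbind(y, X)

dmt <- bagging(y ~ ., data=learn, method="double")
dmt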
