cforest: Conditional Random Forests

Description

An implementation of the random forest and bagging ensemble algorithms utilizing conditional inference trees as base learners.

Usage

cforest(formula, data, weights, subset, na.action = na.pass, 
        control = ctree_control(teststat = "quad",
                                testtype = "Univ", mincriterion = 0, ...), 
        ytrafo = NULL, scores = NULL, ntree = 500L, 
        perturb = list(replace = FALSE, fraction = 0.632), 
        mtry = ceiling(sqrt(nvar)), applyfun = NULL, cores = NULL, ...)
## S3 method for class 'cforest':
predict(object, newdata = NULL, 
        type = c("response", "prob", "weights", "node"),
        OOB = FALSE, FUN = NULL, simplify = TRUE, ...)

Arguments

formula

a symbolic description of the model to be fit.

data

a data frame containing the variables in the model.

subset

an optional vector specifying a subset of observations to be used in the fitting process.

weights

an optional vector of weights to be used in the fitting process. Non-negative integer valued weights are allowed as well as non-negative real weights. Observations are sampled (with or without replacement) according to probabilities

na.action

a function which indicates what should happen when the data contain missing values.

control

a list with control parameters, see ctree_control. The default values correspond to those of the default values used by cfo

ytrafo

an optional named list of functions to be applied to the response variable(s) before testing their association with the explanatory variables. Note that this transformation is only performed once for the roo

scores

an optional named list of scores to be attached to ordered factors.

ntree

Number of trees to grow for the forest.

perturb

a list with arguments replace and fraction determining the type of resampling: replace = TRUE refers to the n-out-of-n bootstrap and replace = FALSE to sample splitting. fraction
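
The two resampling schemes can be illustrated in base R. This is only a sketch of the idea, not the package's internal code: it assumes that fraction is the proportion of observations drawn without replacement and that the resulting count is rounded down, and the variable names are made up for the illustration.

```r
set.seed(1)                       # reproducibility of this sketch only
n <- nrow(cars)                   # 50 observations in the built-in cars data
fraction <- 0.632                 # default shown in the Usage section

## replace = TRUE: n-out-of-n bootstrap; case weights are draw counts
idx_boot <- sample.int(n, size = n, replace = TRUE)
w_boot <- tabulate(idx_boot, nbins = n)

## replace = FALSE: sample splitting; a fraction of the observations is
## drawn without replacement (rounding down is an assumption here)
idx_sub <- sample.int(n, size = floor(fraction * n), replace = FALSE)
w_sub <- tabulate(idx_sub, nbins = n)

## observations with weight zero are out-of-bag for the corresponding tree
oob_sub <- which(w_sub == 0L)
```

In either scheme, each tree sees a vector of non-negative integer case weights, and the zero-weight observations are the ones available for out-of-bag prediction.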

mtry

number of input variables randomly sampled as candidates
    at each node for random forest like algorithms. Bagging, as special case
    of a random forest without random input variable sampling, can
    be performed by setting mtry either e

applyfun

an optional lapply-style function with arguments
                  function(X, FUN, ...). It is used for computing the variable selection criterion.
                  The default is to use t

cores

numeric. If set to an integer the applyfun is set to
               mclapply with the desired number of cores.
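
As a hedged illustration of what an lapply-style applyfun might look like, the following sketch wraps mclapply from the parallel package. The name my_applyfun and the choice of two cores are hypothetical, and the fallback to lapply on non-Unix platforms is an addition of this sketch, not documented behaviour of cforest.

```r
library("parallel")

## lapply-style signature function(X, FUN, ...): forks two workers on
## Unix-alikes and falls back to plain lapply elsewhere (e.g. Windows)
my_applyfun <- function(X, FUN, ...) {
    if (.Platform$OS.type == "unix") {
        mclapply(X, FUN, ..., mc.cores = 2L)
    } else {
        lapply(X, FUN, ...)
    }
}

## behaves like lapply on a toy input
squares <- my_applyfun(1:3, function(i) i^2)

## it could then be passed on, for example:
## cf <- cforest(dist ~ speed, data = cars, applyfun = my_applyfun)
```

Setting cores instead is the shorter route when plain mclapply with a given number of cores is all that is needed.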

object

An object as returned by cforest

newdata

An optional data frame containing test data.

type

a character string denoting the type of predicted value
          returned, ignored when argument FUN is given.  For
          "response", the mean of a numeric response, the predicted
          class for a categorical response o

OOB

a logical defining out-of-bag predictions (only if newdata = NULL).

FUN

a function to compute summary statistics. Predictions for each node have to be 
    computed based on arguments (y, w) where y is the response and 
    w are case weights.
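
To make the (y, w) contract concrete, here is a minimal sketch of two summary functions for a numeric response. The names wmean and wquant are hypothetical, and the quantile version assumes non-negative integer weights because it expands the response with rep.

```r
## weighted mean of the response
wmean <- function(y, w) sum(w * y) / sum(w)

## weighted quantiles via replication (assumes integer weights)
wquant <- function(y, w) quantile(rep(y, w), probs = c(0.1, 0.5, 0.9))

## toy response values and case weights
y <- c(2, 4, 10)
w <- c(1, 2, 1)

m <- wmean(y, w)     # (2 + 8 + 10) / 4
q <- wquant(y, w)    # quantiles of c(2, 4, 4, 10)
```

Either function could be supplied as FUN in predict, in which case the type argument is ignored.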

simplify

a logical indicating whether the resulting list
                   of predictions should be converted to a suitable
                   vector or matrix (if possible).

...

additional arguments.

Value

An object of class cforest.

Details

This implementation of the random forest (and bagging) algorithm differs
  from the reference implementation in randomForest
  with respect to the base learners used and the aggregation scheme applied.
  Conditional inference trees, see ctree, are fitted to each
  of the ntree perturbed samples of the learning sample. Most of the hyperparameters in
  ctree_control regulate the construction of the conditional inference trees.
  Hyperparameters you might want to change are:
  1. The number of randomly preselected variables mtry, which defaults
     to the square root of the number of input variables, rounded up.
  2. The number of trees ntree. Use more trees if you have more variables.
  3. The depth of the trees, regulated by mincriterion. Usually unstopped and unpruned
     trees are used in random forests. To grow large trees, set mincriterion to a small value.
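
The three settings listed above might be adjusted as in the following sketch. It assumes the partykit package is installed and uses the built-in mtcars data purely for illustration; the particular values of ntree and mtry are arbitrary, not recommendations.

```r
## default number of candidate variables per node, following the Usage
## section: mtry = ceiling(sqrt(nvar))
nvar <- ncol(mtcars) - 1L           # 10 predictors when mpg is the response
default_mtry <- ceiling(sqrt(nvar))

if (requireNamespace("partykit", quietly = TRUE)) {
    library("partykit")
    cf_tuned <- cforest(mpg ~ ., data = mtcars,
        ntree = 100L,               # fewer trees than the default 500
        mtry = 5L,                  # more candidate variables than the default
        control = ctree_control(teststat = "quad", testtype = "Univ",
                                mincriterion = 0))   # small value: large trees
    n_trees <- length(cf_tuned$nodes)   # one fitted tree per list element
}
```

Setting mtry to the number of input variables would remove the random variable sampling altogether, which corresponds to bagging.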

  The aggregation scheme works by averaging observation weights extracted
  from each of the ntree trees and NOT by averaging predictions directly
  as in randomForest.
  See Hothorn et al. (2004) and Meinshausen (2006) for a description.
  Predictions can be computed using predict. For observations
  with zero weights, predictions are computed from the fitted tree 
  when newdata = NULL. 
  Ensembles of conditional inference trees have not yet been extensively   
  tested, so this routine is meant for the expert user only and its current
  state is rather experimental. However, there are some things available 
  in cforest that can't be done with randomForest,  
  for example fitting forests to censored response variables (see Hothorn et al., 2004, 2006a) or to
  multivariate and ordered responses. Using the rich partykit infrastructure allows 
  additional functionality in cforest, such as parallel tree growing and probabilistic 
  forecasting (for example via quantile regression forests). Also plotting of single trees from
  a forest is much easier now.
  Unlike cforest in package party, this cforest is entirely written in R, which
  makes customisation much easier at the price of longer computing times. However, trees
  can be grown in parallel with this R-only implementation, which renders speed less of an issue.
  Note that the default values are different from those used in package party, most
  importantly the default for mtry is now data-dependent. predict(, type = "node") replaces
  the where function and predict(, type = "prob") the 
  treeresponse function.
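
The difference between aggregating observation weights and averaging per-tree predictions, as described earlier in this section, can be seen in a toy base R calculation. The weight matrix below is invented for illustration and does not come from a fitted forest.

```r
## responses of four learning-sample observations
y <- c(1, 2, 3, 10)

## hypothetical weights for ONE new observation: row t holds, for tree t,
## the case weights of the learning-sample observations that fall into
## the same terminal node as the new observation
w_tree <- rbind(c(1, 1, 0, 0),
                c(1, 1, 1, 0),
                c(0, 0, 1, 1))

## weight aggregation: combine weights over trees, then one weighted mean
w_forest <- colSums(w_tree)
pred_weights <- sum(w_forest * y) / sum(w_forest)

## direct aggregation: compute a mean per tree, then average the means
pred_tree <- apply(w_tree, 1, function(w) sum(w * y) / sum(w))
pred_average <- mean(pred_tree)
```

The two results differ whenever the terminal nodes hold different numbers of observations. The aggregated weights can also be passed to any other summary of (y, w), which is what makes quantile and density predictions possible.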
  
  Moreover, when predictors vary in their scale of measurement or number
  of categories, variable selection and computation of variable importance are biased
  in favor of variables with many potential cutpoints in randomForest,
  while in cforest unbiased trees and an adequate resampling scheme
  are used by default. See Hothorn et al. (2006b) and Strobl et al. (2007)
  as well as Strobl et al. (2009).

References

Leo Breiman (2001). Random Forests. Machine Learning, 45(1), 5–32.

Torsten Hothorn, Berthold Lausen, Axel Benner and Martin Radespiel-Troeger (2004). Bagging Survival Trees. Statistics in Medicine, 23(1), 77–91.

Torsten Hothorn, Peter Bühlmann, Sandrine Dudoit, Annette Molinaro and Mark J. van der Laan (2006a). Survival Ensembles. Biostatistics, 7(3), 355–373.

Torsten Hothorn, Kurt Hornik, Achim Zeileis (2006b). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651–674. Preprint available from http://statmath.wu-wien.ac.at/~zeileis/papers/Hothorn+Hornik+Zeileis-2006.pdf

Nicolai Meinshausen (2006). Quantile Regression Forests. Journal of Machine Learning Research, 7, 983–999.

Carolin Strobl, Anne-Laure Boulesteix, Achim Zeileis, Torsten Hothorn (2007). Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution. BMC Bioinformatics, 8, 25. http://www.biomedcentral.com/1471-2105/8/25

Carolin Strobl, James Malley, Gerhard Tutz (2009). An Introduction to Recursive Partitioning: Rationale, Application, and Characteristics of Classification and Regression Trees, Bagging, and Random Forests. Psychological Methods, 14(4), 323–348.

Examples

## basic example: conditional inference forest for cars data
cf <- cforest(dist ~ speed, data = cars)

## prediction of fitted mean and visualization
nd <- data.frame(speed = 4:25)
nd$mean  <- predict(cf, newdata = nd, type = "response")
plot(dist ~ speed, data = cars)
lines(mean ~ speed, data = nd)

## predict quantiles (aka quantile regression forest)
myquantile <- function(y, w) quantile(rep(y, w), probs = c(0.1, 0.5, 0.9))
p <- predict(cf, newdata = nd, type = "response", FUN = myquantile)
colnames(p) <- c("lower", "median", "upper")
nd <- cbind(nd, p)

## visualization with conditional (on speed) prediction intervals
plot(dist ~ speed, data = cars, type = "n")
with(nd, polygon(c(speed, rev(speed)), c(lower, rev(upper)),
  col = "lightgray", border = "transparent"))
points(dist ~ speed, data = cars)
lines(mean ~ speed, data = nd, lwd = 1.5)
lines(median ~ speed, data = nd, lty = 2, lwd = 1.5)
legend("topleft", c("mean", "median", "10% - 90% quantile"),
  lwd = c(1.5, 1.5, 10), lty = c(1, 2, 1),
  col = c("black", "black", "lightgray"), bty = "n")

### we may also use predicted conditional (on speed) densities
mydensity <- function (y, w) approxfun(density(y, weights = w/sum(w))[1:2], rule = 2)
pd <- predict(cf, newdata = nd, type = "response", FUN = mydensity)

## visualization in heatmap (instead of scatterplot)
## with fitted curves as above
dist <- -10:150
dens <- t(sapply(seq_along(pd), function(i) pd[[i]](dist)))
image(nd$speed, dist, dens, xlab = "speed", col = rev(gray.colors(9)))
lines(mean ~ speed, data = nd, lwd = 1.5)
lines(median ~ speed, data = nd, lty = 2, lwd = 1.5)
lines(lower ~ speed, data = nd, lty = 2)
lines(upper ~ speed, data = nd, lty = 2)

### honest (i.e., out-of-bag) cross-classification of
### true vs. predicted classes
data("mammoexp", package = "TH.data")
table(mammoexp$ME, predict(cforest(ME ~ ., data = mammoexp, ntree = 50),
                           OOB = TRUE, type = "response"))

### fit forest to censored response
if (require("TH.data") && require("survival")) {

    data("GBSG2", package = "TH.data")
    bst <- cforest(Surv(time, cens) ~ ., data = GBSG2, ntree = 50)
 
    ### estimate conditional Kaplan-Meier curves
    print(predict(bst, newdata = GBSG2[1:2,], OOB = TRUE, type = "prob"))

    print(bst$nodes[[1]])
}