tune.rfsrc: Tune Random Forest for the optimal mtry and nodesize parameters

Description

Finds the optimal mtry and nodesize tuning parameters for a random forest using out-of-sample error. Applies to all families.

Usage

# S3 method for rfsrc
tune(formula, data,
  mtryStart = ncol(data) / 2,
  nodesizeTry = c(1:9, seq(10, 100, by = 5)), ntreeTry = 100,
  sampsize = function(x){min(x * .632, max(150, x ^ (3/4)))},
  nsplit = 1, stepFactor = 1.25, improve = 1e-3, strikeout = 3, maxIter = 25,
  trace = FALSE, doBest = TRUE, ...)
# S3 method for rfsrc
tune.nodesize(formula, data,
  nodesizeTry = c(1:9, seq(10, 150, by = 5)), ntreeTry = 100,
  sampsize = function(x){min(x * .632, max(150, x ^ (4/5)))},
  nsplit = 1, trace = TRUE, ...)

Arguments

formula: A symbolic formula describing the model to be fit.
data: A data frame containing the response variable and predictor variables.
mtryStart: Initial value of mtry used to start the tuning search.
nodesizeTry: Vector of nodesize values over which tuning is performed.
ntreeTry: Number of trees used during the tuning step.
sampsize: Function specifying the size of the subsample. Can also be a numeric value.
nsplit: Number of random split points considered when splitting a node.
stepFactor: Multiplicative factor used to adjust mtry at each iteration.
improve: Minimum relative improvement in out-of-sample error required to continue the search.
strikeout: Number of consecutive non-improving steps (negative improvement) allowed before stopping the search. Increase to allow a more exhaustive search.
maxIter: Maximum number of iterations allowed for the mtry bisection search.
trace: If TRUE, prints progress during the search.
doBest: If TRUE, fits and returns a forest using the optimal mtry and nodesize.
...: Additional arguments passed to rfsrc.fast.
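
The default sampsize rule for tune can be evaluated on its own to see what subsample size it implies for a given number of rows. The sketch below copies the function from Usage; the helper name sampsize.default and the values of n are illustrative only and are not part of the package.

```r
## default subsample-size rule for tune(), copied from the Usage section
## (sampsize.default is a hypothetical name used only for this illustration)
sampsize.default <- function(x) {min(x * .632, max(150, x ^ (3/4)))}

## evaluate the rule at a few illustrative sample sizes
n <- c(100, 1000, 10000, 100000)
size <- sapply(n, sampsize.default)

## show n next to the implied subsample size
print(data.frame(n = n, sampsize = round(size)))
```

For small n the 0.632 fraction is the binding term, and for large n the power term takes over, so the subsample grows more slowly than the data.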

Author

Hemant Ishwaran and Udaya B. Kogalur

Details

tune returns a list whose results component is a matrix with three columns: the first and second columns contain the nodesize and mtry values evaluated during the tuning process, and the third column contains the corresponding out-of-sample error.

The error is standardized. For multivariate forests, it is averaged over the outcomes; for competing risks, it is averaged over the event types.

If doBest = TRUE, the returned object also contains a forest, accessible as the rf component, fit using the optimal mtry and nodesize values.

All tuning calculations, including the final optimized forest, are performed using the fast forest interface rfsrc.fast, which relies on subsampling. This makes the procedure computationally efficient but approximate. Users seeking more accurate tuning results may wish to:

Increase sampsize, which controls the size of the subsample used for tuning.
Increase ntreeTry, which defaults to 100 for speed.
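
As a minimal sketch of these two adjustments, the call below raises both the subsample size and the number of trees on the wine data. The specific values (a floor of 500 in the sampsize function, ntreeTry = 250, and the short nodesizeTry grid) are arbitrary choices made for illustration, not recommendations from the package authors.

```r
library(randomForestSRC)

## wine data with quality treated as a class label
data(wine, package = "randomForestSRC")
wine$quality <- factor(wine$quality)

## larger subsample and more trees than the defaults (illustrative values);
## a short nodesize grid and doBest = FALSE keep the run time modest
o <- tune(quality ~ ., wine,
          nodesizeTry = c(1, 5, 10),
          sampsize = function(x) {min(x * .632, max(500, x ^ (3/4)))},
          ntreeTry = 250,
          doBest = FALSE)

## first rows of the evaluated nodesize / mtry / error combinations
print(head(o$results))
```

Larger values cost more time per evaluated combination, so it can help to narrow nodesizeTry while increasing them.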

It is also helpful to visualize the out-of-sample error surface as a function of mtry and nodesize using a contour plot (see example below) to identify regions of low error.

The function tune.nodesize performs a simplified search by optimizing only over nodesize.

Examples

# \donttest{
## ------------------------------------------------------------
## White wine classification example
## ------------------------------------------------------------

## load the data
data(wine, package = "randomForestSRC")
wine$quality <- factor(wine$quality)

## set the sample size manually
o <- tune(quality ~ ., wine, sampsize = 100)

## here is the optimized forest 
print(o$rf)

## visualize the nodesize/mtry OOB surface
if (library("interp", logical.return = TRUE)) {

  ## nice little wrapper for plotting results
  plot.tune <- function(o, linear = TRUE) {
    x <- o$results[,1]
    y <- o$results[,2]
    z <- o$results[,3]
    so <- interp(x=x, y=y, z=z, linear = linear)
    idx <- which.min(z)
    x0 <- x[idx]
    y0 <- y[idx]
    filled.contour(x = so$x,
                   y = so$y,
                   z = so$z,
                   xlim = range(so$x, finite = TRUE) + c(-2, 2),
                   ylim = range(so$y, finite = TRUE) + c(-2, 2),
                   color.palette =
                     colorRampPalette(c("yellow", "red")),
                   xlab = "nodesize",
                   ylab = "mtry",
                   main = "error rate for nodesize and mtry",
                   key.title = title(main = "OOB error", cex.main = 1),
                   plot.axes = {axis(1);axis(2);points(x0,y0,pch="x",cex=1,font=2);
                                points(x,y,pch=16,cex=.25)})
  }

  ## plot the surface
  plot.tune(o)

}

## ------------------------------------------------------------
## tuning for class imbalanced data problem
## - see imbalanced function for details
## - use rfq and perf.type = "gmean" 
## ------------------------------------------------------------

data(breast, package = "randomForestSRC")
breast <- na.omit(breast)
o <- tune(status ~ ., data = breast, rfq = TRUE, perf.type = "gmean")
print(o)


## ------------------------------------------------------------
## tune nodesize for competing risks - wihs data
## ------------------------------------------------------------

data(wihs, package = "randomForestSRC")
plot(tune.nodesize(Surv(time, status) ~ ., wihs, trace = TRUE)$err)

# }