randomForestSRC (version 3.4.1)

tune.rfsrc: Tune Random Forest for the optimal mtry and nodesize parameters

Description

Finds the optimal mtry and nodesize tuning parameters for a random forest using out-of-sample error. Applies to all families.

Usage

# S3 method for rfsrc
tune(formula, data,
  mtryStart = ncol(data) / 2,
  nodesizeTry = c(1:9, seq(10, 100, by = 5)), ntreeTry = 100,
  sampsize = function(x){min(x * .632, max(150, x ^ (3/4)))},
  nsplit = 1, stepFactor = 1.25, improve = 1e-3, strikeout = 3, maxIter = 25,
  trace = FALSE, doBest = TRUE, ...)

# S3 method for rfsrc
tune.nodesize(formula, data,
  nodesizeTry = c(1:9, seq(10, 150, by = 5)), ntreeTry = 100,
  sampsize = function(x){min(x * .632, max(150, x ^ (4/5)))},
  nsplit = 1, trace = TRUE, ...)

Arguments

formula

A symbolic formula describing the model to be fit.

data

A data frame containing the response variable and predictor variables.

mtryStart

Initial value of mtry used to start the tuning search.

nodesizeTry

Vector of nodesize values over which tuning is performed.

ntreeTry

Number of trees used during the tuning step.

sampsize

Function specifying the size of the subsample. Can also be a numeric value.

nsplit

Number of random split points considered when splitting a node.

stepFactor

Multiplicative factor used to adjust mtry at each iteration.

improve

Minimum relative improvement in out-of-sample error required to continue the search.

strikeout

Number of consecutive non-improving steps (negative improvement) allowed before stopping the search. Increase to allow a more exhaustive search.

maxIter

Maximum number of iterations allowed for the mtry bisection search.

trace

If TRUE, prints progress during the search.

doBest

If TRUE, fits and returns a forest using the optimal mtry and nodesize.

...

Additional arguments passed to rfsrc.fast.
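
The interplay of stepFactor, improve, strikeout, and maxIter can be illustrated with a toy step search. This is only a sketch under stated assumptions, not the package's internal algorithm: toy_error is a hypothetical stand-in for the out-of-sample error, step_search is a hypothetical helper, and the exact update and stopping rules used by tune may differ.

```r
## Hypothetical stand-in for an out-of-sample error curve (not forest output)
toy_error <- function(mtry) 0.20 + (mtry - 6)^2 / 100

## Hypothetical helper: step mtry down, then up, from a starting value.
## Assumed rules (illustration only): each step divides or multiplies mtry
## by step_factor; a step counts as improving when the relative gain over
## the best error so far exceeds improve; the search in one direction stops
## after strikeout consecutive non-improving steps or max_iter steps.
step_search <- function(mtry_start, p, step_factor = 1.25, improve = 1e-3,
                        strikeout = 3, max_iter = 25) {
  best_mtry <- mtry_start
  best_err <- toy_error(mtry_start)
  for (direction in c("down", "up")) {
    mtry_cur <- mtry_start
    strikes <- 0
    for (iter in seq_len(max_iter)) {
      mtry_new <- if (direction == "down") {
        max(1, floor(mtry_cur / step_factor))
      } else {
        min(p, ceiling(mtry_cur * step_factor))
      }
      if (mtry_new == mtry_cur) break          # reached a boundary
      err_new <- toy_error(mtry_new)
      if ((best_err - err_new) / best_err > improve) {
        best_mtry <- mtry_new                  # meaningful relative gain
        best_err <- err_new
        strikes <- 0
      } else {
        strikes <- strikes + 1                 # non-improving step
        if (strikes >= strikeout) break
      }
      mtry_cur <- mtry_new
    }
  }
  c(mtry = best_mtry, err = best_err)
}

## Start at mtry = 10 with p = 20 candidate predictors
res <- step_search(mtry_start = 10, p = 20)
print(res)
```

In this toy setting, a larger strikeout lets the search step past a few poor candidates before giving up, while a larger improve makes it stop sooner on flat stretches of the error curve.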

Author

Hemant Ishwaran and Udaya B. Kogalur

Details

tune returns a three-column matrix (accessed as o$results in the examples below): columns one and two hold the nodesize and mtry values evaluated during tuning, and column three holds the corresponding out-of-sample error.

The error is standardized. For multivariate forests, it is averaged over the outcomes; for competing risks, it is averaged over the event types.

If doBest = TRUE, the function also returns a forest object fit using the optimal mtry and nodesize values.
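
As a minimal sketch of how such a three-column matrix might be read, the snippet below picks the row with the smallest error. The matrix is made up for illustration (its values and the column name err are assumptions, not output from tune); with a real result, the examples suggest the matrix is available as o$results.

```r
## Hypothetical tuning results: nodesize, mtry, and error (made-up values)
results <- cbind(nodesize = c(1, 1, 5, 5, 10, 10),
                 mtry     = c(2, 4, 2, 4, 2, 4),
                 err      = c(0.31, 0.27, 0.29, 0.24, 0.33, 0.30))

## Row with the smallest error in the third column
best <- results[which.min(results[, 3]), ]
print(best)
```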

All tuning calculations, including the final optimized forest, are performed using the fast forest interface rfsrc.fast, which relies on subsampling. This makes the procedure computationally efficient but approximate. Users seeking more accurate tuning results can trade speed for accuracy by:

  • Increasing sampsize, which controls the size of the subsample used for tuning.

  • Increasing ntreeTry, which defaults to 100 for speed.
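
To get a feel for how small the default subsample can be, the sketch below evaluates the default sampsize rule from the Usage section of tune at a few sample sizes. The sample sizes are arbitrary illustrations, and sampsize_default is simply a local name for the copied rule.

```r
## Default subsample-size rule, copied from the Usage section of tune
sampsize_default <- function(x) min(x * .632, max(150, x^(3/4)))

## Illustrative sample sizes (hypothetical, not tied to any data set)
n <- c(100, 200, 1000, 10000, 100000)
sizes <- data.frame(n = n, sampsize = sapply(n, sampsize_default))
print(sizes)
```

Under this rule the subsample is 63.2% of the data for small n (below roughly 237 observations), is held at 150 up to roughly 800 observations, and grows like n^(3/4) beyond that. On large data sets the default subsample is therefore a small fraction of the data, which is why supplying a larger sampsize (a number or a function of n) can tighten the tuning results at the cost of speed.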

It is also helpful to visualize the out-of-sample error surface as a function of mtry and nodesize using a contour plot (see example below) to identify regions of low error.

The function tune.nodesize performs a simplified search by optimizing only over nodesize.

See Also

rfsrc.fast

Examples

# \donttest{
## ------------------------------------------------------------
## White wine classification example
## ------------------------------------------------------------

## load the data
data(wine, package = "randomForestSRC")
wine$quality <- factor(wine$quality)

## set the sample size manually
o <- tune(quality ~ ., wine, sampsize = 100)

## here is the optimized forest 
print(o$rf)

## visualize the nodesize/mtry OOB surface
if (library("interp", logical.return = TRUE)) {

  ## nice little wrapper for plotting results
  plot.tune <- function(o, linear = TRUE) {
    x <- o$results[,1]
    y <- o$results[,2]
    z <- o$results[,3]
    so <- interp(x=x, y=y, z=z, linear = linear)
    idx <- which.min(z)
    x0 <- x[idx]
    y0 <- y[idx]
    filled.contour(x = so$x,
                   y = so$y,
                   z = so$z,
                   xlim = range(so$x, finite = TRUE) + c(-2, 2),
                   ylim = range(so$y, finite = TRUE) + c(-2, 2),
                   color.palette =
                     colorRampPalette(c("yellow", "red")),
                   xlab = "nodesize",
                   ylab = "mtry",
                   main = "error rate for nodesize and mtry",
                   key.title = title(main = "OOB error", cex.main = 1),
                   plot.axes = {axis(1);axis(2);points(x0,y0,pch="x",cex=1,font=2);
                                points(x,y,pch=16,cex=.25)})
  }

  ## plot the surface
  plot.tune(o)

}

## ------------------------------------------------------------
## tuning for class imbalanced data problem
## - see imbalanced function for details
## - use rfq and perf.type = "gmean" 
## ------------------------------------------------------------

data(breast, package = "randomForestSRC")
breast <- na.omit(breast)
o <- tune(status ~ ., data = breast, rfq = TRUE, perf.type = "gmean")
print(o)


## ------------------------------------------------------------
## tune nodesize for competing risk - wihs data 
## ------------------------------------------------------------

data(wihs, package = "randomForestSRC")
plot(tune.nodesize(Surv(time, status) ~ ., wihs, trace = TRUE)$err)

# }
