randomForestSRC (version 3.4.2)

tune.rfsrc: Tune Random Forest for optimal mtry and nodesize

Description

Finds the optimal mtry and nodesize for a random forest using out-of-bag (OOB) error. Two search strategies are supported: a grid-based search and a golden-section search with noise control. Works for all response families supported by rfsrc.fast.

Usage

# S3 method for rfsrc
tune(formula, data,
  mtry.start = ncol(data) / 2,
  nodesize.try = c(1:9, seq(10, 100, by = 5)), ntree.try = 100,
  sampsize = function(x) { min(x * .632, max(150, x^(3/4))) },
  nsplit = 1, step.factor = 1.25, improve = 1e-3, strikeout = 3, max.iter = 25,
  method = c("grid", "golden"),
  final.window = 5, reps.initial = 2, reps.final = 3,
  trace = FALSE, do.best = TRUE, seed = NULL, ...)

# S3 method for rfsrc
tune.nodesize(formula, data,
  nodesize.try = c(1:9, seq(10, 150, by = 5)), ntree.try = 100,
  sampsize = function(x) { min(x * .632, max(150, x^(4/5))) },
  nsplit = 1, method = c("grid", "golden"),
  final.window = 5, reps.initial = 2, reps.final = 3,
  max.iter = 50, trace = TRUE, seed = NULL, ...)

Value

For tune:

  • results: matrix with columns nodesize, mtry, err.

  • optimal: named numeric vector c(nodesize = ..., mtry = ...).

  • rf: fitted forest at the optimum if do.best = TRUE.

For tune.nodesize:

  • nsize.opt: optimal nodesize.

  • err: data frame with columns nodesize and err.
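
As a quick illustration (assuming o is the object returned by tune and on the object returned by tune.nodesize), the components are ordinary list elements:

## illustrative accessors; o from tune(), on from tune.nodesize()
o$optimal["nodesize"]    ## tuned nodesize
o$optimal["mtry"]        ## tuned mtry
head(o$results)          ## evaluated (nodesize, mtry, err) triples
o$rf                     ## refit forest when do.best = TRUE
on$nsize.opt             ## tuned nodesize
head(on$err)             ## (nodesize, err) profile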

Arguments

formula

A model formula.

data

A data frame with response and predictors.

mtry.start

Initial mtry for tune.

nodesize.try

Candidate nodesize values. Only values less than or equal to floor(sampsize(n)/2) are used.

ntree.try

Number of trees grown at each tuning evaluation.

sampsize

Function or numeric value giving the per-tree subsample size. During tuning, a single numeric size ssize is computed and passed to rfsrc.fast. If a vector is supplied (e.g., class-specific sizes), its total is used for ssize.

nsplit

Number of random split points to consider at each node.

step.factor

Multiplicative step-out factor over mtry for grid search in tune.

improve

Minimum relative improvement required to continue a search step in tune.

strikeout

Maximum number of consecutive non-improving steps allowed in tune.

max.iter

Maximum number of iterations for the step-out search in tune or the coordinate loop when method = "golden".

method

Search strategy: "grid" (default) or "golden".

final.window

For golden search, the terminal bracket width for the one-dimensional line search.

reps.initial

Replicates averaged at interior evaluations during golden iterations.

reps.final

Replicates averaged for each candidate during the final local sweep in golden search.

trace

If TRUE, prints progress.

do.best

If TRUE, tune fits and returns a forest at the optimal pair.

seed

Optional integer for reproducible tuning. The holdout split (when used) and all tuning fits become deterministic for a given seed.

...

Additional arguments passed to rfsrc.fast. Arguments that control tuning itself (perf.type, forest, save.memory, ntree, mtry, nodesize, sampsize, nsplit) are managed internally.
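
For example, an rfsrc argument such as na.action can be forwarded through the dots (a sketch using the wine data from the Examples; the reserved arguments listed above are set internally and should not be supplied here):

o <- tune(quality ~ ., wine, na.action = "na.impute")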

Author

Hemant Ishwaran and Udaya B. Kogalur

Details

Error estimate. If 2 * ssize < n, a disjoint holdout of size ssize is used for evaluation; otherwise OOB error is used.

Subsample used during tuning. Both functions derive a single integer ssize from sampsize and pass it to rfsrc.fast for all tuning fits. This improves stability and comparability across candidates. When do.best = TRUE in tune, the final forest is fit with the user-supplied sampsize exactly as provided.
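
The following sketch illustrates (it is not the package internals) how a single ssize and the evaluation rule above combine for a hypothetical sample size n:

## illustrative only: derive ssize and the evaluation rule for n observations
n <- 3000
sampsize <- function(x) { min(x * .632, max(150, x^(3/4))) }
ssize <- if (is.function(sampsize)) round(sampsize(n)) else round(sum(sampsize))
use.holdout <- (2 * ssize) < n  ## TRUE: disjoint holdout of size ssize; FALSE: OOB error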

Grid search. tune performs a step-out search over mtry for each nodesize in nodesize.try, using step.factor, improve, strikeout, and max.iter. tune.nodesize evaluates the supplied nodesize.try grid directly.
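
The sketch below conveys the flavor of one step-out pass over mtry for a fixed nodesize; eval.err and step.out are hypothetical names and the code is illustrative only, not the package implementation:

## illustrative step-out over mtry (one direction shown); eval.err(nodesize, mtry)
## is a hypothetical function returning the tuning error
step.out <- function(eval.err, nodesize, mtry.start, p,
                     step.factor = 1.25, improve = 1e-3,
                     strikeout = 3, max.iter = 25) {
  mtry <- max(1, min(p, round(mtry.start)))
  best <- eval.err(nodesize, mtry)
  strikes <- 0
  for (i in seq_len(max.iter)) {
    mtry.new <- max(1, min(p, round(mtry * step.factor)))
    if (mtry.new == mtry) break
    err.new <- eval.err(nodesize, mtry.new)
    if ((best - err.new) / max(best, 1e-10) > improve) {
      best <- err.new; mtry <- mtry.new; strikes <- 0   ## accept and keep stepping
    } else {
      strikes <- strikes + 1                            ## non-improving step
      if (strikes >= strikeout) break
    }
  }
  c(mtry = mtry, err = best)
}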

Golden search. Uses a guarded golden-section line search with noise control. For each one-dimensional search (over nodesize or mtry), the routine probes a small left-anchor grid 1:9, iterates golden shrinkage until the bracket width is at most final.window, then runs a short local sweep with reps.final replicates. In tune the searches over nodesize and mtry alternate in a simple coordinate loop, with improve and strikeout as stopping controls.
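
A minimal sketch of the one-dimensional search (golden.search and f are hypothetical names; the code is illustrative, not the package implementation):

## illustrative golden-section line search over an integer parameter with
## replicate averaging to control noise; f(x) is a hypothetical error function
golden.search <- function(f, lower, upper, final.window = 5,
                          reps.initial = 2, reps.final = 3) {
  phi <- (sqrt(5) - 1) / 2
  avg <- function(x, reps) mean(replicate(reps, f(x)))
  a <- lower; b <- upper
  x1 <- round(b - phi * (b - a)); x2 <- round(a + phi * (b - a))
  f1 <- avg(x1, reps.initial); f2 <- avg(x2, reps.initial)
  while ((b - a) > final.window) {
    if (f1 <= f2) {          ## minimum bracketed in [a, x2]
      b <- x2; x2 <- x1; f2 <- f1
      x1 <- round(b - phi * (b - a)); f1 <- avg(x1, reps.initial)
    } else {                 ## minimum bracketed in [x1, b]
      a <- x1; x1 <- x2; f1 <- f2
      x2 <- round(a + phi * (b - a)); f2 <- avg(x2, reps.initial)
    }
  }
  cand <- a:b                ## final local sweep over the remaining bracket
  cand[which.min(sapply(cand, avg, reps = reps.final))]
}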

See Also

rfsrc.fast

Examples

# \donttest{
## ------------------------------------------------------------
## White wine classification example
## ------------------------------------------------------------
data(wine, package = "randomForestSRC")
wine$quality <- factor(wine$quality)

## Fixed seed makes tuning reproducible
set.seed(1)

## Full tuner over nodesize and mtry (grid)
o1 <- tune(quality ~ ., wine, sampsize = 100, method = "grid")
print(o1$optimal)

## Golden search alternative
o2 <- tune(quality ~ ., wine, sampsize = 100, method = "golden",
           reps.initial = 2, reps.final = 3, seed = 1)
print(o2$optimal)

## visualize the nodesize/mtry surface
if (library("interp", logical.return = TRUE)) {

  plot.tune <- function(o, linear = TRUE) {
    x <- o$results[, 1]
    y <- o$results[, 2]
    z <- o$results[, 3]
    so <- interp(x = x, y = y, z = z, linear = linear)
    idx <- which.min(z)
    x0 <- x[idx]; y0 <- y[idx]
    filled.contour(x = so$x, y = so$y, z = so$z,
                   xlim = range(so$x, finite = TRUE) + c(-2, 2),
                   ylim = range(so$y, finite = TRUE) + c(-2, 2),
                   color.palette = colorRampPalette(c("yellow", "red")),
                   xlab = "nodesize", ylab = "mtry",
                   main = "error rate for nodesize and mtry",
                   key.title = title(main = "OOB error", cex.main = 1),
                   plot.axes = {
                     axis(1); axis(2)
                     points(x0, y0, pch = "x", cex = 1, font = 2)
                     points(x, y, pch = 16, cex = .25)
                   })
  }

  plot.tune(o1)
  plot.tune(o2)
}

## ------------------------------------------------------------
## nodesize only: grid vs golden
## ------------------------------------------------------------
o3 <- tune.nodesize(quality ~ ., wine, sampsize = 100, method = "grid",
                    trace = TRUE, seed = 1)
o4 <- tune.nodesize(quality ~ ., wine, sampsize = 100, method = "golden",
                    reps.initial = 2, reps.final = 3, trace = TRUE, seed = 1)
plot(o3$err, type = "s", xlab = "nodesize", ylab = "error")

## ------------------------------------------------------------
## Tuning for class imbalance (rfq with geometric mean performance)
## ------------------------------------------------------------
data(breast, package = "randomForestSRC")
breast <- na.omit(breast)
o5 <- tune(status ~ ., data = breast, rfq = TRUE, perf.type = "gmean",
           method = "golden", seed = 1)
print(o5$optimal)

## ------------------------------------------------------------
## Competing risks example (nodesize only)
## ------------------------------------------------------------
data(wihs, package = "randomForestSRC")
plot(tune.nodesize(Surv(time, status) ~ ., wihs, trace = TRUE)$err, type = "s")
# }