max.subtree: Acquire Maximal Subtree Information

Description

Extract maximal subtree information from a RF-SRC object. Used for variable selection and identifying interactions between variables.

Usage

# S3 method for rfsrc
max.subtree(object,
  max.order = 2, sub.order = FALSE, conservative = FALSE, ...)

Arguments

object

An object of class (rfsrc, grow) or (rfsrc, forest)

max.order

Non-negative integer specifying the target number of order depths. Default is to return the first and second order depths. Used to identify predictive variables. Setting max.order=0 returns the first order depth for each variable by tree. A side effect is that conservative is automatically set to FALSE.

sub.order

Set this value to TRUE to return the minimal depth of each variable relative to another variable. Used to identify interrelationship between variables. See details below.

conservative

If TRUE, the threshold value for selecting variables is calculated using a conservative marginal approximation to the minimal depth distribution (the method used in Ishwaran et al. 2010). Otherwise, the minimal depth distribution is the tree-averaged distribution. The latter method tends to give larger threshold values and discovers more variables, especially in high-dimensions.

...

Further arguments passed to or from other methods.

Value

Invisibly, a list with the following components:

order

Order depths for a given variable up to max.order averaged over a tree and the forest. Matrix of dimension pxmax.order. If max.order=0, a matrix of pxntree is returned containing the first order depth for each variable by tree.

count

Averaged number of maximal subtrees, normalized by the size of a tree, for each variable.

nodes.at.depth

Number of non-terminal nodes by depth for each tree.

sub.order

Average minimal depth of a variable relative to another variable. Can be NULL.

threshold

Threshold value (the mean minimal depth) used to select variables.

threshold.1se

Mean minimal depth plus one standard error.

topvars

Character vector of names of the final selected variables.

topvars.1se

Character vector of names of the final selected variables using the 1se threshold rule.

percentile

Minimal depth percentile for each variable.

density

Estimated minimal depth density.

second.order.threshold

Threshold for second order depth.

Details

The maximal subtree for a variable x is the largest subtree whose root node splits on x. Thus, all parent nodes of x's maximal subtree have nodes that split on variables other than x. The largest maximal subtree possible is the root node. In general, however, there can be more than one maximal subtree for a variable. A maximal subtree may also not exist if there are no splits on the variable. See Ishwaran et al. (2010, 2011) for details.

The minimal depth of a maximal subtree (the first order depth) measures predictiveness of a variable x. It equals the shortest distance (the depth) from the root node to the parent node of the maximal subtree (zero is the smallest value possible). The smaller the minimal depth, the more impact x has on prediction. The mean of the minimal depth distribution is used as the threshold value for deciding whether a variable's minimal depth value is small enough for the variable to be classified as strong.

The second order depth is the distance from the root node to the second closest maximal subtree of x. To specify the target order depth, use the max.order option (e.g., setting max.order=2 returns the first and second order depths). Setting max.order=0 returns the first order depth for each variable for each tree.

Set sub.order=TRUE to obtain the minimal depth of a variable relative to another variable. This returns a pxp matrix, where p is the number of variables, and entries [i][j] are the normalized relative minimal depth of a variable [j] within the maximal subtree for variable [i], where normalization adjusts for the size of [i]'s maximal subtree. Entry [i][i] is the normalized minimal depth of i relative to the root node. The matrix should be read by looking across rows (not down columns) and identifies interrelationship between variables. Small [i][j] entries indicate interactions. See find.interaction for related details.

For competing risk data, maximal subtree analyses are unconditional (i.e., they are non-event specific).

References

Ishwaran H., Kogalur U.B., Gorodeski E.Z, Minn A.J. and Lauer M.S. (2010). High-dimensional variable selection for survival data. J. Amer. Statist. Assoc., 105:205-217.

Ishwaran H., Kogalur U.B., Chen X. and Minn A.J. (2011). Random survival forests for high-dimensional data. Statist. Anal. Data Mining, 4:115-132.

Examples

Run this code

# NOT RUN {
## ------------------------------------------------------------
## survival analysis
## first and second order depths for all variables
## ------------------------------------------------------------

data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ . , data = veteran)
v.max <- max.subtree(v.obj)

# first and second order depths
print(round(v.max$order, 3))

# the minimal depth is the first order depth
print(round(v.max$order[, 1], 3))

# strong variables have minimal depth less than or equal
# to the following threshold
print(v.max$threshold)

# this corresponds to the set of variables
print(v.max$topvars)

## ------------------------------------------------------------
## regression analysis
## try different levels of conservativeness
## ------------------------------------------------------------

mtcars.obj <- rfsrc(mpg ~ ., data = mtcars)
max.subtree(mtcars.obj)$topvars
max.subtree(mtcars.obj, conservative = TRUE)$topvars
# }

Run the code above in your browser using DataLab