# rfsrc

##### Random Forests for Survival, Regression, and Classification (RF-SRC)

A unified treatment of Breiman's random forests (Breiman 2001) for a
variety of data settings, with fast OpenMP parallel processing. Applies
when the y-response is numeric or categorical, yielding Breiman regression
and classification forests, while random survival forests (Ishwaran et
al. 2008, 2012) are grown for right-censored survival and competing
risk data. Multivariate regression and classification responses as
well as mixed regression/classification responses are also handled.
Also includes unsupervised forests and quantile regression forests
(see `quantileReg`). Different splitting rules invoked under
deterministic or random splitting are available for all families.
Variable predictiveness can be assessed using variable importance
(VIMP) measures for single, as well as grouped variables. Missing
data can be imputed on both training and test data; see `impute`.
The forest object, informally referred to as an RF-SRC object,
contains many useful values which can be directly extracted by the
user and/or parsed using additional functions (see the examples below).

This is the main entry point to the randomForestSRC package. Also see
`rfsrcFast` for a fast implementation of `rfsrc`.

For more information about this package and OpenMP, use the command
`package?randomForestSRC`.

- Keywords
- forest

##### Usage

```
rfsrc(formula, data, ntree = 1000,
mtry = NULL, ytry = NULL,
nodesize = NULL, nodedepth = NULL,
splitrule = NULL, nsplit = 10,
importance = c(FALSE, TRUE, "none", "permute", "random", "anti"),
block.size = if (importance == "none" || as.character(importance) == "FALSE") NULL
else 10,
ensemble = c("all", "oob", "inbag"),
bootstrap = c("by.root", "by.node", "none", "by.user"),
samptype = c("swr", "swor"), sampsize = NULL, samp = NULL, membership = FALSE,
na.action = c("na.omit", "na.impute"), nimpute = 1,
ntime, cause,
proximity = FALSE, distance = FALSE, forest.wt = FALSE,
xvar.wt = NULL, yvar.wt = NULL, split.wt = NULL, case.wt = NULL,
forest = TRUE,
var.used = c(FALSE, "all.trees", "by.tree"),
split.depth = c(FALSE, "all.trees", "by.tree"),
seed = NULL,
do.trace = FALSE,
statistics = FALSE,
...)
```
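As a minimal sketch of the call signature above, the following grows a small regression forest. It assumes the randomForestSRC package is installed and uses the built-in `mtcars` data purely for illustration; the choice of `ntree = 100` and the seed value are arbitrary.

```r
library(randomForestSRC)

# Numeric response, so a regression forest is grown.
# seed must be a negative integer (see the seed argument).
fit <- rfsrc(mpg ~ ., data = mtcars, ntree = 100, seed = -1)

print(fit)                # summary of the forest, including OOB error
head(fit$predicted.oob)   # OOB predictions for the training cases
```

Most other arguments can be left at their defaults for a first run; the sections that follow describe when changing them is worthwhile.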

##### Arguments

- formula
A symbolic description of the model to be fit. If missing, unsupervised splitting is implemented.

- data
Data frame containing the y-outcome and x-variables.

- ntree
Number of trees in the forest.

- mtry
Number of variables randomly selected as candidates for splitting a node. The default is `p`/3 for regression, where `p` equals the number of variables. For all other families (including unsupervised settings), the default is sqrt(`p`). Values are always rounded up.

- ytry
For unsupervised forests, sets the number of randomly selected pseudo-responses (see below for more details). The default is `ytry`=1, which selects one pseudo-response.

- nodesize
Forest average number of unique cases (data points) in a terminal node. The defaults are: survival (15), competing risk (15), regression (5), classification (1), mixed outcomes (3), unsupervised (3). It is recommended to experiment with different `nodesize` values.

- nodedepth
Maximum depth to which a tree should be grown. The default behaviour is that this parameter is ignored.

- splitrule
Splitting rule used to grow trees. See below for details.

- nsplit
Non-negative integer value. When zero or NULL, deterministic splitting for an x-variable is in effect. When non-zero, a maximum of nsplit split points are randomly chosen among the possible split points for the x-variable. This significantly increases speed.

- importance
Method for computing variable importance (VIMP). Because VIMP is computationally expensive, the default action is `importance="none"` (VIMP can always be recovered later using the functions `vimp` or `predict`). Setting `importance=TRUE` implements permutation VIMP. See below for more details.

- block.size
Should the cumulative error rate be calculated on every tree? When `NULL`, it will only be calculated on the last tree and the plot of the cumulative error rate will result in a flat line. To view the cumulative error rate on every nth tree, set the value to an integer between `1` and `ntree`. As an intended side effect, if importance is requested, VIMP is calculated in "blocks" of size equal to `block.size`, thus resulting in a useful compromise between ensemble and permutation VIMP. When importance is requested, the default is to use blocks of 10 trees; otherwise the default is `NULL`. See VIMP below for more details.

- ensemble
Specifies the type of ensemble. By default both out-of-bag (OOB) and inbag ensembles are returned. Always use OOB values for inference on the training data.

- bootstrap
Bootstrap protocol. The default is `by.root` which bootstraps the data by sampling with replacement at the root node before growing the tree (for sampling without replacement, see the option `samptype`). If `by.node` is chosen, the data is bootstrapped at each node during the grow process. If `none` is chosen, the data is not bootstrapped at all. If `by.user` is chosen, the bootstrap specified by `samp` is used. It is not possible to return OOB ensembles or prediction error if `by.node` or `none` are in effect.

- samptype
Type of bootstrap when `by.root` is in effect. Choices are `swr` (sampling with replacement, the default action) and `swor` (sampling without replacement).

- sampsize
Requested size of bootstrap when `by.root` is in effect (if missing the default action is the usual bootstrap).

- samp
Bootstrap specification when `by.user` is in effect. This is an array of dim `n x ntree` specifying how many times each record appears inbag in the bootstrap for each tree.

- membership
Should terminal node membership and inbag information be returned?

- na.action
Action taken if the data contains `NA`'s. Possible values are `na.omit` or `na.impute`. The default `na.omit` removes the entire record if even one of its entries is `NA` (for x-variables this applies only to those specifically listed in 'formula'). Selecting `na.impute` imputes the data. See below for more details regarding missing data imputation.

- nimpute
Number of iterations of the missing data algorithm. Performance measures such as out-of-bag (OOB) error rates tend to become optimistic if `nimpute` is greater than 1.

- ntime
Integer value used for survival to constrain ensemble calculations to a grid of `ntime` time points. Alternatively if a vector of values of length greater than one is supplied, it is assumed these are the time points to be used to constrain the calculations (note that the constrained time points used will be the observed event times closest to the user supplied time points). If no value is specified, the default action is to use all observed event times.

- cause
Integer value between 1 and `J` indicating the event of interest for competing risks, where `J` is the number of event types (this option applies only to competing risks and is ignored otherwise). While growing a tree, the splitting of a node is restricted to the event type specified by `cause`. If not specified, the default is to use a composite splitting rule which is an average over the entire set of event types (a democratic approach). Users can also pass a vector of non-negative weights of length `J` if they wish to use a customized composite split statistic (for example, passing a vector of ones reverts to the default composite split statistic). In all instances when `cause` is set incorrectly, splitting reverts to the default. Finally, note that regardless of how `cause` is specified, the returned forest object always provides estimates for all event types.

- proximity
Proximity of cases as measured by the frequency of sharing the same terminal node. This is an `n` x `n` matrix, which can be large. Choices are `inbag`, `oob`, `all`, `TRUE`, or `FALSE`. Setting `proximity = TRUE` is equivalent to `proximity = "inbag"`.

- distance
Distance between cases as measured by the ratio of the sum of the count of edges from each case to their immediate common ancestor node to the sum of the count of edges from each case to the root node. If the cases are co-terminal for a tree, this measure is zero and reduces to 1 - the proximity measure for these cases in a tree. This is an `n` x `n` matrix, which can be large. Choices are `inbag`, `oob`, `all`, `TRUE`, or `FALSE`. Setting `distance = TRUE` is equivalent to `distance = "inbag"`.

- forest.wt
Should the forest weight matrix be calculated? Creates an `n` x `n` matrix which can be used for prediction and constructing customized estimators. Choices are similar to proximity: `inbag`, `oob`, `all`, `TRUE`, or `FALSE`. The default is `FALSE`; setting `forest.wt = TRUE` is equivalent to `forest.wt = "inbag"`.

- xvar.wt
Vector of non-negative weights where entry `k`, after normalizing, is the probability of selecting variable `k` as a candidate for splitting a node. Default is to use uniform weights. Vector must be of dimension `p`, where `p` equals the number of variables, otherwise the default is invoked. It is generally better to use real weights rather than integers. With larger sizes of `p`, the slightly different sampling algorithms used in the two scenarios can result in dramatically different execution times.

- yvar.wt
NOT YET IMPLEMENTED: Vector of non-negative weights where entry `k`, after normalizing, is the probability of selecting response `k` as a candidate for inclusion in the split statistic in unsupervised settings. Default is to use uniform weights. Vector must be of the same length as the number of responses in the data set.

- split.wt
Vector of non-negative weights where entry `k`, after normalizing, is the multiplier by which the split statistic for a variable is adjusted. A large value encourages the node to split on the variable. Default is to use uniform weights. Vector must be of dimension `p`, where `p` equals the number of variables, otherwise the default is invoked.

- case.wt
Vector of non-negative weights where entry `k`, after normalizing, is the probability of selecting case `k` as a candidate for the bootstrap. Default is to use uniform weights. Vector must be of dimension `n`, where `n` equals the number of cases in the processed data set (missing values may be removed, thus altering the original sample size). It is generally better to use real weights rather than integers. With larger sizes of `n`, the slightly different sampling algorithms used in the two scenarios can result in dramatically different execution times. See the example below for the breast data set for an illustration of its use for class imbalanced data.

- forest
Should the forest object be returned? Used for prediction on new data and required by many of the functions used to parse the RF-SRC object. It is recommended not to change the default setting.

- var.used
Return variables used for splitting? Default is `FALSE`. Possible values are `all.trees` which returns a vector where each element records the number of times a split occurred on a variable, and `by.tree` which is a matrix recording the number of times a split occurred on a variable in a specific tree.

- split.depth
Records the minimal depth for each variable. Default is `FALSE`. Possible values are `all.trees` which returns a matrix recording the minimal depth for a variable (columns) for a specific case (rows) averaged over the forest, and `by.tree` which returns a three-dimensional array recording minimal depth for a specific case (first dimension) for a variable (second dimension) for a specific tree (third dimension).

- seed
Negative integer specifying seed for the random number generator.

- do.trace
Number of seconds between updates to the user on approximate time to completion.

- statistics
Should split statistics be returned? Values can be parsed using `stat.split`.

- ...
Further arguments passed to or from other methods.
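The default for `mtry` described above (`p`/3 for regression, sqrt(`p`) otherwise, always rounded up) can be worked through with a few lines of base R. The helper name `default_mtry` is hypothetical and is not part of the package; it only reproduces the documented arithmetic.

```r
# Hypothetical helper (not a package function): reproduces the
# documented default for mtry given the number of x-variables p.
default_mtry <- function(p, regression = FALSE) {
  if (regression) ceiling(p / 3) else ceiling(sqrt(p))
}

default_mtry(10, regression = TRUE)   # 10/3 = 3.33, rounded up to 4
default_mtry(10)                      # sqrt(10) = 3.16, rounded up to 4
default_mtry(30)                      # sqrt(30) = 5.48, rounded up to 6
```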

##### Details

*Families*

Do *not* set the family: the package automatically determines the underlying random forest family from the type of response and the formula supplied. There are eight possible scenarios:

1. Regression forests for continuous responses.

2. Multivariate regression forests for multivariate continuous responses.

3. Classification forests for factor responses.

4. Multivariate classification forests for multivariate factor responses.

5. Multivariate mixed forests for mixed continuous and factor responses.

6. Unsupervised forests when there is no response.

7. Survival forests for right-censored survival settings.

8. Competing risk survival forests for competing risk scenarios.

See below for how to code the response in the two different survival scenarios and for specifying a multivariate forest formula.
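A short sketch of the automatic family detection, assuming the randomForestSRC package is installed and using the built-in `mtcars` and `iris` data for illustration. The same function call yields a different family depending only on the type of the response.

```r
library(randomForestSRC)

# Numeric response: a regression forest is grown.
regr.fit <- rfsrc(mpg ~ ., data = mtcars, ntree = 50)

# Factor response: a classification forest is grown.
class.fit <- rfsrc(Species ~ ., data = iris, ntree = 50)

# The detected family is stored in the returned object.
regr.fit$family
class.fit$family
```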

*Splitrules*

Splitrules are set according to the option `splitrule` as follows:

Regression analysis:

- The default rule is weighted mean-squared error splitting `mse` (Breiman et al. 1984, Chapter 8.4).

- Unweighted and heavy weighted mean-squared error splitting rules can be invoked using splitrules `mse.unwt` and `mse.hvwt`. Generally `mse` works best, but see Ishwaran (2015) for details.

- Quantile regression splitting `quantile.regr` using the "check-loss" function. Requires specifying the target quantiles. See `quantileReg` for further details.

Multivariate regression analysis: For multivariate regression responses, a composite normalized mean-squared error splitting rule is used.

Classification analysis:

- The default rule is Gini index splitting `gini` (Breiman et al. 1984, Chapter 4.3).

- Unweighted and heavy weighted Gini index splitting rules can be invoked using splitrules `gini.unwt` and `gini.hvwt`. Generally `gini` works best, but see Ishwaran (2015) for details.

Multivariate classification analysis: For multivariate classification responses, a composite normalized Gini index splitting rule is used.

Mixed outcomes analysis: When both regression and classification responses are detected, a multivariate normalized composite split rule of mean-squared error and Gini index splitting is invoked. See Tang and Ishwaran (2017) for details.

Unsupervised analysis: In settings where there is no outcome, unsupervised splitting is invoked. In this case, the mixed outcome splitting rule (above) is applied. See Mantero and Ishwaran (2017) for details.

Survival analysis:

- The default rule is `logrank` which implements log-rank splitting (Segal, 1988; Leblanc and Crowley, 1993).

- `logrankscore` implements log-rank score splitting (Hothorn and Lausen, 2003).

Competing risk analysis:

- The default rule is `logrankCR` which implements a modified weighted log-rank splitting rule modeled after Gray's test (Gray, 1988).

- `logrank` implements weighted log-rank splitting where each event type is treated as the event of interest and all other events are treated as censored. The split rule is the weighted value of each of the log-rank statistics, standardized by the variance. For more details see Ishwaran et al. (2014).

Custom splitting: All families except unsupervised are available for user defined custom splitting. Some basic C-programming skills are required. The harness for defining these rules is in `splitCustom.c`. In this file we give examples of how to code rules for regression, classification, survival, and competing risk. Each family can support up to sixteen custom split rules. Specifying `splitrule="custom"` or `splitrule="custom1"` will trigger the first split rule for the family defined by the training data set. Multivariate families will need a custom split rule for both regression and classification. In the examples, we demonstrate how the user is presented with the node specific membership. The task is then to define a split statistic based on that membership. Take note of the instructions in `splitCustom.c` on how to *register* the custom split rules. It is suggested that the existing custom split rules be kept in place for reference and that the user proceed to develop `splitrule="custom2"` and so on. The package must be recompiled and installed for the custom split rules to become available.

Random splitting: For all families, pure random splitting can be invoked by setting `splitrule="random"`. See below for more details regarding randomized splitting rules.

*Allowable data types*

Data types must be real valued, integer, factor or logical; however, all except factors are coerced and treated as if real valued. For ordered x-variable factors, splits are similar to real valued variables. If the x-variable factor is unordered, a split will move a subset of the levels in the parent node to the left daughter, and the complementary subset to the right daughter. All possible complementary pairs are considered and apply to factors with an unlimited number of levels. However, there is an optimization check to ensure that the number of splits attempted is not greater than the number of cases in a node (this internal check will override the `nsplit` value in random splitting mode if `nsplit` is large enough; see below for information about `nsplit`).

*Improving computational speed*

See the function `rfsrcFast` for a fast implementation of `rfsrc`. In general, the key methods for increasing speed are as follows:

- *Randomized splitting rules*. Trees tend to favor splits on continuous variables and factors with large numbers of levels (Loh and Shih, 1997). To mitigate this bias and improve speed, randomized splitting can be invoked using the option `nsplit`. If `nsplit` is set to a non-zero positive integer, then a maximum of `nsplit` split points are chosen randomly for each of the `mtry` variables within a node and only these points are used to determine the best split. Pure random splitting can be invoked by setting `splitrule="random"`. In this case, a variable is randomly selected and the node is split using a random split point (Cutler and Zhao, 2001; Lin and Jeon, 2006). Note that when pure random splitting is in effect, `nsplit` is set to one.

- *Subsampling*. Subsampling can be used to reduce the size of the in-sample data used to grow a tree and therefore can greatly reduce computational load. Subsampling is implemented using options `sampsize` and `samptype`.

- *Unique time points*. For large survival problems, users should consider setting `ntime` to a reasonably small value (such as 50 or 100). This constrains ensemble calculations such as survival functions to a restricted grid of time points of length no more than `ntime` and considerably reduces computational times.

- *Large number of variables*. Use the default setting of `importance="none"` which turns off variable importance (VIMP) calculations and considerably reduces computational times when there are a large number of variables (see below for more details about variable importance). Variable importance calculations can always be recovered later using functions `vimp` or `predict`. Also consider using the function `max.subtree` which calculates minimal depth, a measure of the depth that a variable splits, and yields fast variable selection (Ishwaran et al., 2010).

- *Factors*. For coherence, an immutable map is applied to each factor that ensures that factor levels in the training data set are consistent with the factor levels in any subsequent test data set. This map is applied to each factor before and after the native C library is executed. Because of this, if x-variables are all factors, then computational times may become very long in high dimensional problems. Consider converting factors to real if this is the case.
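The speed options above can be combined in a single call. The following is a sketch only, assuming the randomForestSRC package is installed and using the small built-in `mtcars` data; the particular values (5 split points, subsamples of 20 cases) are arbitrary illustrations, not recommendations.

```r
library(randomForestSRC)

fast.fit <- rfsrc(mpg ~ ., data = mtcars, ntree = 100,
                  nsplit = 5,           # at most 5 random split points per candidate variable
                  samptype = "swor",    # subsample without replacement
                  sampsize = 20,        # each tree is grown on 20 of the 32 cases
                  importance = "none")  # skip VIMP; it can be recovered later with vimp()

print(fast.fit)
```

On a data set this small the savings are negligible; the options matter when `n` or `p` is large.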

*Prediction Error*

Prediction error is calculated using OOB data. Performance is measured in terms of mean-squared-error for regression, and misclassification error for classification. A normalized Brier score (relative to a coin-toss) is also provided upon printing a classification forest.

For survival, prediction error is measured by 1-C, where C is Harrell's (Harrell et al., 1982) concordance index. Prediction error is between 0 and 1, and measures how well the predictor correctly ranks (classifies) two random individuals in terms of survival. A value of 0.5 is no better than random guessing. A value of 0 is perfect.
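To make the 1-C scale concrete, here is a simplified base R sketch of a concordance calculation. The function name `harrell_c` and the toy data are hypothetical, and the tie handling is deliberately basic (pairs with tied times are skipped, tied predictions count one half), so it should not be read as the package's internal implementation.

```r
# Simplified concordance index: a pair is usable only when the case with
# the shorter time experienced the event; it is concordant when that case
# also has the higher predicted risk.
harrell_c <- function(time, status, risk) {
  concordant <- 0
  permissible <- 0
  n <- length(time)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      if (time[i] == time[j]) next          # tied times skipped in this sketch
      short <- if (time[i] < time[j]) i else j
      long  <- if (short == i) j else i
      if (status[short] == 0) next          # shorter time censored: not comparable
      permissible <- permissible + 1
      if (risk[short] > risk[long]) {
        concordant <- concordant + 1
      } else if (risk[short] == risk[long]) {
        concordant <- concordant + 0.5
      }
    }
  }
  concordant / permissible
}

time   <- c(2, 4, 6, 8)
status <- c(1, 1, 0, 1)          # 0 = censored, 1 = event
risk   <- c(0.9, 0.7, 0.4, 0.1)  # higher risk should mean shorter survival

err_good <- 1 - harrell_c(time, status, risk)       # perfect ranking: error 0
err_bad  <- 1 - harrell_c(time, status, rev(risk))  # reversed ranking: error 1
```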

When bootstrapping is `by.node` or `none`, a coherent OOB subset is not available to assess prediction error. Thus, all outputs dependent on this are suppressed. In such cases, prediction error is only available via classical cross-validation (the user will need to use the `predict.rfsrc` function).

*Variable Importance (VIMP)*

To calculate VIMP, use the option `importance`. Classical permutation VIMP is implemented when `permute` or `TRUE` is selected. In this case, OOB cases for a variable *x* are randomly permuted and dropped down a tree. VIMP is calculated by comparing OOB prediction performance for the permuted predictor to the original predictor.

The exact calculation for VIMP depends upon `block.size` (an integer value between `1` and `ntree`) specifying the number of trees in a block used to determine VIMP. When the value is `1`, VIMP is calculated by tree (blocks of size 1). Specifically, the difference between prediction error under the perturbed predictor and the original predictor is calculated for each tree and averaged over the forest. This yields the original Breiman-Cutler VIMP (Breiman 2001).

When `block.size` is set to `ntree`, VIMP is calculated by comparing the error rate for the perturbed OOB forest ensemble (using all trees) to the unperturbed OOB forest ensemble (using all trees). Thus, unlike Breiman-Cutler VIMP, ensemble VIMP does not measure the tree average effect of *x*, but rather its overall forest effect. This is called Ishwaran-Kogalur VIMP (Ishwaran et al. 2008).

A useful compromise between Breiman-Cutler (BC) and Ishwaran-Kogalur (IK) VIMP can be obtained by setting `block.size` to a value between `1` and `ntree`. Smaller values are closer to BC and larger values closer to IK. Smaller values generally give better accuracy; however, computational times will be higher because VIMP is calculated over more blocks.

The option `importance` permits different ways to perturb a variable. If `random` is specified, then instead of permuting *x*, OOB cases are assigned a daughter node randomly whenever a split on *x* is encountered. If `anti` is specified, *x* is assigned to the opposite node whenever a split on *x* is encountered.

Note that the option `none` turns VIMP off entirely.

Note that the function `vimp` provides a friendly user interface for extracting VIMP.

*Multivariate Forests*

Multivariate forests are specified by using the multivariate formula interface. Such a call takes one of two forms:

`rfsrc(Multivar(y1, y2, ..., yd) ~ . , my.data, ...)`

`rfsrc(cbind(y1, y2, ..., yd) ~ . , my.data, ...)`

A multivariate normalized composite splitting rule is used to split nodes. The nature of the outcomes will inform the code as to the type of multivariate forest to be grown; i.e. whether it is real-valued, categorical, or a combination of both (mixed). Note that performance measures (when requested) are returned for all outcomes.
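As an illustration of the multivariate formula interface, the sketch below treats two numeric columns of the built-in `mtcars` data as joint responses. It assumes the randomForestSRC package is installed; the choice of `mpg` and `wt` as responses is arbitrary.

```r
library(randomForestSRC)

# Two continuous responses, so a multivariate regression forest is grown.
mv.fit <- rfsrc(Multivar(mpg, wt) ~ ., data = mtcars, ntree = 100)

mv.fit$family              # detected multivariate family
names(mv.fit$regrOutput)   # one entry per regression response
get.mv.error(mv.fit)       # OOB error for each response
```

Replacing `Multivar(mpg, wt)` with `cbind(mpg, wt)` is the second of the two forms listed above.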

*Unsupervised Forests*

In the case where no y-outcomes are present, unsupervised forests can be invoked by one of two means:

`rfsrc(data = my.data)`

`rfsrc(Unsupervised() ~ ., data = my.data)`

To split a tree node, a random subset of `ytry` variables is selected from the available features, and these variables function as "pseudo-responses" to be split on. Thus, in unsupervised mode, the features take turns acting as both target y-outcomes and x-variables for splitting.

More precisely, as in supervised forests, `mtry` x-variables are randomly selected from the set of `p` features for splitting the node. Then on each `mtry` loop, `ytry` variables are selected from the `p`-1 remaining features to act as the target pseudo-responses to be split on (there are `p`-1 possibilities because we exclude the currently selected x-variable for the current `mtry` loop; also, only pseudo-responses that pass purity checks are used). The split-statistic for splitting the node on the pseudo-responses using the x-variable is calculated. The best split over the `mtry` pairs is used to split the node.

The default value of `ytry` is 1 but can be increased by the `ytry` option. A value larger than 1 initiates multivariate splitting. As illustration, consider the call:

`rfsrc(data = my.data, ytry = 5, mtry = 10)`

This is equivalent to the call:

`rfsrc(Unsupervised(5) ~ ., my.data, mtry = 10)`

In the above, a node will be split by selecting `mtry=10` x-variables, and for each of these a random subset of 5 features will be selected as the multivariate pseudo-responses. The split-statistic is a multivariate normalized composite splitting rule which is applied to each of the 10 multivariate regression problems. The node is split on the variable leading to the best split.

Note that all performance values (error rates, VIMP, prediction) are turned off in unsupervised mode.
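A small runnable version of the two unsupervised call forms, assuming the randomForestSRC package is installed. The four numeric columns of the built-in `iris` data stand in for `my.data`; with `p` = 4 features, `ytry` can be at most 3.

```r
library(randomForestSRC)

x <- iris[, 1:4]   # features only, no response

# Form 1: omit the formula entirely (ytry defaults to 1).
u1 <- rfsrc(data = x, ntree = 50)

# Form 2: Unsupervised(ytry) formula, here with 2 pseudo-responses
# per candidate x-variable and 3 candidate x-variables per node.
u2 <- rfsrc(Unsupervised(2) ~ ., data = x, ntree = 50, mtry = 3)

u1$family
u2$family
```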

*Survival, Competing Risks*

Survival settings require a time and censoring variable which should be identified in the formula as the response using the standard `Surv` formula specification. A typical formula call looks like:

`Surv(my.time, my.status) ~ .`

where `my.time` and `my.status` are the variable names for the event time and status variable in the user's data set.

For survival forests (Ishwaran et al. 2008), the censoring variable must be coded as a non-negative integer with 0 reserved for censoring and (usually) 1=death (event). The event time must be non-negative.

For competing risk forests (Ishwaran et al., 2013), the implementation is similar to survival, but with the following caveats:

- Censoring must be coded as a non-negative integer, where 0 indicates right-censoring, and non-zero values indicate different event types. While 0,1,2,..,J is standard, and recommended, events can be coded non-sequentially, although 0 must always be used for censoring.

- Setting the splitting rule to `logrankscore` will result in a survival analysis in which all events are treated as if they are the same type (indeed, they will be coerced as such).

- Generally, competing risks requires a larger `nodesize` than survival settings.
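The following sketch shows the `Surv` formula specification in a right-censored survival setting. It assumes the randomForestSRC package is installed and uses the `veteran` data set shipped with it, where `time` is the event time and `status` is coded 0 (censored) or 1 (death). Setting `ntime = 50` is an arbitrary illustration of restricting the time grid.

```r
library(randomForestSRC)

data(veteran, package = "randomForestSRC")

surv.fit <- rfsrc(Surv(time, status) ~ ., data = veteran,
                  ntree = 100, ntime = 50)

surv.fit$family                  # detected family
length(surv.fit$time.interest)   # number of time points on the restricted grid
dim(surv.fit$survival.oob)       # cases (rows) by time points (columns)
```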

*Missing data imputation*

Setting `na.action="na.impute"` imputes missing data (both x and y-variables) using a modification of the missing data algorithm of Ishwaran et al. (2008). See also Tang and Ishwaran (2017). Split statistics are calculated using non-missing data only. If a node splits on a variable with missing data, the variable's missing data is imputed by randomly drawing values from non-missing in-bag data. The purpose of this is to make it possible to assign cases to daughter nodes based on the split. Following a node split, imputed data are reset to missing and the process is repeated until terminal nodes are reached. Missing data in terminal nodes are imputed using in-bag non-missing terminal node data. For integer valued variables and censoring indicators, imputation uses a maximal class rule, whereas continuous variables and survival time use a mean rule.

The missing data algorithm can be iterated by setting `nimpute` to a positive integer greater than 1. Only a few iterations are needed to improve accuracy. When the algorithm is iterated, at the completion of each iteration, missing data is imputed using OOB non-missing terminal node data which is then used as input to grow a new forest. Note that when the algorithm is iterated, a side effect is that missing values in the returned objects `xvar`, `yvar` are replaced by imputed values. Further, imputed objects such as `imputed.data` are set to `NULL`. Also, keep in mind that if the algorithm is iterated, performance measures such as error rates and VIMP become optimistically biased.

Finally, records in which all outcome and x-variable information are missing are removed from the forest analysis. Variables having all missing values are also removed.

See the function `impute` for a fast impute interface.
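A minimal sketch of imputation during forest growing, assuming the randomForestSRC package is installed. The built-in `airquality` data is used because it contains missing values in both the response (`Ozone`) and an x-variable (`Solar.R`); a single pass (`nimpute = 1`, the default) is used so that `imputed.data` is returned.

```r
library(randomForestSRC)

sum(is.na(airquality))   # number of missing entries before imputation

imp.fit <- rfsrc(Ozone ~ ., data = airquality, ntree = 100,
                 na.action = "na.impute")

imp.fit$n                    # records retained after processing
head(imp.fit$imputed.indv)   # indices of cases that had missing values
head(imp.fit$imputed.data)   # imputed records: response first, then x-variables
```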

##### Value

An object of class `(rfsrc, grow)` with the following components:

The original call to `rfsrc`.

The family used in the analysis.

Sample size of the data (depends upon `NA`'s, see `na.action`).

Number of trees grown.

Number of variables randomly selected for splitting at each node.

Minimum size of terminal nodes.

Maximum depth allowed for a tree.

Splitting rule used.

Number of randomly selected split points.

y-outcome values.

A character vector of the y-outcome names.

Data frame of x-variables.

A character vector of the x-variable names.

Vector of non-negative weights specifying the probability used to select a variable for splitting a node.

Vector of non-negative weights where entry `k`, after normalizing, is the multiplier by which the split statistic for a covariate is adjusted.

Vector of weights used for the composite competing risk splitting rule.

Number of terminal nodes for each tree in the forest. Vector of length `ntree`. A value of zero indicates a rejected tree (can occur when imputing missing data). Values of one indicate tree stumps.

Proximity matrix recording the frequency with which pairs of data points occur within the same terminal node.

If `forest=TRUE`, the forest object is returned. This object is used for prediction with new test data sets and is required for other R-wrappers.

Forest weight matrix.

Matrix recording terminal node membership where each column contains the node number that a case falls in for that tree.

Splitting rule used.

Matrix recording inbag membership where each column contains the number of times that a case appears in the bootstrap sample for that tree.

Count of the number of times a variable is used in growing the forest.

Vector of indices for cases with missing values.

Data frame of the imputed data. The first column(s) are reserved for the y-responses, after which the x-variables are listed.

Matrix [i][j] or array [i][j][k] recording the minimal depth for variable [j] for case [i], either averaged over the forest, or by tree [k].

Split statistics returned when `statistics=TRUE` which can be parsed using `stat.split`.

Tree cumulative OOB error rate.

Variable importance (VIMP) for each x-variable.

In-bag predicted value.

OOB predicted value.

For classification settings, additionally:

In-bag predicted class labels.

OOB predicted class labels.

For multivariate settings, additionally:

List containing performance values for multivariate regression responses (applies only in multivariate settings).

List containing performance values for multivariate categorical (factor) responses (applies only in multivariate settings).

For survival settings, additionally:

In-bag survival function.

OOB survival function.

In-bag cumulative hazard function (CHF).

OOB CHF.

Ordered unique death times.

Number of deaths.

For competing risks, additionally:

In-bag cause-specific cumulative hazard function (CSCHF) for each event.

OOB CSCHF.

In-bag cumulative incidence function (CIF) for each event.

OOB CIF.

Ordered unique event times.

Number of events.

##### Note

Values returned depend heavily on the family. In particular, `predicted` and `predicted.oob` are the following values calculated using in-bag and OOB data:

For regression, a vector of predicted y-responses.

For classification, a matrix with columns containing the estimated class probability for each class. Performance values and VIMP for classification are reported as a matrix with J+1 columns where J is the number of classes. The first column "all" is the unconditional value for performance or VIMP, while the remaining columns are performance and VIMP conditioned on cases corresponding to that class label.

For survival, a vector of mortality values (Ishwaran et al., 2008) representing estimated risk for each individual calibrated to the scale of the number of events (as a specific example, if *i* has a mortality value of 100, then if all individuals had the same x-values as *i*, we would expect an average of 100 events). Also returned are matrices containing the CHF and survival function. Each row corresponds to an individual's ensemble CHF or survival function evaluated at each time point in `time.interest`.

For competing risks, a matrix with one column for each event recording the expected number of life years lost due to the event specific cause up to the maximum follow up (Ishwaran et al., 2013). Also returned are the cause-specific cumulative hazard function (CSCHF) and the cumulative incidence function (CIF) for each event type. These are encoded as a three-dimensional array, with the third dimension used for the event type, each time point in `time.interest` making up the second dimension (columns), and the case (individual) being the first dimension (rows).

For multivariate families, predicted values (and other performance values such as VIMP and error rates) are stored in the lists `regrOutput` and `clasOutput` which can be parsed using the functions `get.mv.error`, `get.mv.predicted` and `get.mv.vimp`.
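The layouts described in this note can be illustrated with a minimal base R sketch. The objects below are synthetic stand-ins built by hand, not output from `rfsrc`; the names `prob`, `label` and `cif` are hypothetical, and taking the class with the largest probability is shown only as one simple way to read labels off a probability matrix.

```r
## Synthetic stand-ins for the structures described above
## (assumption: these are NOT produced by rfsrc, only shaped like its output)
set.seed(1)
n.case <- 4; n.time <- 3; n.event <- 2

## classification: one row per case, one column per class
prob <- matrix(c(0.7, 0.2, 0.1,
                 0.1, 0.8, 0.1,
                 0.3, 0.3, 0.4,
                 0.5, 0.4, 0.1),
               nrow = n.case, byrow = TRUE,
               dimnames = list(NULL, c("a", "b", "c")))
## label each case by the class with the largest estimated probability
label <- colnames(prob)[max.col(prob, ties.method = "first")]

## competing risks: array indexed as [case, time point, event type]
cif <- array(runif(n.case * n.time * n.event, 0, 0.5),
             dim = c(n.case, n.time, n.event))
## curve for event 1 averaged over all cases (one value per time point)
cif.event1 <- colMeans(cif[, , 1])
## curve for event 2 of the second case only
cif.case2.event2 <- cif[2, , 2]
```

With a fitted competing risk forest, the same indexing would apply to the `cif.oob` array, with `time.interest` supplying the time axis.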

##### References

Breiman L., Friedman J.H., Olshen R.A. and Stone C.J.
*Classification and Regression Trees*, Belmont, California, 1984.

Breiman L. (2001). Random forests, *Machine Learning*, 45:5-32.

Cutler A. and Zhao G. (2001). Pert-Perfect random tree ensembles.
*Comp. Sci. Statist.*, 33: 490-497.

Gray R.J. (1988). A class of k-sample tests for comparing the
cumulative incidence of a competing risk, *Ann. Statist.*,
16: 1141-1154.

Harrell F.E. et al. (1982). Evaluating the yield of medical tests,
*J. Amer. Med. Assoc.*, 247:2543-2546.

Hothorn T. and Lausen B. (2003). On the exact distribution of maximally selected
rank statistics, *Comp. Statist. Data Anal.*, 43:121-137.

Ishwaran H. (2007). Variable importance in binary regression
trees and forests, *Electronic J. Statist.*, 1:519-537.

Ishwaran H. and Kogalur U.B. (2007). Random survival forests for R,
*Rnews*, 7(2):25-31.

Ishwaran H., Kogalur U.B., Blackstone E.H. and Lauer M.S.
(2008). Random survival forests, *Ann. Appl. Statist.*, 2:841-860.

Ishwaran H., Kogalur U.B., Gorodeski E.Z., Minn A.J. and
Lauer M.S. (2010). High-dimensional variable selection for survival
data. *J. Amer. Statist. Assoc.*, 105:205-217.

Ishwaran H., Kogalur U.B., Chen X. and Minn A.J. (2011). Random survival
forests for high-dimensional data. *Stat. Anal. Data Mining*, 4:115-132.

Ishwaran H., Gerds T.A., Kogalur U.B., Moore R.D., Gange S.J. and Lau
B.M. (2014). Random survival forests for competing risks.
*Biostatistics*, 15(4):757-773.

Ishwaran H. and Malley J.D. (2014). Synthetic learning
machines. *BioData Mining*, 7:28.

Ishwaran H. (2015). The effect of splitting on random forests.
*Machine Learning*, 99:75-118.

Ishwaran H. and Lu M. (2018). Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. *Statistics in Medicine* (in press).

Lin Y. and Jeon Y. (2006). Random forests and adaptive nearest
neighbors, *J. Amer. Statist. Assoc.*, 101:578-590.

LeBlanc M. and Crowley J. (1993). Survival trees by goodness of split,
*J. Amer. Statist. Assoc.*, 88:457-467.

Loh W.-Y. and Shih Y.-S. (1997). Split selection methods for
classification trees, *Statist. Sinica*, 7:815-840.

Mantero A. and Ishwaran H. (2017). Unsupervised random forests.

Mogensen U.B., Ishwaran H. and Gerds T.A. (2012). Evaluating random
forests for survival analysis using prediction error curves,
*J. Statist. Software*, 50(11): 1-23.

O'Brien R. and Ishwaran H. (2017). A random forests quantile classifier for class imbalanced data.

Segal M.R. (1988). Regression trees for censored data,
*Biometrics*, 44:35-47.

Tang F. and Ishwaran H. (2017). Random forest missing data
algorithms. *Statistical Analysis and Data Mining*, 10, 363-377.

##### See Also

`plot.competing.risk`, `plot.rfsrc`, `plot.survival`, `plot.variable`, `predict.rfsrc`, `print.rfsrc`, `quantileReg`, `rfsrcFast`, `rfsrcSyn`

##### Examples

```
# NOT RUN {
##------------------------------------------------------------
## Survival analysis
##------------------------------------------------------------
## veteran data
## randomized trial of two treatment regimens for lung cancer
data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran,
               ntree = 100, block.size = 1)
## print and plot the grow object
print(v.obj)
plot(v.obj)
## plot survival curves for first 10 individuals -- direct way
matplot(v.obj$time.interest, 100 * t(v.obj$survival.oob[1:10, ]),
        xlab = "Time", ylab = "Survival", type = "l", lty = 1)
## plot survival curves for first 10 individuals -- use wrapper
plot.survival(v.obj, subset = 1:10)
# }
# NOT RUN {
## Primary biliary cirrhosis (PBC) of the liver
data(pbc, package = "randomForestSRC")
pbc.obj <- rfsrc(Surv(days, status) ~ ., pbc)
print(pbc.obj)
##------------------------------------------------------------
## Example of imputation in survival analysis
##------------------------------------------------------------
data(pbc, package = "randomForestSRC")
pbc.obj2 <- rfsrc(Surv(days, status) ~ ., pbc,
                  nsplit = 10, na.action = "na.impute")
## same as above but we iterate the missing data algorithm
pbc.obj3 <- rfsrc(Surv(days, status) ~ ., pbc,
                  na.action = "na.impute", nimpute = 3)
## fast way to impute the data (no inference is done)
## see impute for more details
pbc.imp <- impute(Surv(days, status) ~ ., pbc, splitrule = "random")
##------------------------------------------------------------
## Compare RF-SRC to Cox regression
## Illustrates C-index and Brier score measures of performance
## assumes "pec" and "survival" libraries are loaded
##------------------------------------------------------------
if (library("survival", logical.return = TRUE)
    && library("pec", logical.return = TRUE)
    && library("prodlim", logical.return = TRUE))
{
  ## prediction function required for pec
  predictSurvProb.rfsrc <- function(object, newdata, times, ...){
    ptemp <- predict(object, newdata = newdata, ...)$survival
    pos <- sindex(jump.times = object$time.interest, eval.times = times)
    p <- cbind(1, ptemp)[, pos + 1]
    if (NROW(p) != NROW(newdata) || NCOL(p) != length(times))
      stop("Prediction failed")
    p
  }
  ## data, formula specifications
  data(pbc, package = "randomForestSRC")
  pbc.na <- na.omit(pbc)  ## remove NA's
  surv.f <- as.formula(Surv(days, status) ~ .)
  pec.f <- as.formula(Hist(days, status) ~ 1)
  ## run cox/rfsrc models
  ## for illustration we use a small number of trees
  cox.obj <- coxph(surv.f, data = pbc.na, x = TRUE)
  rfsrc.obj <- rfsrc(surv.f, pbc.na, ntree = 150)
  ## compute bootstrap cross-validation estimate of expected Brier score
  ## see Mogensen, Ishwaran and Gerds (2012) Journal of Statistical Software
  set.seed(17743)
  prederror.pbc <- pec(list(cox.obj, rfsrc.obj), data = pbc.na, formula = pec.f,
                       splitMethod = "bootcv", B = 50)
  print(prederror.pbc)
  plot(prederror.pbc)
  ## compute out-of-bag C-index for cox regression and compare to rfsrc
  rfsrc.obj <- rfsrc(surv.f, pbc.na)
  cat("out-of-bag Cox Analysis ...", "\n")
  cox.err <- sapply(1:100, function(b) {
    if (b %% 10 == 0) cat("cox bootstrap:", b, "\n")
    ## bootstrap sample; cases not drawn form the out-of-bag test set
    train <- sample(1:nrow(pbc.na), nrow(pbc.na), replace = TRUE)
    cox.obj <- tryCatch({coxph(surv.f, pbc.na[train, ])}, error = function(ex){NULL})
    if (!is.null(cox.obj)) {
      randomForestSRC:::cindex(pbc.na$days[-train],
                               pbc.na$status[-train],
                               predict(cox.obj, pbc.na[-train, ]))
    } else NA
  })
  cat("\n\tOOB error rates\n\n")
  cat("\tRSF : ", rfsrc.obj$err.rate[rfsrc.obj$ntree], "\n")
  cat("\tCox regression : ", mean(cox.err, na.rm = TRUE), "\n")
}
##------------------------------------------------------------
## Competing risks
##------------------------------------------------------------
## WIHS analysis
## cumulative incidence function (CIF) for HAART and AIDS stratified by IDU
data(wihs, package = "randomForestSRC")
wihs.obj <- rfsrc(Surv(time, status) ~ ., wihs, nsplit = 3, ntree = 100)
plot.competing.risk(wihs.obj)
cif <- wihs.obj$cif.oob
Time <- wihs.obj$time.interest
idu <- wihs$idu
cif.haart <- cbind(apply(cif[,,1][idu == 0,], 2, mean),
                   apply(cif[,,1][idu == 1,], 2, mean))
cif.aids <- cbind(apply(cif[,,2][idu == 0,], 2, mean),
                  apply(cif[,,2][idu == 1,], 2, mean))
matplot(Time, cbind(cif.haart, cif.aids), type = "l",
        lty = c(1,2,1,2), col = c(4, 4, 2, 2), lwd = 3,
        ylab = "Cumulative Incidence")
legend("topleft",
       legend = c("HAART (Non-IDU)", "HAART (IDU)", "AIDS (Non-IDU)", "AIDS (IDU)"),
       lty = c(1,2,1,2), col = c(4, 4, 2, 2), lwd = 3, cex = 1.5)
## illustrates the various splitting rules
## illustrates event specific and non-event specific variable selection
if (library("survival", logical.return = TRUE)) {
  ## use the pbc data from the survival package
  ## events are transplant (1) and death (2)
  data(pbc, package = "survival")
  pbc$id <- NULL
  ## modified Gray's weighted log-rank splitting
  pbc.cr <- rfsrc(Surv(time, status) ~ ., pbc)
  ## log-rank event-one specific splitting
  pbc.log1 <- rfsrc(Surv(time, status) ~ ., pbc,
                    splitrule = "logrank", cause = c(1,0), importance = TRUE)
  ## log-rank event-two specific splitting
  pbc.log2 <- rfsrc(Surv(time, status) ~ ., pbc,
                    splitrule = "logrank", cause = c(0,1), importance = TRUE)
  ## extract VIMP from the log-rank forests: event-specific
  ## extract minimal depth from the Gray log-rank forest: non-event specific
  var.perf <- data.frame(md = max.subtree(pbc.cr)$order[, 1],
                         vimp1 = 100 * pbc.log1$importance[ ,1],
                         vimp2 = 100 * pbc.log2$importance[ ,2])
  print(var.perf[order(var.perf$md), ])
}
## ------------------------------------------------------------
## Regression analysis
## ------------------------------------------------------------
## New York air quality measurements
airq.obj <- rfsrc(Ozone ~ ., data = airquality, na.action = "na.impute")
# partial plot of variables (see plot.variable for more details)
plot.variable(airq.obj, partial = TRUE, smooth.lines = TRUE)
## motor trend cars
mtcars.obj <- rfsrc(mpg ~ ., data = mtcars)
## ------------------------------------------------------------
## Classification analysis
## ------------------------------------------------------------
## Edgar Anderson's iris data
iris.obj <- rfsrc(Species ~., data = iris)
## Wisconsin prognostic breast cancer data
data(breast, package = "randomForestSRC")
breast.obj <- rfsrc(status ~ ., data = breast, block.size=1)
plot(breast.obj)
## ------------------------------------------------------------
## Classification analysis with class imbalanced data
## ------------------------------------------------------------
data(breast, package = "randomForestSRC")
breast <- na.omit(breast)
o <- rfsrc(status ~ ., data = breast)
print(o)
## The data is imbalanced so we use balanced random forests
## with undersampling of the majority class
##
## Specifically let n0, n1 be sample sizes for majority, minority
## cases. We sample 2 x n1 cases with majority, minority cases chosen
## with probabilities n1/n, n0/n where n=n0+n1
y <- breast$status
o <- rfsrc(status ~ ., data = breast,
           case.wt = randomForestSRC:::make.wt(y),
           sampsize = randomForestSRC:::make.size(y))
print(o)
## ------------------------------------------------------------
## Unsupervised analysis
## ------------------------------------------------------------
# two equivalent ways to implement unsupervised forests
mtcars.unspv <- rfsrc(Unsupervised() ~., data = mtcars)
mtcars2.unspv <- rfsrc(data = mtcars)
## ------------------------------------------------------------
## Multivariate regression analysis
## ------------------------------------------------------------
mtcars.mreg <- rfsrc(Multivar(mpg, cyl) ~., data = mtcars,
                     block.size=1, importance = TRUE)
## extract error rates, vimp, and OOB predicted values for all targets
err <- get.mv.error(mtcars.mreg)
vmp <- get.mv.vimp(mtcars.mreg)
pred <- get.mv.predicted(mtcars.mreg)
## standardized error and vimp
err.std <- get.mv.error(mtcars.mreg, standardize = TRUE)
vmp.std <- get.mv.vimp(mtcars.mreg, standardize = TRUE)
## ------------------------------------------------------------
## Mixed outcomes analysis
## ------------------------------------------------------------
mtcars.new <- mtcars
mtcars.new$cyl <- factor(mtcars.new$cyl)
mtcars.new$carb <- factor(mtcars.new$carb, ordered = TRUE)
mtcars.mix <- rfsrc(cbind(carb, mpg, cyl) ~., data = mtcars.new, block.size=1)
print(mtcars.mix, outcome.target = "mpg")
print(mtcars.mix, outcome.target = "cyl")
plot(mtcars.mix, outcome.target = "mpg")
plot(mtcars.mix, outcome.target = "cyl")
## ------------------------------------------------------------
## Custom splitting using the pre-coded examples
## ------------------------------------------------------------
## motor trend cars
mtcars.obj <- rfsrc(mpg ~ ., data = mtcars, splitrule = "custom")
## iris analysis
iris.obj <- rfsrc(Species ~., data = iris, splitrule = "custom1")
## WIHS analysis
wihs.obj <- rfsrc(Surv(time, status) ~ ., wihs, nsplit = 3,
                  ntree = 100, splitrule = "custom1")
# }
```
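The Cox comparison in the examples relies on the package's internal concordance routine, which is not reproduced here. As a rough illustration of the idea only, the following base R sketch computes a Harrell-type concordance on a tiny hand-made data set. The function name `concordance.sketch` and the toy data are hypothetical, tied times are simply skipped, and the assumption that an error rate corresponds to one minus the concordance should be checked against the package documentation; the package's own handling of ties and censoring may differ.

```r
## Harrell-type concordance for right-censored data (illustrative sketch only)
## time:   follow-up time
## status: 1 = event observed, 0 = censored
## risk:   predicted risk score, larger meaning an earlier expected event
concordance.sketch <- function(time, status, risk) {
  n <- length(time)
  concordant <- 0
  permissible <- 0
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      ## tied follow-up times are skipped in this sketch
      if (time[i] == time[j]) next
      ## 'a' is the member of the pair with the shorter follow-up time
      a <- if (time[i] < time[j]) i else j
      b <- if (a == i) j else i
      ## if the shorter time is censored, the pair cannot be ordered
      if (status[a] != 1) next
      permissible <- permissible + 1
      if (risk[a] > risk[b]) {
        concordant <- concordant + 1
      } else if (risk[a] == risk[b]) {
        concordant <- concordant + 0.5
      }
    }
  }
  concordant / permissible
}

## toy data (hypothetical values chosen by hand)
time   <- c(5, 8, 12, 20)
status <- c(1, 1, 0, 1)
risk   <- c(0.9, 0.4, 0.6, 0.1)
cidx <- concordance.sketch(time, status, risk)
err  <- 1 - cidx   # assumed convention: error = 1 - concordance
```

In this toy data five pairs are usable and four of them are ordered correctly by the risk score, so `cidx` is 0.8.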

*Documentation reproduced from package randomForestSRC, version 2.8.0, License: GPL (>= 3)*