
randomForestSRC (version 1.6.1)

rfsrc: Random Forests for Survival, Regression and Classification (RF-SRC)

Description

A random forest (Breiman, 2001) is grown using user-supplied training data. It applies when the response (outcome) is numeric, categorical (factor), or right-censored (including competing risk), yielding regression, classification, and survival forests, respectively. The resulting forest, informally referred to as an RF-SRC object, contains many useful values which can be directly extracted by the user and/or parsed using additional functions (see the examples below). This is the main entry point to the randomForestSRC package.

The package implements OpenMP shared-memory parallel programming. However, the default installation will only execute serially. Users should consult the randomForestSRC-package help file for details on installing the OpenMP version of the package. The help file is readily accessible via the command package?randomForestSRC.

Usage

rfsrc(formula, data, ntree = 1000,
  bootstrap = c("by.root", "by.node", "none"),
  mtry = NULL,
  nodesize = NULL,
  nodedepth = NULL,
  splitrule = NULL,
  nsplit = 0,
  split.null = FALSE,
  importance = c("permute", "random", "permute.ensemble",
                 "random.ensemble", "none"),
  na.action = c("na.omit", "na.impute", "na.random"),
  nimpute = 1,
  ntime,
  cause,
  xvar.wt = NULL,  
  proximity = FALSE,
  forest = TRUE,  
  var.used = c(FALSE, "all.trees", "by.tree"),
  split.depth = c(FALSE, "all.trees", "by.tree"),
  seed = NULL,
  do.trace = FALSE,
  membership = TRUE,
  statistics = FALSE,
  fast.restore = FALSE,
  ...)

Arguments

formula
A symbolic description of the model to be fit.
data
Data frame containing the y-outcome and x-variables in the model.
ntree
Number of trees in the forest.
bootstrap
Bootstrap protocol. The default is by.root, which bootstraps the data by sampling with replacement at the root node before growing the tree. If by.node is chosen, the data is bootstrapped at each node during the growing process. If none is chosen, the data is not bootstrapped at all.
mtry
Number of variables randomly selected as candidates for each node split. The default is sqrt(p), except for regression families where p/3 is used, where p equals the number of variables. Values are always rounded up.
nodesize
Minimum number of unique cases (data points) in a terminal node. The defaults are: survival (3), competing risk (6), regression (5), classification (1), mixed outcomes (3).
nodedepth
Maximum depth to which a tree should be grown. The default behaviour is that this parameter is ignored.
splitrule
Splitting rule used to grow trees. Available rules are family specific: mean-squared error splitting (mse) for regression, Gini index splitting (gini) for classification, log-rank splitting (logrank) and log-rank score splitting (logrankscore) for survival, and a modified weighted log-rank splitting rule based on Gray's test (logrankCR) for competing risks. Pure random splitting (random) is available for all families. The default is the first rule listed for each family. See the details below.
nsplit
Non-negative integer value. If non-zero, the specified tree splitting rule is randomized which can significantly increase speed.
split.null
Set this value to TRUE when testing the null hypothesis. In particular, this assumes there is no relationship between y and x.
importance
Method for computing variable importance (VIMP). Calculating VIMP can be computationally expensive when the number of variables is high, thus if VIMP is not needed consider setting importance="none". See the details below for more information.
na.action
Action taken if the data contains NA's. Possible values are na.omit, na.impute or na.random. The default na.omit removes the entire record if even one of its entries is NA. The actions na.impute and na.random impute missing data; see the details below.
nimpute
Number of iterations of the missing data algorithm. Performance measures such as out-of-bag (OOB) error rates tend to become optimistic if nimpute is greater than 1.
ntime
Integer value used for survival families to constrain ensemble calculations to a grid of no more than ntime time points. Alternatively, if a vector of values of length greater than one is supplied, it is assumed these are the time points to be used to constrain the calculations.
cause
Integer value between 1 and J indicating the event of interest for competing risks, where J is the number of event types (this option applies only to competing risks and is ignored otherwise). While growing a tree, the splitting rule is computed using only the event of interest, or, if a vector of J non-negative weights is supplied, a weighted combination across the event types (see the competing risk examples below, where cause = c(1,0) yields event-one specific splitting).
xvar.wt
Vector of non-negative weights where entry k, after normalizing, is the probability of selecting variable k as a candidate for splitting a node. Default is to use uniform weights. Vector must be of dimension p, where p equals the number of variables.
proximity
Should the proximity between observations be calculated? Creates an n x n matrix, which can be large. Choices are "inbag", "oob", "all", TRUE, or FALSE.
forest
Should the forest object be returned? Used for prediction on new data and required by many of the functions used to parse the RF-SRC object.
var.used
Return variables used for splitting? Default is FALSE. Possible values are all.trees and by.tree.
split.depth
Records the minimal depth for each variable. Default is FALSE. Possible values are all.trees and by.tree. Used for variable selection.
seed
Negative integer specifying seed for the random number generator.
do.trace
Logical. Should trace output be enabled on each iteration? Default is FALSE.
membership
Should terminal node membership and inbag information be returned?
statistics
Should split statistics be returned? Values can be parsed using stat.split.
fast.restore
Should forest restoration be fast? Be aware that this option uses significantly more RAM and may not be suitable for all platforms depending on the size of the data set being analyzed.
...
Further arguments passed to or from other methods.

Value

  An object of class (rfsrc, grow) with the following components:

  • call: The original call to rfsrc.
  • formula: The formula used in the call.
  • family: The family used in the analysis.
  • n: Sample size of the data (depends upon NA's; see na.action).
  • ntree: Number of trees grown.
  • mtry: Number of variables randomly selected for splitting at each node.
  • nodesize: Minimum size of terminal nodes.
  • nodedepth: Maximum depth allowed for a tree.
  • splitrule: Splitting rule used.
  • nsplit: Number of randomly selected split points.
  • yvar: y-outcome values.
  • yvar.names: A character vector of the y-outcome names.
  • xvar: Data frame of x-variables.
  • xvar.names: A character vector of the x-variable names.
  • xvar.wt: Vector of non-negative weights specifying the probability used to select a variable for splitting a node.
  • split.wt: Vector of non-negative weights where entry k, after normalizing, is the multiplier by which the split statistic for a covariate is adjusted.
  • leaf.count: Number of terminal nodes for each tree in the forest. Vector of length ntree. A value of zero indicates a rejected tree (can occur when imputing missing data); a value of one indicates a tree stump.
  • proximity: Proximity matrix recording the frequency with which pairs of data points occur within the same terminal node.
  • forest: If forest=TRUE, the forest object is returned. This object is used for prediction with new test data sets and is required for other R-wrappers.
  • membership: Matrix recording terminal node membership where each column contains the node number that a case falls in for that tree.
  • inbag: Matrix recording inbag membership where each column contains the number of times that a case appears in the bootstrap sample for that tree.
  • var.used: Count of the number of times a variable is used in growing the forest.
  • imputed.indv: Vector of indices for cases with missing values.
  • imputed.data: Data frame of the imputed data. The first column(s) are reserved for the y-responses, after which the x-variables are listed.
  • split.depth: Matrix [i][j] or array [i][j][k] recording the minimal depth for variable [j] for case [i], either averaged over the forest, or by tree [k].
  • node.stats: Split statistics returned when statistics=TRUE, which can be parsed using stat.split.
  • err.rate: Tree cumulative OOB error rate.
  • importance: Variable importance (VIMP) for each x-variable.
  • predicted: In-bag predicted value.
  • predicted.oob: OOB predicted value.

  For classification settings, additionally the following:

  • class: In-bag predicted class labels.
  • class.oob: OOB predicted class labels.

  For survival settings, additionally the following:

  • survival: In-bag survival function.
  • survival.oob: OOB survival function.
  • chf: In-bag cumulative hazard function (CHF).
  • chf.oob: OOB CHF.
  • time.interest: Ordered unique death times.
  • ndead: Number of deaths.

  For competing risks, additionally the following:

  • chf: In-bag cause-specific cumulative hazard function (CSCHF) for each event.
  • chf.oob: OOB CSCHF.
  • cif: In-bag cumulative incidence function (CIF) for each event.
  • cif.oob: OOB CIF.
  • time.interest: Ordered unique event times.
  • ndead: Number of events.
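
  For orientation, a minimal sketch showing how a few of these components are extracted from a grow object; it assumes the veteran data used in the Examples below:

    data(veteran, package = "randomForestSRC")
    v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100)
    v.obj$err.rate[v.obj$ntree]   ## cumulative OOB error at the final tree
    head(v.obj$predicted.oob)     ## OOB predicted values
    head(v.obj$time.interest)     ## ordered unique death times
    dim(v.obj$survival.oob)       ## n x length(time.interest) OOB survival functions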

Details

  1. Families: There are four families of random forests: regr, class, surv, and surv-CR.
    • Regression forests (regr) for continuous responses.
    • Classification forests (class) for factor responses.
    • Survival forests (surv) for right-censored survival settings.
    • Competing risk survival forests (surv-CR) for competing risk scenarios.
    See below for how to code the response in the two different survival scenarios.
  2. Allowable data types and issues related to factors: Data types must be real valued, integer, factor or logical -- however, all except factors are coerced and treated as if real valued. For ordered factors, splits are similar to real valued variables. If the factor is unordered, a split will move a subset of the levels in the parent node to the left daughter, and the complementary subset to the right daughter. All possible complementary pairs are considered, and this applies to factors with an unlimited number of levels. However, there is an optimization check to ensure that the number of splits attempted is not greater than the number of cases in a node (this internal check will override the nsplit value in random splitting mode if nsplit is large enough; see below for information about nsplit). A brief sketch follows.
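
    A minimal sketch of the factor handling above; coercing celltype in the veteran data (shipped with the package) to a factor is purely illustrative:

      data(veteran, package = "randomForestSRC")
      v <- veteran
      v$celltype <- factor(v$celltype)   ## unordered factor: complementary-pair splits
      v.fac <- rfsrc(Surv(time, status) ~ ., data = v, ntree = 100)
      print(v.fac)
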
  3. Randomized Splitting Rules: A random version of a splitting rule can be invoked using nsplit. If nsplit is set to a non-zero positive integer, then a maximum of nsplit split points are chosen randomly for each of the mtry variables within a node (this is in contrast to non-random splitting, i.e. nsplit=0, where all possible split points for each of the mtry variables are considered). The splitting rule is applied to the random split points and the node is split on the variable and random split point yielding the best value (as measured by the splitting rule).

    Pure random splitting can be invoked by setting splitrule="random". For each node, a variable is randomly selected and the node is split using a random split point (Cutler and Zhao, 2001; Lin and Jeon, 2006).

    Trees tend to favor splits on continuous variables (Loh and Shih, 1997), so it is good practice to use the nsplit option when the data contains a mix of continuous and discrete variables. Using a reasonably small value mitigates bias and may not compromise accuracy. A sketch contrasting the two randomized modes follows.
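
    A sketch of the two randomized modes, using the veteran data from the Examples below:

      data(veteran, package = "randomForestSRC")
      ## at most 10 random split points per candidate variable
      o.nsplit <- rfsrc(Surv(time, status) ~ ., data = veteran, nsplit = 10)
      ## pure random splitting: random variable, random split point
      o.random <- rfsrc(Surv(time, status) ~ ., data = veteran, splitrule = "random")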

  4. Fast Splitting: The value of nsplit has a significant impact on the time taken to grow a forest. When non-random splitting is in effect (nsplit=0), iterating over each split point can sometimes be CPU intensive. However, when nsplit > 0, or when pure random splitting is in effect, CPU times are drastically reduced.
  5. Variable Importance (VIMP): The option importance allows four distinct ways to calculate VIMP. The default, permute, returns Breiman-Cutler permutation VIMP as described in Breiman (2001). For each tree, the prediction error on the out-of-bag (OOB) data is recorded. Then for a given variable x, OOB cases are randomly permuted in x and the prediction error is recorded again. The VIMP for x is defined as the difference between the perturbed and unperturbed error rate, averaged over all trees. If random is used, then x is not permuted; rather, an OOB case is assigned to a daughter node randomly whenever a split on x is encountered in the in-bag tree. The OOB error rate under this random assignment is compared to the OOB error rate without it, and the VIMP is the difference averaged over the forest. If the options permute.ensemble or random.ensemble are used, then VIMP is calculated by comparing the error rate for the perturbed OOB forest ensemble to the unperturbed OOB forest ensemble, where the perturbed ensemble is obtained either by permuting x or by random daughter node assignments for splits on x. Thus, unlike the Breiman-Cutler measure, here VIMP does not measure the tree-average effect of x, but rather its overall forest effect. See Ishwaran et al. (2008) for further details. Finally, the option none turns VIMP off entirely.

    Note that the function vimp provides a friendly user interface for extracting VIMP, as sketched below.
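
    A short sketch of the importance option and the vimp wrapper (mtcars is from base R):

      o <- rfsrc(mpg ~ ., data = mtcars, importance = "permute")
      print(o$importance)        ## Breiman-Cutler permutation VIMP
      print(vimp(o)$importance)  ## extract/recompute VIMP from the grow forest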

  6. Prediction Error: Prediction error is calculated using OOB data. Performance is measured in terms of mean-squared error for regression and misclassification error for classification.

    For survival, prediction error is measured by 1-C, where C is Harrell's (Harrell et al., 1982) concordance index. Prediction error is between 0 and 1, and measures how well the predictor correctly ranks (classifies) two random individuals in terms of survival. A value of 0.5 is no better than random guessing. A value of 0 is perfect.

    When bootstrapping is by.node or none, a coherent OOB subset is not available to assess prediction error. Thus, all outputs dependent on this are suppressed. In such cases, prediction error is only available via classical cross-validation (the user will need to use predict.rfsrc, for example).
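
    A sketch of reading off the OOB prediction error for two families (err.rate is cumulative over trees, so the last entry is the forest-level error):

      o.reg <- rfsrc(mpg ~ ., data = mtcars)
      o.reg$err.rate[o.reg$ntree]    ## OOB mean-squared error

      data(veteran, package = "randomForestSRC")
      o.srv <- rfsrc(Surv(time, status) ~ ., data = veteran)
      o.srv$err.rate[o.srv$ntree]    ## 1 - C; 0.5 is random guessing, 0 is perfect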

  7. Survival, Competing Risks
    1. Survival settings require a time and censoring variable which should be identified in the formula as the response using the standard Surv formula specification. A typical formula call looks like:

      Surv(my.time, my.status) ~ .

      where my.time and my.status are the variable names for the event time and status variable in the user's data set.

    2. For survival forests (Ishwaran et al. 2008), the censoring variable must be coded as a non-negative integer with 0 reserved for censoring and (usually) 1=death (event). The event time must be non-negative.
    3. For competing risk forests (Ishwaran et al., 2014), the implementation is similar to survival, but with the following caveats:
      • Censoring must be coded as a non-negative integer, where 0 indicates right-censoring and non-zero values indicate different event types. While {0,1,2,...,J} is standard, and recommended, events can be coded non-sequentially, although 0 must always be used for censoring.
      • Setting the splitting rule to logrankscore will result in a survival analysis in which all events are treated as if they are the same type (indeed, they will be coerced as such).
      • Generally, competing risks require a larger nodesize than survival settings. A sketch of the response coding follows.
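
      A sketch of the response coding using the wihs data shipped with the package (per the competing risk example below, events 1 and 2 are HAART initiation and AIDS):

        data(wihs, package = "randomForestSRC")
        table(wihs$status)    ## 0 = censored; 1, 2 = the two competing events
        cr.obj <- rfsrc(Surv(time, status) ~ ., data = wihs, ntree = 100)
        cr.obj$family         ## "surv-CR"
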
  8. Missing data imputation: Setting na.action="na.impute" imputes missing data (both x and y-variables) using a modification of the missing data algorithm of Ishwaran et al. (2008). Prior to splitting a node, missing data for a variable are imputed by randomly drawing values from the non-missing in-bag data. The purpose of this imputed data is to make it possible to assign cases to daughter nodes in the event the node is split on a variable with missing data. Imputed data are, however, not used to calculate the split statistic, which uses non-missing data only. Following a node split, imputed data are reset to missing and the process is repeated until terminal nodes are reached. Missing data in terminal nodes are imputed using OOB non-missing terminal node data. For integer valued variables and censoring indicators, imputation uses a maximal class rule, whereas continuous variables and survival time use a mean rule.

    Choosing the option na.action="na.random" implements a cruder version of the missing data algorithm which should be computationally faster. Unlike the default method, data are not imputed as the tree is grown; instead, tree nodes are split using non-missing in-bag data. Following the split of a node, data points with missing values on the variable used to split the node are randomly assigned to daughter nodes. When terminal nodes are reached, missing data are imputed as before by using out-of-bag (OOB) non-missing terminal node data.

    Both missing data algorithms can be iterated by setting nimpute to an integer greater than 1. Often only a few iterations are needed to improve accuracy. When the algorithm is iterated, at the completion of each iteration missing data are imputed using OOB non-missing terminal node data, which are then used as input to grow a new forest. Note that when the algorithm is iterated, a side effect is that missing values in the returned objects xvar and yvar are replaced by imputed values, and imputed objects such as imputed.data are set to NULL. Also keep in mind that if the algorithm is iterated, performance measures such as error rates and VIMP become optimistically biased.

    Regardless of which method is used, records in which all outcome and x-variable information is missing are removed from the forest analysis. Variables having all missing values are also removed.

    See the function impute.rfsrc for a fast impute interface, sketched below.
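
    A sketch of both interfaces; pbc ships with the package, and nimpute = 3 iterates the algorithm, overwriting xvar and yvar with imputed values as noted above:

      data(pbc, package = "randomForestSRC")
      p.imp <- rfsrc(Surv(days, status) ~ ., pbc,
                     na.action = "na.impute", nimpute = 3)
      ## fast standalone imputation (no inference is done)
      pbc.filled <- impute.rfsrc(data = pbc, nsplit = 10)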

  9. Large sample size: For increased efficiency with survival families, users should consider setting ntime to a relatively small value when the sample size (number of rows of the data) is large. This constrains ensemble calculations such as survival functions to a restricted grid of time points of length no more than ntime, which can considerably reduce computational times.
  10. Large number of variables: For increased efficiency when the number of variables is large, set importance="none", which turns off VIMP calculations and can considerably reduce computational times. Note that VIMP can always be recovered later from the grow forest using the function vimp, as sketched below.
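
    A sketch combining the two efficiency options above (pbc as in the Examples below):

      data(pbc, package = "randomForestSRC")
      o.fast <- rfsrc(Surv(days, status) ~ ., pbc,
                      ntime = 50, importance = "none")
      v <- vimp(o.fast)    ## VIMP recovered later from the grow forest
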
  11. Miscellanea
    1. Setting var.used="all.trees" returns a vector of size p where each element is a count of the number of times a split occurred on a variable. If var.used="by.tree", a matrix of size ntree x p is returned; each element [i][j] is the count of the number of times a split occurred on variable [j] in tree [i].
    2. Setting split.depth="all.trees" returns a matrix of size n x p where entry [i][j] is the minimal depth for variable [j] for case [i], averaged over the forest. That is, for case [i], the entry [i][j] records the first time case [i] splits on variable [j], averaged over the forest. If split.depth="by.tree", a three-dimensional array is returned where the third dimension [k] records the tree and the first two coordinates [i][j] record the case and the variable. Thus entry [i][j][k] is the minimal depth for case [i], variable [j], and tree [k].
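
    A sketch of both options (mtcars from base R; dimensions follow the descriptions above):

      o <- rfsrc(mpg ~ ., data = mtcars,
                 var.used = "by.tree", split.depth = "by.tree")
      dim(o$var.used)      ## ntree x p split counts
      dim(o$split.depth)   ## n x p x ntree minimal depth array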

References

Breiman L., Friedman J.H., Olshen R.A. and Stone C.J. Classification and Regression Trees, Belmont, California, 1984.

Breiman L. (2001). Random forests, Machine Learning, 45:5-32.

Cutler A. and Zhao G. (2001). Pert-Perfect random tree ensembles. Comp. Sci. Statist., 33: 490-497.

Gray R.J. (1988). A class of k-sample tests for comparing the cumulative incidence of a competing risk, Ann. Statist., 16: 1141-1154.

Harrell F.E., Califf R.M., Pryor D.B., Lee K.L. and Rosati R.A. (1982). Evaluating the yield of medical tests, J. Amer. Med. Assoc., 247:2543-2546.

Hothorn T. and Lausen B. (2003). On the exact distribution of maximally selected rank statistics, Comp. Statist. Data Anal., 43:121-137.

Ishwaran H. (2007). Variable importance in binary regression trees and forests, Electronic J. Statist., 1:519-537.

Ishwaran H. and Kogalur U.B. (2007). Random survival forests for R, Rnews, 7(2):25-31.

Ishwaran H., Kogalur U.B., Blackstone E.H. and Lauer M.S. (2008). Random survival forests, Ann. App. Statist., 2:841-860.

Ishwaran H., Kogalur U.B., Gorodeski E.Z, Minn A.J. and Lauer M.S. (2010). High-dimensional variable selection for survival data. J. Amer. Statist. Assoc., 105:205-217.

Ishwaran H. (2014). The effect of splitting on random forests. Machine Learning (in press).

Ishwaran H., Gerds T.A., Kogalur U.B., Moore R.D., Gange S.J. and Lau B.M. (2014). Random survival forests for competing risks. Biostatistics (in press).

Lin Y. and Jeon Y. (2006). Random forests and adaptive nearest neighbors, J. Amer. Statist. Assoc., 101:578-590.

LeBlanc M. and Crowley J. (1993). Survival trees by goodness of split, J. Amer. Statist. Assoc., 88:457-467.

Loh W.-Y and Shih Y.-S (1997). Split selection methods for classification trees, Statist. Sinica, 7:815-840.

Mogensen, U.B, Ishwaran H. and Gerds T.A. (2012). Evaluating random forests for survival analysis using prediction error curves, J. Statist. Software, 50(11): 1-23.

Segal M.R. (1988). Regression trees for censored data, Biometrics, 44:35-47.

See Also

find.interaction, impute.rfsrc, max.subtree, plot.competing.risk, plot.rfsrc, plot.survival, plot.variable, predict.rfsrc, print.rfsrc, rf2rfz, rfsrcSyn, stat.split, var.select, vimp

Examples

##------------------------------------------------------------
## Survival analysis
##------------------------------------------------------------

## veteran data
## randomized trial of two treatment regimens for lung cancer
data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100)

# print and plot the grow object
print(v.obj)
plot(v.obj)

# plot survival curves for first 10 individuals: direct way
matplot(v.obj$time.interest, 100 * t(v.obj$survival[1:10, ]),
    xlab = "Time", ylab = "Survival", type = "l", lty = 1)

# plot survival curves for first 10 individuals
# indirect way: using plot.survival (also generates hazard plots)
plot.survival(v.obj, subset = 1:10, haz.model = "ggamma")

## Primary biliary cirrhosis (PBC) of the liver

data(pbc, package = "randomForestSRC")
pbc.obj <- rfsrc(Surv(days, status) ~ ., pbc, nsplit = 10)
print(pbc.obj)


##------------------------------------------------------------
## Example of imputation in survival analysis
##------------------------------------------------------------

data(pbc, package = "randomForestSRC")
pbc.obj2 <- rfsrc(Surv(days, status) ~ ., pbc,
           nsplit = 10, na.action = "na.impute")


# here's a nice wrapper to combine original data + imputed data
combine.impute <- function(object) {
 impData <- cbind(object$yvar, object$xvar)
 if (!is.null(object$imputed.indv)) {
   impData[object$imputed.indv, ] <- object$imputed.data
 }
 impData
}

# combine original data + imputed data
pbc.imp.data <- combine.impute(pbc.obj2)

# same as above but we iterate the missing data algorithm
pbc.obj3 <- rfsrc(Surv(days, status) ~ ., pbc, nsplit=10,
         na.action = "na.impute", nimpute = 3)
pbc.iterate.imp.data <- combine.impute(pbc.obj3)

# fast way to impute the data (no inference is done)
# see impute.rfsrc for more details
pbc.fast.imp.data <- impute.rfsrc(data = pbc, nsplit = 10, nimpute = 5)

##------------------------------------------------------------
## Compare RF-SRC to Cox regression
## Illustrates C-index and Brier score measures of performance
## requires the "pec", "survival", "prodlim" and "Hmisc" packages
##------------------------------------------------------------

if (library("survival", logical.return = TRUE)
    & library("pec", logical.return = TRUE)
    & library("prodlim", logical.return = TRUE)
    & library("Hmisc", logical.return = TRUE))  
{
  ##prediction function required for pec
  predictSurvProb.rfsrc <- function(object, newdata, times, ...){
    ptemp <- predict(object,newdata=newdata,...)$survival
    pos <- sindex(jump.times = object$time.interest, eval.times = times)
    p <- cbind(1,ptemp)[, pos + 1]
    if (NROW(p) != NROW(newdata) || NCOL(p) != length(times))
      stop("Prediction failed")
    p
  }

  ## data, formula specifications
  data(pbc, package = "randomForestSRC")
  pbc.na <- na.omit(pbc)  ##remove NA's
  surv.f <- as.formula(Surv(days, status) ~ .)
  pec.f <- as.formula(Hist(days,status) ~ 1)

  ## run cox/rfsrc models
  ## for illustration we use a small number of trees
  cox.obj <- coxph(surv.f, data = pbc.na)
  rfsrc.obj <- rfsrc(surv.f, pbc.na, nsplit = 10, ntree = 150)

  ## compute bootstrap cross-validation estimate of expected Brier score
  ## see Mogensen, Ishwaran and Gerds (2012) Journal of Statistical Software
  set.seed(17743)
  prederror.pbc <- pec(list(cox.obj,rfsrc.obj), data = pbc.na, formula = pec.f,
                        splitMethod = "bootcv", B = 50)
  print(prederror.pbc)
  plot(prederror.pbc)

  ## compute out-of-bag C-index for cox regression and compare to rfsrc
  rfsrc.obj <- rfsrc(surv.f, pbc.na, nsplit = 10)
  cat("out-of-bag Cox Analysis ...", "\n")
  cox.err <- sapply(1:100, function(b) {
    if (b%%10 == 0) cat("cox bootstrap:", b, "\n")
    train <- sample(1:nrow(pbc.na), nrow(pbc.na), replace = TRUE)
    cox.obj <- tryCatch({coxph(surv.f, pbc.na[train, ])}, error=function(ex){NULL})
    if (is.list(cox.obj)) {
      rcorr.cens(predict(cox.obj, pbc.na[-train, ]),
                 Surv(pbc.na$days[-train],
                 pbc.na$status[-train]))[1]
    } else NA
  })
  cat("\n\tOOB error rates\n\n")
  cat("\tRSF            : ", rfsrc.obj$err.rate[rfsrc.obj$ntree], "\n")
  cat("\tCox regression : ", mean(cox.err, na.rm = TRUE), "\n")
}

##------------------------------------------------------------
## Competing risks
##------------------------------------------------------------

## WIHS analysis
## cumulative incidence function (CIF) for HAART and AIDS stratified by IDU

data(wihs, package = "randomForestSRC")
wihs.obj <- rfsrc(Surv(time, status) ~ ., wihs, nsplit = 3, ntree = 100)
plot.competing.risk(wihs.obj)
cif <- wihs.obj$cif
Time <- wihs.obj$time.interest
idu <- wihs$idu
cif.haart <- cbind(apply(cif[,,1][idu == 0,], 2, mean),
                   apply(cif[,,1][idu == 1,], 2, mean))
cif.aids  <- cbind(apply(cif[,,2][idu == 0,], 2, mean),
                   apply(cif[,,2][idu == 1,], 2, mean))
matplot(Time, cbind(cif.haart, cif.aids), type = "l",
        lty = c(1,2,1,2), col = c(4, 4, 2, 2), lwd = 3,
        ylab = "Cumulative Incidence")
legend("topleft",
       legend = c("HAART (Non-IDU)", "HAART (IDU)", "AIDS (Non-IDU)", "AIDS (IDU)"),
       lty = c(1,2,1,2), col = c(4, 4, 2, 2), lwd = 3, cex = 1.5)


## illustrates the various splitting rules
## illustrates event specific and non-event specific variable selection
if (library("survival", logical.return = TRUE)) {

  ## use the pbc data from the survival package
  ## events are transplant (1) and death (2)
  data(pbc, package = "survival")
  pbc$id <- NULL

  ## modified Gray's weighted log-rank splitting
  pbc.cr <- rfsrc(Surv(time, status) ~ ., pbc, nsplit = 10)

  ## log-rank event-one specific splitting
  pbc.log1 <- rfsrc(Surv(time, status) ~ ., pbc, nsplit = 10,
              splitrule = "logrank", cause = c(1,0))

  ## log-rank event-two specific splitting
  pbc.log2 <- rfsrc(Surv(time, status) ~ ., pbc, nsplit = 10,
              splitrule = "logrank", cause = c(0,1))

  ## extract VIMP from the log-rank forests: event-specific
  ## extract minimal depth from the Gray log-rank forest: non-event specific
  var.perf <- data.frame(md = max.subtree(pbc.cr)$order[, 1],
                         vimp1 = pbc.log1$importance[,1],
                         vimp2 = pbc.log2$importance[,2])
  print(var.perf[order(var.perf$md), ])

}



## ------------------------------------------------------------
## Regression analysis
## ------------------------------------------------------------

## New York air quality measurements
airq.obj <- rfsrc(Ozone ~ ., data = airquality, na.action = "na.impute")

# partial plot of variables (see plot.variable for more details)
plot.variable(airq.obj, partial = TRUE, smooth.lines = TRUE)

## motor trend cars
mtcars.obj <- rfsrc(mpg ~ ., data = mtcars)

# minimal depth variable selection via max.subtree
md.obj <- max.subtree(mtcars.obj)
cat("top variables:\n")
print(md.obj$topvars)

# equivalent way to select variables
# see var.select for more details
vs.obj <- var.select(object = mtcars.obj)


## ------------------------------------------------------------
## Classification analysis
## ------------------------------------------------------------

## Edgar Anderson's iris data
iris.obj <- rfsrc(Species ~ ., data = iris)

## Wisconsin prognostic breast cancer data
data(breast, package = "randomForestSRC")
breast.obj <- rfsrc(status ~ ., data = breast, nsplit = 10)
plot(breast.obj)
