predict.rfsrc: Prediction for Random Forests for Survival, Regression, and Classification

Description

Obtain predicted values using a forest. Also returns performance values if the test data contains y-outcomes.

Usage

## S3 method for class 'rfsrc':
predict(object, newdata,
  importance = c("permute", "random", "permute.ensemble", "random.ensemble", "none"),
  na.action = c("na.omit", "na.impute"), outcome = c("train", "test"),
  proximity = FALSE, var.used = c(FALSE, "all.trees", "by.tree"),
  split.depth = c(FALSE, "all.trees", "by.tree"), seed = NULL,
  do.trace = FALSE, membership = TRUE, 
  ...)

Arguments

object

An object of class (rfsrc, grow) or (rfsrc, forest). Requires forest = TRUE in the original rfsrc call.

newdata

Test data. If missing, the original grow (training) data is used.

importance

Method for computing variable importance (VIMP). See rfsrc for details. Only applies when the test data contains y-outcome values.

na.action

Missing value action. The default na.omit removes the entire record if even one of its entries is NA. Use na.impute to impute the test data.

outcome

Determines whether the y-outcomes from the training data or the test data are used to calculate the predicted value. The default and natural choice is train, which uses the original training data. Note that this option is ignored when newdata is missing, since the training data is used in that case.

proximity

Should proximity measure between test observations be calculated? Can be large.

var.used

Record the number of times a variable is split?

split.depth

Return minimal depth for each variable for each case?

seed

Negative integer specifying seed for the random number generator.

do.trace

Should trace output be enabled? Integer values can also be passed. A positive value causes output to be printed each do.trace iteration.

membership

Should terminal node membership and inbag information be returned?

...

Further arguments passed to or from other methods.

Value

An object of class (rfsrc, predict), which is a list with the following components:
call            The original grow call to rfsrc.
family          The family used in the analysis.
n               Sample size of test data (depends upon NA values).
ntree           Number of trees in the grow forest.
yvar            Test set y-outcomes or original grow y-outcomes if none.
yvar.names      A character vector of the y-outcome names.
xvar            Data frame of test set x-variables.
xvar.names      A character vector of the x-variable names.
leaf.count      Number of terminal nodes for each tree in the grow forest. Vector of length ntree.
forest          The grow forest.
proximity       Symmetric proximity matrix of the test data.
membership      Matrix recording terminal node membership for the test data where each column contains the node number that a case falls in for that tree.
inbag           Matrix recording inbag membership for the test data where each column contains the number of times that a case appears in the bootstrap sample for that tree.
imputed.indv    Vector of indices of records in test data with missing values.
imputed.data    Data frame comprising imputed test data. First columns are the y-outcomes.
split.depth     Matrix [i][j] or array [i][j][k] recording the minimal depth for variable [j] for case [i], either averaged over the forest, or by tree [k].
err.rate        Cumulative OOB error rate for the test data if y-outcomes are present.
importance      Test set variable importance (VIMP). Can be NULL.
predicted       Test set predicted value.
predicted.oob   OOB predicted value (NULL unless outcome = "test").

class: for classification settings, additionally the following:

class           In-bag predicted class labels.
class.oob       OOB predicted class labels (NULL unless outcome = "test").

surv: for survival settings, additionally the following:

chf             Cumulative hazard function (CHF).
chf.oob         OOB CHF (NULL unless outcome = "test").
survival        Survival function.
survival.oob    OOB survival function (NULL unless outcome = "test").
time.interest   Ordered unique death times.
ndead           Number of deaths.

surv-CR: for competing risks, additionally the following:

chf             Cause-specific cumulative hazard function (CSCHF) for each event.
chf.oob         OOB CSCHF for each event (NULL unless outcome = "test").
cif             Cumulative incidence function (CIF) for each event.
cif.oob         OOB CIF for each event (NULL unless outcome = "test").
time.interest   Ordered unique event times.
ndead           Number of events.
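
The components listed above are ordinary list elements and can be read with `$`. The following is a minimal sketch for a classification forest, assuming the randomForestSRC package is installed; the object names (iris.grow, iris.pred), the 100/50 split, and ntree = 50 are arbitrary choices made for illustration only.

```r
library(randomForestSRC)

set.seed(1)
# hold out 50 of the 150 iris rows as test data
train <- sample(seq_len(nrow(iris)), 100)
iris.grow <- rfsrc(Species ~ ., data = iris[train, ], ntree = 50)
iris.pred <- predict(iris.grow, iris[-train, ])

# family used in the analysis
iris.pred$family

# predicted values: one row per test case
head(iris.pred$predicted)

# predicted class labels compared with the observed test labels
table(observed = iris[-train, "Species"], predicted = iris.pred$class)
```

Components that do not apply to the family at hand (for example chf in a classification forest) are simply absent from the returned list.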

Details

Predicted values are obtained by dropping test data down the grow forest (the forest grown using the training data). The overall error rate and VIMP are also returned if the test data contains y-outcome values. Single as well as joint VIMP measures can be requested. Calculating VIMP can be computationally expensive, especially when the dimension is high; if VIMP is not needed, setting importance = "none" turns it off entirely and can significantly reduce computation time.

Setting na.action = "na.impute" imputes missing test data (x-variables and/or y-outcomes). Imputation uses the grow forest, so that only training data is used when imputing test data; this avoids biasing error rates and VIMP (Ishwaran et al. 2008).

If outcome = "test", the predictor is calculated using y-outcomes from the test data (outcome information must be present). In this case, the terminal nodes of the grow forest are recalculated using the y-outcomes from the test set. This yields a modified predictor in which the topology of the forest is based solely on the training data, but the predicted value is based on the test data. Error rates and VIMP are calculated by bootstrapping the test data and using out-of-bagging to ensure unbiased estimates. See the examples for illustration.
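
As a rough illustration of the options discussed in this section, the sketch below turns VIMP off and imputes missing test values in a single predict call. It assumes the randomForestSRC package is installed and uses the airquality data from base R; the object names, the split size, and ntree = 50 are arbitrary and not part of the package.

```r
library(randomForestSRC)

set.seed(1)
# airquality contains missing values in Ozone and Solar.R
train <- sample(seq_len(nrow(airquality)), 100)
airq.grow <- rfsrc(Ozone ~ ., data = airquality[train, ],
                   ntree = 50, na.action = "na.impute")

# impute missing test values and skip the VIMP calculation
airq.pred <- predict(airq.grow, airquality[-train, ],
                     importance = "none", na.action = "na.impute")

# VIMP was not requested, so this component is NULL
is.null(airq.pred$importance)

# one predicted value per retained test record
length(airq.pred$predicted) == airq.pred$n
```

Leaving na.action at its default would instead drop every test record with a missing entry, so the reported sample size n can differ between the two settings.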

References

Breiman L. (2001). Random forests, Machine Learning, 45:5-32.

Ishwaran H., Kogalur U.B., Blackstone E.H. and Lauer M.S. (2008). Random survival forests, Ann. App. Statist., 2:841-860.

Ishwaran H. and Kogalur U.B. (2007). Random survival forests for R, Rnews, 7(2):25-31.

Examples

## ------------------------------------------------------------
## typical train/testing scenario
## ------------------------------------------------------------

data(veteran, package = "randomForestSRC")
train <- sample(1:nrow(veteran), round(nrow(veteran) * 0.80))
veteran.grow <- rfsrc(Surv(time, status) ~ ., veteran[train, ], ntree = 100) 
veteran.pred <- predict(veteran.grow, veteran[-train , ])
print(veteran.grow)
print(veteran.pred)


## ------------------------------------------------------------
## predicted probability and predicted class labels are returned
## in the predict object for classification analyses
## ------------------------------------------------------------

data(breast, package = "randomForestSRC")
breast.obj <- rfsrc(status ~ ., data = breast[(1:100), ], nsplit = 10)
breast.pred <- predict(breast.obj, breast[-(1:100), ])
head(breast.pred$predicted)
breast.pred$class


## ------------------------------------------------------------
## unique feature of randomForestSRC
## cross-validation can be used when factor labels differ over
## training and test data
## ------------------------------------------------------------

# first we convert all x-variables to factors
data(veteran, package = "randomForestSRC")
veteran.factor <- data.frame(lapply(veteran, factor))
veteran.factor$time <- veteran$time
veteran.factor$status <- veteran$status

# split the data into unbalanced train/test data (5/95)
# the train/test data have the same levels, but different labels
train <- sample(1:nrow(veteran), round(nrow(veteran) * .05))
summary(veteran.factor[train,])
summary(veteran.factor[-train,])

# grow the forest on the training data and predict on the test data
veteran.f.grow <- rfsrc(Surv(time, status) ~ ., veteran.factor[train, ]) 
veteran.f.pred <- predict(veteran.f.grow, veteran.factor[-train , ])
print(veteran.f.grow)
print(veteran.f.pred)

## ------------------------------------------------------------
## example illustrating the flexibility of outcome = "test"
## shows how to make a call to predict to obtain the same results
## as the original grow call
## ------------------------------------------------------------

# first we make the grow call
data(pbc, package = "randomForestSRC")
pbc.grow <- rfsrc(Surv(days, status) ~ ., pbc, nsplit = 10)

# now use predict with outcome = "test"
pbc.pred <- predict(pbc.grow, pbc, outcome = "test")

# notice that error rates are the same!!
print(pbc.grow)
print(pbc.pred)

# same ... but without using outcome = "test"
# if predict is called without specifying the test data
# the original training data is used and the predict
# and grow results will automatically be the same
pbc.pred2 <- predict(pbc.grow)
print(pbc.grow)
print(pbc.pred)
print(pbc.pred2)

# similar example, but with na.action = "na.impute"
# note that the predict call must also use na.action = "na.impute"
# for the results to be the same
airq.obj <- rfsrc(Ozone ~ ., data = airquality, na.action = "na.impute")
print(airq.obj)
print(predict(airq.obj, na.action = "na.impute"))

# classification example
iris.obj <- rfsrc(Species ~., data = iris)
print(iris.obj)
print(predict(iris.obj))

## ------------------------------------------------------------
## another example illustrating outcome = "test"
## unique way to check reproducibility of the forest
## ------------------------------------------------------------

# primary call
set.seed(542899)
data(pbc, package = "randomForestSRC")
train <- sample(1:nrow(pbc), round(nrow(pbc) * 0.50))
pbc.out <- rfsrc(Surv(days, status) ~ .,  data=pbc[train, ],
        nsplit = 10)

# standard predict call
pbc.train <- predict(pbc.out, pbc[-train, ], outcome = "train")
# non-standard predict call: overlays the test data on the grow forest
pbc.test <- predict(pbc.out, pbc[-train, ], outcome = "test")

# check forest reproducibility by comparing "test" predicted survival
# curves to "train" predicted survival curves for the first 3 individuals
Time <- pbc.out$time.interest
matplot(Time, t(exp(-pbc.train$chf)[1:3,]), ylab = "Survival", col = 1, type = "l")
matlines(Time, t(exp(-pbc.test$chf)[1:3,]), col = 2)
