rfsrc_data: Cached `randomForestSRC::rfsrc` objects for examples, diagnostics and vignettes.

Description

Data sets storing randomForestSRC::rfsrc objects corresponding to training data according to the following naming convention:

rfsrc_iris - randomForestSRC for the iris data set.
rfsrc_airq - randomForestSRC for the airquality data set.
rfsrc_mtcars - randomForestSRC for the mtcars data set.
rfsrc_Boston - randomForestSRC for the Boston housing data set (MASS package).
rfsrc_pbc - randomForestSRC for the pbc data set (randomForestSRC package).
rfsrc_veteran - randomForestSRC for the veteran data set (randomForestSRC package).

Arguments

format

randomForestSRC::rfsrc object

Details

Constructing random forests is computationally expensive. We cache randomForestSRC::rfsrc objects to reduce the run times of the ggRandomForests examples, diagnostics and vignettes (see rebuild_cache_datasets to rebuild a complete set of these data sets).
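
The caching idea can be sketched in a few lines of base R. This is only an illustration, not the implementation of rebuild_cache_datasets: the helper name cache_object, the temporary file, and the lm() stand-in for an expensive rfsrc() call are all assumptions, chosen so the sketch runs without extra packages.

```r
# Hypothetical helper (not part of ggRandomForests): build an object once,
# save it with saveRDS(), and read it back on later calls.
cache_object <- function(path, build) {
  if (file.exists(path)) {
    return(readRDS(path))   # cache hit: skip the expensive build
  }
  obj <- build()            # cache miss: build and store
  saveRDS(obj, path)
  obj
}

# Count how often the builder actually runs.
counter <- new.env()
counter$n <- 0

build_model <- function() {
  counter$n <- counter$n + 1
  # Stand-in for an expensive call such as
  # randomForestSRC::rfsrc(Species ~ ., data = iris)
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}

cache_file <- tempfile(fileext = ".rds")
first  <- cache_object(cache_file, build_model)  # builds and saves
second <- cache_object(cache_file, build_model)  # reads the saved copy
counter$n  # number of times build_model() ran
```

Swapping the body of build_model() for a real rfsrc() call would give the same build-once, reuse-later behaviour for a forest.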

For each data set listed, we build a randomForestSRC::rfsrc object. The tuning parameters used in each case are documented in the examples. Each data set is built by rebuild_cache_datasets using the randomForestSRC version listed in the ggRandomForests DESCRIPTION file.
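
Because a cached forest reflects the randomForestSRC version it was built with, it can help to check which versions are installed before relying on a cached object. A minimal sketch follows; installed_version is a hypothetical helper, not a ggRandomForests function.

```r
# Hypothetical helper: report the installed version of a package,
# or NA if the package is not installed.
installed_version <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(NA_character_)
  }
  as.character(utils::packageVersion(pkg))
}

installed_version("randomForestSRC")
installed_version("ggRandomForests")
```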

rfsrc_iris - The famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. Build a classification random forest for predicting the species (setosa, versicolor, and virginica) on 5 variables (columns) and 150 observations (rows).
rfsrc_airq - The airquality data set is from the New York State Department of Conservation (ozone data) and the National Weather Service (meteorological data), collected in New York from May to September 1973. Build a regression random forest for predicting Ozone on 5 covariates and 153 observations.
rfsrc_mtcars - The mtcars data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). Build a regression random forest for predicting mpg on 10 covariates and 32 observations.
rfsrc_Boston - The Boston data set of housing values in suburbs of Boston, from the MASS package. Build a regression random forest for predicting medv (median home values) on 13 covariates and 506 observations.
rfsrc_pbc - The pbc data from the Mayo Clinic trial in primary biliary cirrhosis (PBC) of the liver conducted between 1974 and 1984. A total of 424 PBC patients, referred to Mayo Clinic during that ten-year interval, met eligibility criteria for the randomized placebo-controlled trial of the drug D-penicillamine. The 312 cases that participated in the randomized trial contain largely complete data. Data from the randomForestSRC package. Build a survival random forest for time-to-event death data with 17 covariates and 312 observations (the remaining 106 observations are held out).
rfsrc_veteran - Veteran's Administration randomized trial of two treatment regimens for lung cancer. Build a survival random forest for time-to-event death data with 6 covariates and 137 observations.
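
The row and column counts quoted above for the data sets that ship with base R can be checked directly. A small sketch (Boston, pbc and veteran need the MASS and randomForestSRC packages and are left out here); each column count includes the response, so the number of covariates is one less.

```r
# Rows and columns of the three data sets that ship with base R.
dims <- rbind(
  iris       = dim(iris),
  airquality = dim(airquality),
  mtcars     = dim(mtcars)
)
colnames(dims) <- c("rows", "columns")
dims

# Missing values per column in airquality. Ozone and Solar.R contain NAs,
# which is presumably why the airquality example uses na.action = "na.impute".
colSums(is.na(airquality))
```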

References

#--------------------- randomForestSRC ---------------------

Ishwaran H. and Kogalur U.B. (2014). Random Forests for Survival, Regression and Classification (RF-SRC), R package version 1.5.5.

Ishwaran H. and Kogalur U.B. (2007). Random survival forests for R. R News 7(2), 25-31.

Ishwaran H., Kogalur U.B., Blackstone E.H. and Lauer M.S. (2008). Random survival forests. Ann. Appl. Statist. 2(3), 841-860.

#--------------------- airquality data set ---------------------

Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983) Graphical Methods for Data Analysis. Belmont, CA: Wadsworth.

#--------------------- Boston data set ---------------------

Belsley, D.A., E. Kuh, and R.E. Welsch. 1980. Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

Harrison, D., and D.L. Rubinfeld. 1978. "Hedonic Prices and the Demand for Clean Air." J. Environ. Economics and Management 5: 81-102.

#--------------------- Iris data set ---------------------

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole. (has iris3 as iris.)

Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179-188.

Anderson, Edgar (1935). The irises of the Gaspe Peninsula, Bulletin of the American Iris Society, 59, 2-5.

#--------------------- mtcars data set ---------------------

Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391-411.

#--------------------- pbc data set ---------------------

Fleming T.R. and Harrington D.P. (1991) Counting Processes and Survival Analysis. New York: Wiley.

T Therneau and P Grambsch (2000), Modeling Survival Data: Extending the Cox Model, Springer-Verlag, New York. ISBN: 0-387-98784-3.

#--------------------- veteran data set ---------------------

Kalbfleisch J. and Prentice R. (1980) The Statistical Analysis of Failure Time Data. New York: Wiley.

Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.

Examples


#---------------------------------------------------------------------
# iris data - classification random forest
#---------------------------------------------------------------------
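# Load the packages used throughout these examples
library(randomForestSRC)
library(ggRandomForests)

# rfsrc grow call
rfsrc_iris <- rfsrc(Species ~ ., data = iris)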

# plot the forest generalization error convergence
gg_dta <- gg_error(rfsrc_iris)
plot(gg_dta)

# Plot the forest predictions
gg_dta <- gg_rfsrc(rfsrc_iris)
plot(gg_dta)

#---------------------------------------------------------------------
# airq data - regression random forest
#---------------------------------------------------------------------
# rfsrc grow call
rfsrc_airq <- rfsrc(Ozone ~ ., data = airquality,
                    na.action = "na.impute")

# plot the forest generalization error convergence
gg_dta <- gg_error(rfsrc_airq)
plot(gg_dta)

# Plot the forest predictions
gg_dta <- gg_rfsrc(rfsrc_airq)
plot(gg_dta)

#---------------------------------------------------------------------
# mtcars data - regression random forest
#---------------------------------------------------------------------
# rfsrc grow call
rfsrc_mtcars <- rfsrc(mpg ~ ., data = mtcars)

# plot the forest generalization error convergence
gg_dta <- gg_error(rfsrc_mtcars)
plot(gg_dta)

# Plot the forest predictions
gg_dta <- gg_rfsrc(rfsrc_mtcars)
plot(gg_dta)

#---------------------------------------------------------------------
# MASS::Boston data - regression random forest
#---------------------------------------------------------------------
# Load the data...
data(Boston, package="MASS")
Boston$chas <- as.logical(Boston$chas)

# rfsrc grow call
rfsrc_Boston <- rfsrc(medv~., data=Boston)

# plot the forest generalization error convergence
gg_dta <- gg_error(rfsrc_Boston)
plot(gg_dta)

# Plot the forest predictions
gg_dta <- gg_rfsrc(rfsrc_Boston)
plot(gg_dta)

#---------------------------------------------------------------------
# randomForestSRC::pbc data - survival random forest
#---------------------------------------------------------------------
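# Load the data...
# For simplicity, we do a bit of data tidying
# before growing the stored random forest.
data(pbc, package="randomForestSRC")

# Assumption: the raw pbc data records follow-up time in `days`, so the
# `years` variable used in the formula below is derived here.
pbc$years <- pbc$days / 365.25
pbc$days <- NULL

# Remove non-randomized cases (those with a missing treatment)
dta.train <- pbc[!is.na(pbc$treatment), ]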

# rfsrc grow call
rfsrc_pbc <- rfsrc(Surv(years, status) ~ ., dta.train, nsplit = 10,
                   na.action="na.impute")

# plot the forest generalization error convergence
gg_dta <- gg_error(rfsrc_pbc)
plot(gg_dta)

# Plot the forest predictions
gg_dta <- gg_rfsrc(rfsrc_pbc)
plot(gg_dta)

#---------------------------------------------------------------------
# randomForestSRC::veteran data - survival random forest
#---------------------------------------------------------------------
# Load the data...
# For simplicity, we do a bit of data tidying
# before growing the stored random forest.
data(veteran, package="randomForestSRC")

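# rfsrc grow call
# (`...` stands for tuning arguments that are not listed here;
# replace or remove it before running)
rfsrc_veteran <- rfsrc(Surv(time, status) ~ ., data = veteran, ...)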

# plot the forest generalization error convergence
gg_dta <- gg_error(rfsrc_veteran)
plot(gg_dta)

# Plot the forest predictions
gg_dta <- gg_rfsrc(rfsrc_veteran)
plot(gg_dta)
