Fast OpenMP parallel computing of Breiman random forests (Breiman
2001) for regression, classification, survival analysis (Ishwaran
2008), competing risks (Ishwaran 2012), multivariate (Segal and Xiao
2011), unsupervised (Mantero and Ishwaran 2020), quantile regression
(Meinhausen 2006, Zhang et al. 2019), and class imbalanced
q-classification (O'Brien and Ishwaran 2019). Different splitting
rules invoked under deterministic or random splitting (Geurts et
al. 2006, Ishwaran 2015) are available for all families. Variable
importance (VIMP), and holdout VIMP, as well as confidence regions
(Ishwaran and Lu 2019) can be calculated for single and grouped
variables. Fast interface for missing data imputation using a variety
of different random forest methods (Tang and Ishwaran 2017).
Visualize trees on your Safari or Google Chrome browser (works for all
families, see get.tree
).
This package contains many useful functions and users should read the help file in its entirety for details. However, we briefly mention several key functions that may make it easier to navigate and understand the layout of the package.
This is the main entry point to the package. It grows a random forest
using user supplied training data. We refer to the resulting object
as a RF-SRC grow object. Formally, the resulting object has class
(rfsrc, grow)
.
A fast implementation of rfsrc
using subsampling.
Univariate and multivariate quantile regression forest for training and testing. Different methods available including the Greenwald-Khanna (2001) algorithm, which is especially suitable for big data due to its high memory efficiency.
predict.rfsrc
, predict
Used for prediction. Predicted values are obtained by dropping the
user supplied test data down the grow forest. The resulting object
has class (rfsrc, predict)
.
sidClustering.rfsrc
, sidClustering
Clustering of unsupervised data using SID (Staggered Interaction Data). Also implements the artificial two-class approach of Breiman (2003).
Used for variable selection:
vimp
calculates variable imporance (VIMP) from a
RF-SRC grow/predict object by noising up the variable (for example
by permutation). Note that grow/predict calls can always directly
request VIMP.
subsample
calculates VIMP confidence itervals via
subsampling.
holdout.vimp
measures the importance of a variable
when it is removed from the model.
q-classification and G-mean VIMP for class imbalanced data.
Fast imputation mode for RF-SRC. Both rfsrc
and
predict.rfsrc
are capable of imputing missing data.
However, for users whose only interest is imputing data, this function
provides an efficient and fast interface for doing so.
Used to extract the partial effects of a variable or variables on the ensembles.
Regular stable releases of this package are available on CRAN at https://cran.r-project.org/package=randomForestSRC
Interim unstable development builds with bug fixes and sometimes additional functionality are available at https://github.com/kogalur/randomForestSRC
Bugs may be reported via https://github.com/kogalur/randomForestSRC/issues/new. Please provide the accompanying information with any reports:
sessionInfo()
A minimal reproducible example consisting of the following items:
a minimal dataset, necessary to reproduce the error
the minimal runnable code necessary to reproduce the error, which can be run on the given dataset
the necessary information on the used packages, R version and system it is run on
in the case of random processes, a seed (set by set.seed()
) for reproducibility
This package implements OpenMP shared-memory parallel programming if the target architecture and operating system support it. This is the default mode of execution.
Additional instructions for configuring OpenMP parallel processing are available at https://kogalur.github.io/randomForestSRC/building.html.
An understanding of resource utilization (CPU and RAM) is necessary when running the package using OpenMP and Open MPI parallel execution. Memory usage is greater when running with OpenMP enabled. Diligence should be used not to overtax the hardware available.
With respect to reproducibility, a model is defined by a seed, the topology of the trees in the forest, and terminal node membership of the training data. This allows the user to restore a model and, in particular, its terminal node statistics. On the other hand, VIMP and many other statistics are dependent on additional randomization, which we do not consider part of the model. These statistics are susceptible to Monte Carlo effects.
Breiman L. (2001). Random forests, Machine Learning, 45:5-32.
Geurts, P., Ernst, D. and Wehenkel, L., (2006). Extremely randomized trees. Machine learning, 63(1):3-42.
Ishwaran H. and Kogalur U.B. (2007). Random survival forests for R, Rnews, 7(2):25-31.
Ishwaran H. (2007). Variable importance in binary regression trees and forests, Electronic J. Statist., 1:519-537.
Ishwaran H., Kogalur U.B., Blackstone E.H. and Lauer M.S. (2008). Random survival forests, Ann. App. Statist., 2:841-860.
Ishwaran H., Kogalur U.B., Gorodeski E.Z, Minn A.J. and Lauer M.S. (2010). High-dimensional variable selection for survival data. J. Amer. Statist. Assoc., 105:205-217.
Ishwaran H., Kogalur U.B., Chen X. and Minn A.J. (2011). Random survival forests for high-dimensional data. Stat. Anal. Data Mining, 4:115-132
Ishwaran H., Gerds T.A., Kogalur U.B., Moore R.D., Gange S.J. and Lau B.M. (2014). Random survival forests for competing risks. Biostatistics, 15(4):757-773.
Ishwaran H. and Malley J.D. (2014). Synthetic learning machines. BioData Mining, 7:28.
Ishwaran H. (2015). The effect of splitting on random forests. Machine Learning, 99:75-118.
Ishwaran H. and Lu M. (2019). Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Statistics in Medicine, 38, 558-582.
Lu M., Sadiq S., Feaster D.J. and Ishwaran H. (2018). Estimating individual treatment effect in observational data using random forest methods. J. Comp. Graph. Statist, 27(1), 209-219
Mantero A. and Ishwaran H. (2020). Unsupervised random forests. To appear in Statistical Analysis and Data Mining.
Meinshausen N. (2006) Quantile regression forests, Journal of Machine Learning Research, 7:983-999.
O'Brien R. and Ishwaran H. (2019). A random forests quantile classifier for class imbalanced data. Pattern Recognition, 90, 232-249
Segal M.R. and Xiao Y. Multivariate random forests. (2011). Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 1(1):80-87.
Tang F. and Ishwaran H. (2017). Random forest missing data algorithms. Statistical Analysis and Data Mining, 10:363-377.
Zhang H., Zimmerman J., Nettleton D. and Nordman D.J. (2019). Random forest prediction intervals. The American Statistician. 4:1-5.
imbalanced.rfsrc
,
impute.rfsrc
,
partial.rfsrc
,
plot.competing.risk.rfsrc
,
plot.rfsrc
,
plot.survival.rfsrc
,
plot.variable.rfsrc
,
predict.rfsrc
,
print.rfsrc
,
rfsrc
,
rfsrc.cart
,
rfsrc.fast
,