This package provides a unified treatment of Breiman's random forests (Breiman 2001) for a variety of data settings. Regression and classification forests are grown when the response is numeric or categorical (factor), while survival and competing risk forests (Ishwaran et al. 2008, 2014) are grown for right-censored survival data. Multivariate regression and classification responses, as well as mixed outcomes (regression/classification responses), are also handled, as are unsupervised forests. Different splitting rules invoked under deterministic or random splitting are available for all families. Variable predictiveness can be assessed using variable importance (VIMP) measures for single as well as grouped variables. Variable selection is implemented using minimal depth variable selection (Ishwaran et al. 2010). Missing data (for x-variables and y-outcomes) can be imputed on both training and test data. The underlying code is based on Ishwaran and Kogalur's now retired randomSurvivalForest package (Ishwaran and Kogalur 2007), and has been significantly refactored for improved computational speed.
This package implements OpenMP shared-memory parallel programming. However, the default installation will only execute serially. To utilize OpenMP, the target architecture and operating system must support it.
There are three strategies for installing the OpenMP capable version of the CRAN package. Method 1 relies on having full package development prerequisites, that is, the build environment necessary for creating R packages from source. This is the preferred and most comprehensive way to guarantee natively compiled, compatible and optimized binaries for your system. Method 2 relies on having partial package development prerequisites. It makes some assumptions about your system that are not entirely platform independent, but it will usually work. Method 3 is the easiest and does not require package development prerequisites. It relies on pre-built binaries and is intended for users who do not want to invest the time needed to build packages natively. We do not recommend this method, as it guarantees neither OpenMP execution nor that our binaries will be compatible with your system. However, we provide the binaries as a convenience.
METHOD | R Development Toolset | Difficulty | OpenMP Execution
1 | Full | High | Guaranteed
2 | Partial | Medium | Good Success
3 | None | Low | Moderate Success
The core software development utilities required for R package development vary by operating system, as does the difficulty of installing this build environment. Unix-based systems are the friendliest, followed by Mac OS X, followed lastly by Windows. Detailed instructions for setting up this environment are available on a number of sites online and will not be reproduced here. Once the R package development environment is in place, Method 1 builds the package natively on your platform using the following steps:
1. Download the package source code randomForestSRC_X.x.x.tar.gz from CRAN at https://cran.r-project.org/package=randomForestSRC. The X's indicate the version posted. Do not download the binary!
2. Open a console, navigate to the directory containing the tarball, and untar it using the command
   tar -xvf randomForestSRC_X.x.x.tar.gz
   This will create a directory structure with the root directory of the package named randomForestSRC.
3. Change into the root directory of the package using the command
   cd randomForestSRC
4. Run autoconf using the command
   autoconf
5. Change back to your working directory using the command
   cd ..
6. From your working directory, execute the command
   R CMD INSTALL --preclean --clean randomForestSRC
   on the modified package. Ensure that you do not target the unmodified tarball, but instead act on the directory structure you just modified.
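The Method 1 steps can be collected into a single script. The following is a minimal sketch, not an official installer: the variable names PKG_VERSION and TARBALL and the function name build_from_source are illustrative assumptions, and the X.x.x placeholder must be replaced with the version actually downloaded. It assumes tar, autoconf and R are already on your PATH.

```shell
#!/bin/sh
# Sketch of Method 1: build randomForestSRC from the CRAN source tarball.
# PKG_VERSION is a placeholder (assumption); set it to the version you downloaded.
PKG_VERSION="${PKG_VERSION:-X.x.x}"
TARBALL="randomForestSRC_${PKG_VERSION}.tar.gz"

build_from_source() {
    # Unpack the source tree into ./randomForestSRC
    tar -xvf "$TARBALL" || return 1
    # Run autoconf inside the package root; the subshell returns us to the
    # working directory afterwards, replacing the explicit "cd .." step
    (cd randomForestSRC && autoconf) || return 1
    # Install from the modified directory, not from the unmodified tarball
    R CMD INSTALL --preclean --clean randomForestSRC
}

if [ -f "$TARBALL" ]; then
    build_from_source
else
    echo "Tarball $TARBALL not found; download it from CRAN first." >&2
fi
```

Running the script from the directory that holds the tarball performs the untar, autoconf and install steps in order and stops at the first failure.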
Method 2 hard codes some OpenMP compiler directives. On Windows systems this method is generally easier than Method 1. The instructions for this method follow:
1. Download the package source code randomForestSRC_X.x.x.tar.gz from CRAN at https://cran.r-project.org/package=randomForestSRC. The X's indicate the version posted. Do not download the binary!
2. Open a console, navigate to the directory containing the tarball, and untar it using the command
   tar -xvf randomForestSRC_X.x.x.tar.gz
   This will create a directory structure with the root directory of the package named randomForestSRC.
3. Download the Makevars file containing the custom compiler directives from http://www.ccs.miami.edu/~hishwaran/rfsrc/Makevars. Copy it into the directory randomForestSRC/src. On Windows systems, take the additional step of renaming it to Makevars.win.
4. From your working directory, execute the command
   R CMD INSTALL --preclean --clean randomForestSRC
   on the modified package. Ensure that you do not target the unmodified tarball, but instead act on the directory structure you just modified.
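As a rough illustration, the Method 2 steps can be scripted as follows. This is a sketch under stated assumptions: the helper names makevars_name and install_with_makevars are hypothetical, curl is assumed to be installed, and Windows is detected through uname output, which only applies when the script runs in a Unix-like shell such as MSYS or Cygwin.

```shell
#!/bin/sh
# Sketch of Method 2: add the custom Makevars file, then install.
# PKG_VERSION is a placeholder (assumption); set it to the version you downloaded.
PKG_VERSION="${PKG_VERSION:-X.x.x}"
TARBALL="randomForestSRC_${PKG_VERSION}.tar.gz"
MAKEVARS_URL="http://www.ccs.miami.edu/~hishwaran/rfsrc/Makevars"

# Choose the file name for the directives: Makevars.win on Windows,
# Makevars elsewhere (detection via uname is an assumption of this sketch)
makevars_name() {
    case "$(uname -s)" in
        MINGW*|MSYS*|CYGWIN*) echo "Makevars.win" ;;
        *) echo "Makevars" ;;
    esac
}

install_with_makevars() {
    # Unpack the source tree into ./randomForestSRC
    tar -xvf "$TARBALL" || return 1
    # Download the compiler directives straight into the src directory
    curl -fsSL -o "randomForestSRC/src/$(makevars_name)" "$MAKEVARS_URL" || return 1
    # Install from the modified directory, not from the unmodified tarball
    R CMD INSTALL --preclean --clean randomForestSRC
}

if [ -f "$TARBALL" ]; then
    install_with_makevars
else
    echo "Tarball $TARBALL not found; download it from CRAN first." >&2
fi
```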
Method 3, the easiest but least recommended way to install the OpenMP version of the package, is to download our pre-compiled binaries and attempt to use them on your system. Successful execution is not guaranteed. The instructions for this method follow:
1. Download the platform specific binary file randomForestSRC_X.x.x.<extension> from http://www.ccs.miami.edu/~hishwaran/rfsrc.html.
2. From the R GUI, navigate to "Install Packages" (or similar) and select "From Local Archive" (or similar). Then navigate to the downloaded binary and click "Install". If you are not using the R GUI, navigate to the directory where you downloaded the compressed binary file and execute the command
   R CMD INSTALL randomForestSRC_X.x.x.zip
   replacing .zip with the extension of the file you downloaded.
There are several ways to control the number of CPU cores that the package accesses during OpenMP parallel execution. First, you will need to determine the number of cores on your local machine. Do this by starting an R session and issuing the command detectCores(). You will require the parallel package for this. Then you can do the following:
At the start of every R session, you can set the number of cores accessed during OpenMP parallel execution by issuing the command options(rf.cores = x), where x is the number of cores. If x is a negative number, the package will access the maximum number of cores on your machine. The options command can also be placed in the user's .Rprofile file for convenience. Alternatively, you can initialize the environment variable RF_CORES in your shell environment.
If left unspecified, rf.cores defaults to -1 (-1L), which uses all available cores, with a minimum of two.
The package also implements R-side parallel processing by replacing the R function lapply with mclapply, found in the parallel package. You can set the number of cores accessed by mclapply by issuing the command options(mc.cores = x), where x is the number of cores. The options command can also be placed in the user's .Rprofile file for convenience. Alternatively, you can initialize the environment variable MC_CORES in your shell environment. See the help files in parallel for more information.
The default value for mclapply on non-Windows systems is two (2L) cores. On Windows systems, the default value is one (1L) core.
As an example, issuing the following options command uses all available cores for both OpenMP and R-side processing (detectCores is taken from the parallel package):
options(rf.cores = parallel::detectCores(), mc.cores = parallel::detectCores())
As stated above, this options command can be placed in the user's .Rprofile file.
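The environment-variable alternative mentioned above can be sketched in shell as follows, for example in a shell startup file. The variable NCORES is an illustrative name, and the use of getconf to count cores is an assumption of this sketch; it is available on most Linux and Mac OS X systems but is not part of the package.

```shell
# Count the online CPU cores; fall back to 1 if getconf cannot report them
NCORES="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"

# Guard against empty or non-numeric output
case "$NCORES" in
    ''|*[!0-9]*) NCORES=1 ;;
esac

# RF_CORES controls OpenMP (C-side) threading; MC_CORES controls mclapply
export RF_CORES="$NCORES"
export MC_CORES="$NCORES"

echo "RF_CORES=$RF_CORES MC_CORES=$MC_CORES"
```

R sessions started from a shell with these variables exported should pick them up at startup, provided the corresponding options are not set explicitly in the session.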
Some remarks on C-side threading (accessed via OpenMP compilation) versus R-side forking (accessed via mclapply in package parallel):
1. Once the package has been compiled with OpenMP enabled, trees will be grown in parallel using the rf.cores option. Independently of this, we also utilize mclapply to parallelize loops in R-side pre-processing and post-processing of the forest. This is always available and independent of whether the user chooses to compile the package with the OpenMP option enabled.
2. It is important NOT to write programs that fork R processes containing OpenMP threads. That is, one should not use mclapply around the functions rfsrc, predict.rfsrc, vimp.rfsrc, var.select.rfsrc, and find.interaction.rfsrc. In such a scenario, program execution is not guaranteed.
3. Note that options(rf.cores=0) disables C-side threading, and options(mc.cores=1) disables R-side forking. Therefore, setting options(rf.cores=0) is one way to wrap mclapply around the functions listed in item 2 above.
Regular releases of this package are available on CRAN at https://cran.r-project.org/package=randomForestSRC. Interim development builds with bug fixes and sometimes additional functionality are available at https://github.com/kogalur/randomForestSRC. Bugs may be reported via this page. Please provide the accompanying information with any reports:
1. The output of sessionInfo().
2. A minimal reproducible example consisting of the following items:
   a. a minimal dataset necessary to reproduce the error;
   b. the minimal runnable code necessary to reproduce the error, which can be run on the given dataset;
   c. the necessary information on the packages used, the R version, and the system it is run on;
   d. in the case of random processes, a seed (set by set.seed()) for reproducibility.
This package contains many useful functions and users should read the help file in its entirety for details. However, we briefly mention several key functions that may make it easier to navigate and understand the layout of the package.
rfsrc
This is the main entry point to the package. It grows a random forest using user supplied training data. We refer to the resulting object as a RF-SRC grow object. Formally, the resulting object has class (rfsrc, grow).
predict.rfsrc (predict)
Used for prediction. Predicted values are obtained by dropping the user supplied test data down the grow forest. The resulting object has class (rfsrc, predict).
max.subtree, var.select
Used for variable selection. The function max.subtree extracts maximal subtree information from a RF-SRC object, which is used for selecting variables by means of minimal depth variable selection. The function var.select provides an extensive set of variable selection options and is a wrapper to max.subtree.
impute.rfsrc
Fast imputation mode for RF-SRC. Both rfsrc and predict.rfsrc are capable of imputing missing data. However, for users whose only interest is imputing data, this function provides an efficient and fast interface for doing so.
Breiman L. (2001). Random forests. Machine Learning, 45:5-32.
Ishwaran H. and Kogalur U.B. (2007). Random survival forests for R. Rnews, 7(2):25-31.
Ishwaran H., Kogalur U.B., Blackstone E.H. and Lauer M.S. (2008). Random survival forests. Ann. App. Statist., 2:841-860.
Ishwaran H., Kogalur U.B., Gorodeski E.Z., Minn A.J. and Lauer M.S. (2010). High-dimensional variable selection for survival data. J. Amer. Statist. Assoc., 105:205-217.
Ishwaran H., Gerds T.A., Kogalur U.B., Moore R.D., Gange S.J. and Lau B.M. (2014). Random survival forests for competing risks. Biostatistics, 15(4):757-773.
Ishwaran H. (2015). The effect of splitting on random forests. Machine Learning, 99:75-118.
find.interaction, impute.rfsrc, max.subtree, plot.competing.risk, plot.rfsrc, plot.survival, plot.variable, predict.rfsrc, print.rfsrc, rf2rfz, rfsrcSyn, stat.split, var.select, vimp