bigrf (version 0.1-12)

bigrf-package: Big Random Forests: Classification and Regression Forests for Large Data Sets

Description

This is an implementation of Leo Breiman and Adele Cutler's Random Forest algorithms for classification and regression, with optimizations for performance and for handling data sets that are too large to be processed in memory. Forests can be built in parallel at two levels. First, trees can be built in parallel on a single machine using foreach. Second, multiple forests can be built in parallel on multiple machines, then merged into one. For large data sets, disk-based big.matrix objects may be used to store data and intermediate computations, which prevents excessive virtual memory swapping by the operating system. Currently, only classification forests are implemented, with a subset of the functionality in Breiman and Cutler's original code. More functionality and regression trees will be added in the future. See the file INSTALL-WINDOWS in the source package for Windows installation instructions.

Performance Optimizations

For better performance, trees may be grown in parallel by registering an appropriate parallel backend for foreach. As an example, the following code uses the doParallel package to enable tree-growing on all available cores on the machine. This code must be executed before calling bigrfc or grow. See foreach for more details on supported parallel backends.
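    # Register a parallel backend for foreach using all detected cores.
    # Run this before calling bigrfc or grow.
    library(doParallel)
    registerDoParallel(cores=detectCores(all.tests=TRUE))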
  
Multiple random forests can also be built in parallel on multiple machines (using the same training data and parameters), then merged into one forest using merge.

For large data sets, the training data, intermediate computations and some outputs (e.g. proximity matrices) may be cached on disk using big.matrix objects. This enables random forests to be built on fairly large data sets without hitting RAM limits, which would otherwise cause excessive virtual memory swapping by the operating system. For smaller data sets, disk caching may be turned off for optimal performance by setting the cachepath argument of the relevant function or method to NULL, which causes the big.matrix objects to be created in memory.
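As a minimal sketch of the merge workflow, assuming the bigrf and MASS packages are installed, the following builds two small forests on the same data with the same parameters and combines them with merge. In practice each forest would be built on a separate machine; both are built in one session here purely for illustration. The Cars93 data and the choice of predictor columns are illustrative assumptions, and cachepath=NULL is used because the data set is small.

```r
# Sketch only: assumes the bigrf and MASS packages are installed.
merged <- NULL
if (requireNamespace("bigrf", quietly = TRUE) &&
    requireNamespace("MASS", quietly = TRUE)) {
  library(bigrf)

  # Illustrative training data: classify cars by type.
  data(Cars93, package = "MASS")
  x <- Cars93
  y <- Cars93$Type
  vars <- 4:22  # predictor columns (illustrative choice)

  # Two forests on the same data with the same parameters.
  # cachepath = NULL creates the big.matrix objects in memory.
  forest1 <- bigrfc(x, y, ntrees = 30L, varselect = vars, cachepath = NULL)
  forest2 <- bigrfc(x, y, ntrees = 30L, varselect = vars, cachepath = NULL)

  # Combine the two forests into a single forest.
  merged <- merge(forest1, forest2)
}
```

For data sets that do not fit comfortably in RAM, cachepath would instead be left at its default or pointed at a directory with enough free disk space.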

Details

    Package:     bigrf
    Version:     0.1-12
    Date:        2015-10-21
    OS_type:     unix
    Depends:     R (>= 2.14), methods, bigmemory (>= 4.5.8)
    Imports:     foreach
    Suggests:    MASS, doParallel
    LinkingTo:   bigmemory, BH
    License:     GPL-3
    Copyright:   2013-2015 Aloysius Lim
    URL:         https://github.com/aloysius-lim/bigrf
    BugReports:  https://github.com/aloysius-lim/bigrf/issues

Index:

    bigcforest-class        Classification Random Forests
    bigcprediction-class    Random Forest Predictions
    bigctree-class          Classification Trees in Random
                            Forests
    bigrf-package           Big Random Forests: Classification
                            and Regression Forests for Large Data
                            Sets
    bigrfc                  Build a Classification Random Forest
                            Model
    bigrfprox-class         Proximity Matrices
    fastimp-methods         Compute Fast (Gini) Variable
                            Importance
    generateSyntheticClass  Generate Synthetic Second Class for
                            Unsupervised Learning
    grow-methods            Grow More Trees in a Random Forest
    interactions-methods    Compute Variable Interactions
    merge-methods           Merge Two Random Forests
    outliers-methods        Compute Outlier Scores
    predict-methods         Predict Classes of Test Examples
    proximities-methods     Compute Proximity Matrix
    scaling-methods         Compute Metric Scaling Co-ordinates
    varimp-methods          Compute Variable Importance
  

The main entry point for this package is bigrfc, which is used to build a classification random forest on the given training data and forest-building parameters. bigrfc returns the forest as an object of class "bigcforest", which contains the trees grown as objects of class "bigctree". After a forest is built, more trees can be grown using grow.
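The following sketch shows that workflow end to end, assuming the bigrf and MASS packages are installed. The Cars93 data, the predictor columns and the tree counts are illustrative choices rather than recommendations.

```r
# Sketch only: assumes the bigrf and MASS packages are installed.
forest <- NULL
if (requireNamespace("bigrf", quietly = TRUE) &&
    requireNamespace("MASS", quietly = TRUE)) {
  library(bigrf)

  # Illustrative training data: classify cars by type.
  data(Cars93, package = "MASS")
  x <- Cars93
  y <- Cars93$Type
  vars <- 4:22  # predictor columns (illustrative choice)

  # Build a classification forest of 30 trees, held in memory.
  forest <- bigrfc(x, y, ntrees = 30L, varselect = vars, cachepath = NULL)

  # Grow 10 more trees in the same forest.
  forest <- grow(forest, x, ntrees = 10L)
}
```

The value returned by grow is assigned back to forest so that later calls work with the enlarged forest.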

References

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

Breiman, L. & Cutler, A. (n.d.). Random Forests. Retrieved from http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm.

See Also

randomForest, cforest