bigrfc: Build a Classification Random Forest Model

Description

Build a classification random forest model using Leo Breiman and Adele Cutler's algorithm, with enhancements for large data sets. This implementation uses the bigmemory package for disk-based caching while trees are grown, and the foreach package to parallelize the tree-growing process.

Usage

bigrfc(x, y, ntrees = 50L, varselect = NULL, varnlevels = NULL, nsplitvar = round(sqrt(ifelse(is.null(varselect), ncol(x), length(varselect)))), maxeslevels = 11L, nrandsplit = 1023L, maxndsize = 1L, yclasswts = NULL, printerrfreq = 10L, printclserr = TRUE, cachepath = tempdir(), trace = 0L)

Arguments

x

A big.matrix, matrix or data.frame of predictor variables. If a matrix or data.frame is specified, it will be converted into a big.matrix for computation.

y

An integer or factor vector of response values.

ntrees

The number of trees to be grown in the forest, or 0 to build an empty forest to which trees can be added using grow. Default: 50.

varselect

An integer vector specifying which columns in x to use. If not specified, all variables will be used.

varnlevels

An integer vector with elements specifying the number of levels in the corresponding variables in use, or 0 for numeric variables. Used only when x does not contain levels information (i.e. x is a matrix or big.matrix). If x is a data.frame, varnlevels will be inferred from x. If x is not a data.frame and varnlevels is NULL, all variables will be treated as numeric. If all columns of x are used, varnlevels should have as many elements as there are columns of x. But if varselect is specified, then varnlevels and varselect should be of the same length.
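When x is supplied as a plain matrix, the level counts have to be passed explicitly. The following is a minimal base R sketch of how such a vector could be derived from a data.frame, relying on nlevels() returning 0 for non-factor columns. The data frame `df` and its column names are illustrative only and are not part of bigrf.

```r
# Illustrative data: one numeric column and two factor columns.
df <- data.frame(
  weight = c(2.5, 3.1, 4.0, 3.3),
  colour = factor(c("red", "blue", "red", "green")),
  doors  = factor(c("2", "4", "4", "2"))
)

# nlevels() returns 0 for non-factor columns, which matches the
# convention described above (0 = numeric variable).
varnlevels <- vapply(df, nlevels, integer(1))

# data.matrix() converts factor columns to their integer codes,
# giving a numeric matrix with the same column order as df.
xmat <- data.matrix(df)
```

If only some columns are used via varselect, the vector would presumably be subset to match, for example `varnlevels[varselect]`, so that both have the same length.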

nsplitvar

The number of variables to split on at each node. Default: If varselect is specified, the square root of the number of variables specified; otherwise, the square root of the number of columns of x.
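The default can be reproduced outside the function to see what value it yields for a given data set. The helper name `default_nsplitvar` below is hypothetical; its body copies the expression from the Usage section, with the column count passed in directly instead of being read from x.

```r
# Hypothetical helper mirroring the default expression for nsplitvar.
default_nsplitvar <- function(ncolx, varselect = NULL) {
  round(sqrt(ifelse(is.null(varselect), ncolx, length(varselect))))
}

# All 27 columns in use: round(sqrt(27)).
all_cols <- default_nsplitvar(27)

# Only columns 4 to 22 (19 variables) in use: round(sqrt(19)).
selected <- default_nsplitvar(27, varselect = 4:22)
```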

maxeslevels

Maximum number of levels for categorical variables for which exhaustive search of possible splits will be performed. Default: 11. This will amount to searching (2 ^ (11 - 1)) - 1 = 1,023 splits.
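The count quoted above follows from the number of distinct ways to divide k levels into two non-empty groups, 2^(k - 1) - 1. A short sketch of the arithmetic, using a hypothetical helper `n_binary_splits`, shows how quickly it grows:

```r
# Number of distinct two-way partitions of k categorical levels.
n_binary_splits <- function(k) 2^(k - 1) - 1

at_default <- n_binary_splits(11)   # the default maxeslevels
small      <- n_binary_splits(2:4)  # 2, 3 and 4 levels
at_twenty  <- n_binary_splits(20)   # 20 levels
```

This growth is why variables with more levels than maxeslevels are handled with a fixed number of random splits instead of an exhaustive search.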

nrandsplit

Number of random splits to examine for categorical variables with more than maxeslevels levels. Default: 1,023.

maxndsize

Maximum number of examples in each node when growing the trees. Nodes will be split if they have more than this number of examples. Default: 1.

yclasswts

A numeric vector of class weights, or NULL if all classes should be weighted equally.
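One common way to choose class weights for imbalanced data is inverse class frequency. The sketch below computes such weights in base R. It assumes, without confirmation from this page, that the weights should be given in the order of levels(y). The response vector is illustrative.

```r
# Illustrative imbalanced response: 3 "a", 2 "b", 1 "c".
y <- factor(c("a", "a", "a", "b", "b", "c"))

# Inverse-frequency weights, scaled so that a perfectly balanced
# response would give every class a weight of 1.
counts <- table(y)
yclasswts <- as.numeric(length(y) / (nlevels(y) * counts))

# Assumption: element i of yclasswts corresponds to levels(y)[i].
names(yclasswts) <- levels(y)
```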

printerrfreq

An integer, specifying how often error estimates should be printed to the screen. Default: error estimates will be printed every 10 trees.

printclserr

TRUE for error estimates for individual classes to be printed, in addition to the overall error estimates. Default: TRUE.

cachepath

Path to the folder where data caches used in building the forest are stored. If NULL, the big.matrix objects are created in memory with no disk caching, which is suitable for small data sets. If caching is used, some of the cached files can be reused by other methods such as varimp, shortening their initialization time. To reuse the cached files in this way, use a folder other than tempdir(), as the operating system may automatically delete cache files in tempdir(). Default: tempdir().

trace

0 for no verbose output. 1 to print verbose output on growing of trees. 2 to print more verbose output on processing of individual nodes. Default: 0. Due to the way %dopar% handles the output of the tree-growing iterations, you may not see the verbose output in some GUIs like RStudio. For best results, run R from the command line in order to see all the verbose output.

Value

An object of class "bigcforest" containing the specified number of trees, which are objects of class "bigctree".

References

Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.

Breiman, L. & Cutler, A. (n.d.). Random Forests. Retrieved from http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm.

Examples


# Classify cars in the Cars93 data set by type (Compact, Large,
# Midsize, Small, Sporty, or Van).

# Load data.
data(Cars93, package="MASS")
x <- Cars93
y <- Cars93$Type

# Select variables with which to train model.
vars <- c(4:22)

# Run model, grow 30 trees.
forest <- bigrfc(x, y, ntrees=30L, varselect=vars, cachepath=NULL)