bigrf (version 0.1-12)

bigrfc: Build a Classification Random Forest Model

Description

Build a classification random forest model using Leo Breiman and Adele Cutler's algorithm, with enhancements for large data sets. This implementation uses the bigmemory package for disk-based caching during growing of trees, and the foreach package to parallelize the tree-growing process.

Usage

bigrfc(x, y, ntrees = 50L, varselect = NULL, varnlevels = NULL,
       nsplitvar = round(sqrt(ifelse(is.null(varselect), ncol(x),
                                     length(varselect)))),
       maxeslevels = 11L, nrandsplit = 1023L, maxndsize = 1L,
       yclasswts = NULL, printerrfreq = 10L, printclserr = TRUE,
       cachepath = tempdir(), trace = 0L)

Arguments

x
A big.matrix, matrix or data.frame of predictor variables. If a matrix or data.frame is specified, it will be converted into a big.matrix for computation.
y
An integer or factor vector of response values (the class labels).
ntrees
The number of trees to be grown in the forest, or 0 to build an empty forest to which trees can be added using grow. Default: 50.
varselect
An integer vector specifying which columns in x to use. If not specified, all variables will be used.
varnlevels
An integer vector giving, for each variable in use, its number of levels, or 0 if the variable is numeric. Used only when x does not contain levels information (i.e. x is a matrix or big.matrix); if x is a data.frame, varnlevels is inferred from x. If x is not a data.frame and varnlevels is NULL, all variables are treated as numeric. varnlevels should have one element per column of x when all columns are used, or the same length as varselect when varselect is specified.
nsplitvar
The number of variables considered as split candidates at each node. Default: If varselect is specified, the square root of the number of variables specified; otherwise, the square root of the number of columns of x. In both cases the result is rounded to the nearest integer.
maxeslevels
Maximum number of levels a categorical variable may have for an exhaustive search of its possible splits to be performed. Default: 11. With 11 levels, this amounts to searching 2 ^ (11 - 1) - 1 = 1,023 splits.
nrandsplit
Number of random splits to examine for categorical variables with more than maxeslevels levels. Default: 1,023.
maxndsize
Maximum number of examples in each node when growing the trees. Nodes will be split if they have more than this number of examples. Default: 1.
yclasswts
A numeric vector of class weights, or NULL if all classes should be weighted equally.
printerrfreq
An integer, specifying how often error estimates should be printed to the screen. Default: error estimates will be printed every 10 trees.
printclserr
If TRUE, error estimates for individual classes are printed in addition to the overall error estimates. Default: TRUE.
cachepath
Path to the folder where data caches used in building the forest are stored. If NULL, the big.matrix objects are created in memory with no disk caching, which is suitable for small data sets. If caching is used, some of the cached files can be reused by other methods such as varimp, shortening their initialization time. To reuse the cached files in this manner, use a folder other than tempdir(), as the operating system may automatically delete any cache files in tempdir(). Default: tempdir().
trace
0 for no verbose output. 1 to print verbose output on growing of trees. 2 to print more verbose output on processing of individual nodes. Default: 0. Due to the way %dopar% handles the output of the tree-growing iterations, you may not see the verbose output in some GUIs like RStudio. For best results, run R from the command line in order to see all the verbose output.
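
The conventions described for varnlevels, nsplitvar and maxeslevels can be illustrated with a small base R sketch. It does not call bigrf; the built-in iris data set merely stands in for a data.frame of predictors, and the helper names default_nsplitvar and n_exhaustive_splits are hypothetical, introduced only for this illustration.

```r
# Base R only; `iris` stands in for any data.frame of predictors.
x_df <- iris

# varnlevels convention from the argument description:
# number of levels for categorical columns, 0 for numeric columns.
varnlevels <- vapply(
  x_df,
  function(col) if (is.factor(col)) nlevels(col) else 0L,
  integer(1)
)
print(varnlevels)

# Default nsplitvar, mirroring the expression in the Usage signature.
# (Hypothetical helper, not part of bigrf.)
default_nsplitvar <- function(x, varselect = NULL) {
  round(sqrt(ifelse(is.null(varselect), ncol(x), length(varselect))))
}
print(default_nsplitvar(x_df))                   # all columns of x_df
print(default_nsplitvar(x_df, varselect = 1:4))  # four selected columns

# Number of splits examined by an exhaustive search over a categorical
# variable with k levels, as in the maxeslevels description.
# (Hypothetical helper, not part of bigrf.)
n_exhaustive_splits <- function(k) 2^(k - 1) - 1
print(n_exhaustive_splits(11))
```

A vector built this way would only need to be passed as varnlevels when x is a matrix or big.matrix, since the description states that levels are inferred automatically from a data.frame.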

Value

An object of class "bigcforest" containing the specified number of trees, which are objects of class "bigctree".

References

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

Breiman, L. & Cutler, A. (n.d.). Random Forests. Retrieved from http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm.

See Also

randomForest, cforest

Examples

# Classify cars in the Cars93 data set by type (Compact, Large,
# Midsize, Small, Sporty, or Van).

# Load data.
data(Cars93, package="MASS")
x <- Cars93
y <- Cars93$Type

# Select variables with which to train model.
vars <- 4:22

# Run model, grow 30 trees.
forest <- bigrfc(x, y, ntrees=30L, varselect=vars, cachepath=NULL)
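
A variant of the example, sketched under a few assumptions: it optionally registers a foreach backend through the doParallel package (one possible backend, chosen here for illustration) and writes the caches to a folder other than tempdir(), as the cachepath description suggests. The folder name bigrf_cache and the core count are hypothetical choices. The model-fitting part only runs if bigrf and MASS are installed.

```r
# Hypothetical cache location; any writable folder outside tempdir() would do.
cache_dir <- file.path(getwd(), "bigrf_cache")
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

# Optional: register a parallel backend for foreach. doParallel is an
# assumption here; other foreach backends could be used instead.
if (requireNamespace("doParallel", quietly = TRUE)) {
  doParallel::registerDoParallel(cores = 2L)
}

if (requireNamespace("bigrf", quietly = TRUE) &&
    requireNamespace("MASS", quietly = TRUE)) {
  library(bigrf)
  data(Cars93, package = "MASS")

  # Same model as above, but with disk caches kept in cache_dir.
  forest <- bigrfc(Cars93, Cars93$Type, ntrees = 30L, varselect = 4:22,
                   cachepath = cache_dir)

  # The Value section states that the result has class "bigcforest".
  print(methods::is(forest, "bigcforest"))
}
```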
