sprint (version 1.0.7)

prandomForest: Parallel random forest generation

Description

The machine learning function prandomForest() is an ensemble tree classifier that constructs a forest of classification trees from bootstrap samples of a dataset in parallel. The random forest algorithm can be used for both classification (a categorical response) and regression (a continuous response). This function provides a parallel equivalent to the serial randomForest() function from the randomForest package.

Note that the randomForest library must be loaded before calling the prandomForest function:

    library("randomForest")

N.B. Please see the SPRINT User Guide for how to run the code in parallel using the mpiexec command.
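
As a rough illustration of the points above, a SPRINT script might look like the sketch below. The file name prf_example.R, the use of the built-in iris data, the launch command in the comment, and the closing pterminate() call are assumptions made for this example rather than details taken from this page; the SPRINT User Guide remains the authority on how to launch and shut down a parallel run.

```r
# Sketch of a SPRINT script, saved under the hypothetical name prf_example.R.
# Assumed launch command (check the SPRINT User Guide for the exact form):
#   mpiexec -n 4 R --no-save -f prf_example.R

# Built-in iris data: four numeric predictors and a factor response.
x <- as.matrix(iris[, 1:4])
y <- iris$Species

# Guard so the sketch does nothing where sprint or randomForest is not installed.
have_pkgs <- nzchar(system.file(package = "sprint")) &&
  nzchar(system.file(package = "randomForest"))

if (have_pkgs) {
  library("randomForest")   # must be loaded before calling prandomForest()
  library("sprint")

  # y is a factor, so classification is assumed.
  fit <- prandomForest(x, y, ntree = 500)

  # Assumption: the result prints like a serial randomForest() fit.
  print(fit)

  pterminate()              # assumption: SPRINT's shutdown call
  quit(save = "no")
}
```

Because every MPI process executes the same script, the data preparation is kept at the top level and only the forest construction goes through prandomForest().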

Usage

prandomForest(x, ...)

"prandomForest"(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
                mtry = if (!is.null(y) && !is.factor(y))
                           max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
                replace=TRUE, classwt=NULL, cutoff, strata,
                sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
                nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
                maxnodes=NULL, importance=FALSE, localImp=FALSE, nPerm=1,
                proximity, oob.prox=proximity, norm.votes=TRUE, do.trace=FALSE,
                keep.forest = !is.null(y) && is.null(xtest),
                corr.bias=FALSE, keep.inbag=FALSE, ...)
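
Several defaults in the signature above are expressions that depend on x, y, xtest and replace. The sketch below evaluates those expressions in plain R to show what they resolve to. The helper prf_defaults() is hypothetical and is not part of sprint or randomForest; it only copies the default expressions from the usage line.

```r
# Hypothetical helper (not part of sprint): evaluates the default expressions
# from the usage signature for a given x, y, xtest and replace.
prf_defaults <- function(x, y = NULL, xtest = NULL, replace = TRUE) {
  regression <- !is.null(y) && !is.factor(y)
  list(
    mtry        = if (regression) max(floor(ncol(x) / 3), 1) else floor(sqrt(ncol(x))),
    sampsize    = if (replace) nrow(x) else ceiling(.632 * nrow(x)),
    nodesize    = if (regression) 5 else 1,
    keep.forest = !is.null(y) && is.null(xtest)
  )
}

# 100 observations of 12 predictors; the values are irrelevant here.
x <- matrix(0, nrow = 100, ncol = 12)

# Factor response: classification defaults, sampling with replacement.
cls <- prf_defaults(x, y = factor(rep(c("a", "b"), 50)))

# Numeric response: regression defaults, sampling without replacement.
reg <- prf_defaults(x, y = seq_len(100) / 10, replace = FALSE)

str(cls)
str(reg)
```

With 12 predictors, the classification case gives mtry = 3, sampsize = 100 and nodesize = 1, while the regression case without replacement gives mtry = 4, sampsize = 64 and nodesize = 5.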

Arguments

x
array of data
...
optional parameters to be passed to the low level function randomForest.default.
y
response vector. If y is a factor, classification is assumed; otherwise regression is assumed. If omitted, prandomForest() will run in unsupervised mode.
xtest
data array of predictors for the test set
ytest
response for the test set
ntree
integer, the number of trees to grow
mtry
integer, the number of variables randomly sampled as candidates at each split. The default value is floor(sqrt(p)) for classification and max(floor(p/3), 1) for regression, where p is the number of variables in the data matrix x.
replace
boolean, whether the sampling of cases is done with or without replacement. The default value is TRUE.
classwt
vector of priors of the classes. The default value is NULL.
cutoff
vector of k elements where k is the number of classes. The winning class for an observation is the one with the maximum ratio of proportion of votes to cutoff. The default value is 1/k.
strata
variable used for stratified sampling
sampsize
size of sample to draw. For classification, if sampsize is a vector of the length of the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.
nodesize
integer, the minimum size of the terminal nodes. The default value is 1 for classification and 5 for regression.
maxnodes
integer, maximum number of terminal nodes allowed for the trees. The default value is NULL.
importance
boolean, whether the importance of predictors is assessed. The default value is FALSE.
localImp
boolean, whether casewise importance measure is to be computed. The default value is FALSE.
nPerm
integer, the number of times the out-of-bag data are permuted per tree for assessing variable importance. The default value is one. Regression only.
proximity
boolean, whether the proximity measure among the rows is to be calculated.
oob.prox
boolean, whether the proximity is to be calculated for out-of-bag data. The default value is set to be the same as the value of the proximity parameter.
norm.votes
boolean, whether the final votes are expressed as fractions (TRUE) or returned as raw vote counts (FALSE). The default value is TRUE. Classification only.
do.trace
boolean, whether a verbose output is produced. The default value is FALSE. If set to an integer i then the output is printed for every i trees.
keep.forest
boolean, whether the forest is returned in the output object. The default value is TRUE when y is supplied and xtest is NULL, and FALSE otherwise.
corr.bias
boolean, whether to perform a bias correction. The default value is FALSE. Regression only.
keep.inbag
boolean, whether the matrix which keeps track of which samples are in-bag in which trees should be returned. The default value is FALSE.
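
The cutoff rule described above (the winning class is the one with the maximum ratio of proportion of votes to cutoff) can be illustrated with a few lines of plain R. The helper winning_class() and the vote counts are hypothetical and made up for this example; the sketch simply takes the first class on a tie, which may differ from how the library resolves ties.

```r
# Hypothetical helper: returns the class with the largest ratio of
# vote fraction to cutoff. The default cutoff is 1/k for each of k classes.
winning_class <- function(votes, cutoff = rep(1 / length(votes), length(votes))) {
  fractions <- votes / sum(votes)
  names(votes)[which.max(fractions / cutoff)]
}

# Made-up raw vote counts for one observation from a 500-tree forest.
votes <- c(a = 300, b = 200)

# The same votes expressed as fractions (the norm.votes = TRUE form).
fractions <- votes / sum(votes)

default_winner <- winning_class(votes)               # equal cutoffs of 1/2
shifted_winner <- winning_class(votes, c(0.7, 0.3))  # higher bar for class a
```

With equal cutoffs the majority class a wins. With cutoffs of 0.7 and 0.3 the ratios become 0.6/0.7 and 0.4/0.3, so class b wins despite receiving fewer votes.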

See Also

randomForest, SPRINT