Learn R Programming

Rborist (version 0.1-1)

Rborist: Rapid Decision Tree Construction and Evaluation

Description

Accelerated implementation of the Random Forest (trademarked name) algorithm. Tuned for multicore and GPU hardware. Bindable with most numerical front-end languages in addtion to R. Invocation is similar to that provided by "randomForest" package.

Usage

"Rborist" (x, y, nTree=500, withRepl = TRUE, ctgCensus = "votes", classWeight = NULL, minInfo = 0.01, minNode = ifelse(is.factor(y), 2, 5), nLevel = 0, noValidate = FALSE, nSamp = 0, predFixed = 0, predProb = 0.0, predWeight = NULL, quantVec = NULL, quantiles = !is.null(quantVec), qBin = 5000, regMono = NULL, rowWeight = NULL, treeBlock = 1, pvtBlock = 8, ...)

Arguments

x
the design matrix expressed as a PreTrain object, as a data.frame object with numeric and/or factor columns or as a numeric matrix.
y
the response (outcome) vector, either numerical or categorical. Row count must conform with x.
nTree
the number of trees to train.
withRepl
whether row sampling is by replacement.
ctgCensus
report categorical validation by vote or by probability.
classWeight
proportional weighting of classification categories.
minInfo
information ratio with parent below which node does not split.
minNode
minimum number of distinct row references to split a node.
nLevel
maximum number of tree levels to train. Zero denotes no limit.
noValidate
whether to train without validation.
nSamp
number of rows to sample, per tree.
predFixed
number of trial predictors for a split (mtry).
predProb
probability of selecting individual predictor as trial splitter.
predWeight
relative weighting of individual predictors as trial splitters.
quantVec
quantile levels to validate.
quantiles
whether to report quantiles at validation.
qBin
bin size for facilating quantiles at large sample count.
regMono
signed probability constraint for monotonic regression.
rowWeight
row weighting for initial sampling of tree.
treeBlock
maximum number of trees to train during a single level (e.g., coprocessor computing).
pvtBlock
maximum number of trees to train in a block (e.g., cluster computing).
...
not currently used.

Value

Rborist, a list containing the following items:
forest
a list containingforestNode a vector of packed structures expressing splitting predictors, splitting values, successor node deltas and leaf indices.origin a vector of tree starting positions within forestNode.facSplit a vector of splitting factor values.facOrigin a vector of tree starting positions within the splitting factor values.
leaf
a list containing either ofBagRow a vector of packed data structures, one per unique row sample, containing a row index and the number of times the row is sampled.LeafReg a list consisting of regression leaf data:node a packed structure expressing leaf scores and node counts.origin a vector of tree starting positions within node.rank a vector, one element per unique row sample, indicating the rank of the sampled row.yRanked a sorted vector of response values.or LeafCtg a list consisting of classification leaf data:node a packed structure expressing leaf scores and node counts.origin a vector of tree starting positions within node.weight a matrix of weighted scores, per category per sample.levels a vector of strings containing the training response levels.
signature
a list consisting of training information needed for separate testing and prediction:nRow the number of rows used to train.predMap a vector mapping core predictor indices back to their respective front-end positions.level a vector of strings vectors representing the training response factor levels.
training
a list containing information gleaned during training:predInfo the information contribution of each predictor.
validation
a list containing the results of validation:ValidReg a list of validation results for regression:yPred a vector containing the predicted response.mse the mean-square error of prediction.rsq the r-squared statistic.qPred a matrix containing the prediction quantiles, if requested.ValidCtg a list of validation results for classification:yPred a vector containing the predicted response.misprediction a vector containing the classwise misprediction rates.confusion a confusion matrix.census a matrix of predictions, by category.prob a matrix of prediction probabilities by category, if requested.

Examples

Run this code
## Not run: 
#   # Regression example:
#   nRow <- 5000
#   x <- data.frame(replicate(6, rnorm(nRow)))
#   y <- with(x, X1^2 + sin(X2) + X3 * X4) # courtesy of S. Welling.
# 
#   # Classification example:
#   data(iris)
# 
#   # Generic invocation:
#   rb <- Rborist(x, y)
# 
# 
#   # Causes 300 trees to be trained:
#   rb <- Rborist(x, y, nTree = 300)
# 
# 
#   # Causes rows to be sampled without replacement:
#   rb <- Rborist(x, y, withRepl=FALSE)
# 
# 
#   # Causes validation census to report class probabilities:
#   rb <- Rborist(iris[-5], iris[5], ctgCensus="prob")
# 
# 
#   # Applies table-weighting to classification categories:
#   rb <- Rborist(iris[-5], iris[5], classWeight = "balance")
# 
# 
#   # Weights first category twice as heavily as remaining two:
#   rb <- Rborist(iris[-5], iris[5], classWeight = c(2.0, 1.0, 1.0))
# 
# 
#   # Does not split nodes when doing so yields less than a 2% gain in
#   # information over the parent node:
#   rb <- Rborist(x, y, minInfo=0.02)
# 
# 
#   # Does not split nodes representing fewer than 10 unique samples:
#   rb <- Rborist(x, y, minNode=10)
# 
# 
#   # Trains a maximum of 20 levels:
#   rb <- Rborist(x, y, nLevel = 20)
# 
# 
#   # Trains, but does not perform subsequent validation:
#   rb <- Rborist(x, y, noValidate=TRUE)
# 
# 
#   # Chooses 500 rows (with replacement) to root each tree.
#   rb <- Rborist(x, y, nSamp=500)
# 
# 
#   # Chooses 2 predictors as splitting candidates at each node (or
#   # fewer, when choices exhausted):
#   rb <- Rborist(x, y, predFixed = 2)  
# 
# 
#   # Causes each predictor to be selected as a splitting candidate with
#   # distribution Bernoulli(0.3):
#   rb <- Rborist(x, y, predProb = 0.3) 
# 
# 
#   # Causes first three predictors to be selected as splitting candidates
#   # twice as often as the other two:
#   rb <- Rborist(x, y, predWeight=c(2.0, 2.0, 2.0, 1.0, 1.0))
# 
# 
#   # Causes (default) quantiles to be computed at validation:
#   rb <- Rborist(x, y, quantiles=TRUE)
#   qPred <- rb$validation$qPred
# 
# 
#   # Causes specfied quantiles (deciles) to be computed at validation:
#   rb <- Rborist(x, y, quantVec = seq(0.1, 1.0, by = 0.10))
#   qPred <- rb$validation$qPred
# 
# 
#   # Causes (default) quantile computation to be approximated by a
#   # small bin size of 100:  fast, but not as accurate:
#   rb <- Rborist(x, y, quantiles = TRUE, qBin = 100).
#   qPred <- rb$validation$qPred
# 
# 
#   # Constrains modelled response to be increasing with respect to X1
#   # and decreasing with respect to X5 (both unlikely).
#   rb <- Rborist(x, y, regMono=c(1.0, 0, 0, 0, -1.0, 0))
# 
# 
#   # Causes rows to be sampled with random weighting:
#   rb <- Rborist(x, y, rowWeight=runif(nRow))
# 
#   ## End(Not run)

Run the code above in your browser using DataLab