Rborist: Rapid Decision Tree Construction and Evaluation

Description

Accelerated implementation of the Random Forest (trademarked name) algorithm. Tuned for multicore and GPU hardware. Bindable with most numerical front-end languages in addition to R. Invocation is similar to that provided by the "randomForest" package.

Usage

# S3 method for default
Rborist (x,
                y,
                autoCompress = 0.25,
                ctgCensus = "votes",
                classWeight = NULL,
                maxLeaf = 0,
                minInfo = 0.01,
                minNode = ifelse(is.factor(y), 2, 3),
                nLevel = 0,
                nSamp = 0,
                nThread = 0,
                nTree = 500,
                noValidate = FALSE,
                predFixed = 0,
                predProb = 0.0,
                predWeight = NULL, 
                quantVec = NULL,
                quantiles = !is.null(quantVec),
                regMono = NULL,
                rowWeight = NULL,
                splitQuant = NULL,
                thinLeaves = ifelse(is.factor(y), TRUE, FALSE),
                treeBlock = 1,
                verbose = FALSE,
                withRepl = TRUE,
                ...)

Value

an object of class Rborist, a list containing the following items:

forest

a list containing

forestNode a vector of packed structures expressing splitting predictors, splitting values, successor node deltas and leaf indices.

height a vector of accumulated tree heights within forestNode.

facSplit a vector of splitting factor values.

facHeight a vector of accumulated tree height positions within the splitting factor values.

a list containing either of:

LeafReg a list consisting of regression leaf data:

node a packed structure expressing leaf scores and node counts.

nodeHeight a vector of accumulated tree heights within node.

bagHeight a vector of accumulated bag counts, per tree.

bagSample a vector of packed data structures, one per unique row sample, containing the row index and number of times sampled.

yTrain the training response.

LeafCtg a list consisting of classification leaf data:

node a packed structure expressing leaf scores and node counts.

nodeHeight a vector of accumulated tree heights within node.

bagHeight a vector of accumulated bag counts, per tree.

bagSample a vector of packed data structures, one per unique row sample, containing the row index and number of times sampled.

weight a vector of per-category probabilities, one set for each sampled row.

levels a vector of strings containing the training response levels.

bag

a list consisting of bagged row information:

raw a packed bit matrix indicating whether a given row, tree pair is bagged.

nRow the number of rows employed in training.

nTree the number of trained trees.

rowBytes the row stride, in bytes.

training

a list containing information gleaned during training:

call a string containing the original invocation.

info the information contribution of each predictor.

version the version of the Rborist package.

diag strings containing unspecified diagnostic notes and observations.

validation

a list containing the results of validation, if requested:

ValidReg a list of validation results for regression:

yPred vector containing the predicted response.

mae the mean absolute error of prediction.

mse the mean-square error of prediction.

rsq the r-squared statistic.

qPred matrix containing the prediction quantiles, if requested.

ValidCtg a list of validation results for classification:

yPred vector containing the predicted response.

misprediction vector containing the classwise misprediction rates.

confusion the confusion matrix.

census matrix of predictions, by category.

oobError the out-of-bag error.

prob matrix of prediction probabilities by category, if requested.
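The summary statistics listed under validation can be pictured with a few lines of base R. This is only a sketch on made-up toy vectors: the variable names are illustrative, and the r-squared definition used here (one minus the ratio of residual to total sum of squares) is an assumption, not a statement of how Rborist computes it internally.

```r
# Toy data, invented for illustration only.
yTrue <- c(3.0, -0.5, 2.0, 7.0)
yPred <- c(2.5,  0.0, 2.0, 8.0)

mae <- mean(abs(yTrue - yPred))   # mean absolute error
mse <- mean((yTrue - yPred)^2)    # mean-square error
# Assumed definition of r-squared: 1 - SSE / SST.
rsq <- 1 - sum((yTrue - yPred)^2) / sum((yTrue - mean(yTrue))^2)

# Classwise misprediction rates derived from a confusion matrix
# (rows: actual category, columns: predicted category).
actual    <- factor(c("a", "a", "b", "b", "b", "c"))
predicted <- factor(c("a", "b", "b", "b", "c", "c"), levels = levels(actual))
confusion <- table(actual, predicted)
misprediction <- 1 - diag(confusion) / rowSums(confusion)
```

On a trained regression object the corresponding values would be read from the validation item, in the same way the examples read rb$validation$qPred.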

Arguments

x: the design matrix expressed as a PreFormat object, as a data.frame object with numeric and/or factor columns or as a numeric matrix.
y: the response (outcome) vector, either numerical or categorical. Row count must conform with x.
autoCompress: plurality above which to compress predictor values.
ctgCensus: report categorical validation by vote or by probability.
classWeight: proportional weighting of classification categories.
maxLeaf: maximum number of leaves in a tree. Zero denotes no limit.
minInfo: information ratio with parent below which node does not split.
minNode: minimum number of distinct row references to split a node.
nLevel: maximum number of tree levels to train. Zero denotes no limit.
nSamp: number of rows to sample, per tree.
nThread: suggests an OpenMP-style thread count. Zero denotes the default processor setting.
nTree: the number of trees to train.
noValidate: whether to train without validation.
predFixed: number of trial predictors for a split (mtry).
predProb: probability of selecting individual predictor as trial splitter.
predWeight: relative weighting of individual predictors as trial splitters.
quantVec: quantile levels to validate.
quantiles: whether to report quantiles at validation.
regMono: signed probability constraint for monotonic regression.
rowWeight: row weighting for initial sampling of tree.
splitQuant: (sub)quantile at which to place cut point for numerical splits.
thinLeaves: bypasses creation of export and quantile state in order to reduce memory footprint.
treeBlock: maximum number of trees to train during a single level (e.g., coprocessor computing).
verbose: indicates whether to output progress of training.
withRepl: whether row sampling is by replacement.
...: not currently used.
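The difference between predFixed and predProb can be sketched in base R. The snippet below is a hedged illustration of the two sampling schemes as described in the argument list, not the package's internal code; in particular, how Rborist combines predWeight with either scheme is an assumption here.

```r
set.seed(42)
nPred <- 6                       # number of predictors (columns of x)

# predFixed-style selection: a fixed number of distinct trial predictors,
# here drawn with relative weights in the manner of predWeight.
predWeight <- c(2, 2, 2, 1, 1, 1)
fixedTrial <- sample(nPred, size = 2, replace = FALSE, prob = predWeight)

# predProb-style selection: each predictor is included independently with
# probability 0.3, so the number of trial predictors varies per node.
probTrial <- which(rbinom(nPred, size = 1, prob = 0.3) == 1)
```

With the fixed scheme every split considers the same number of candidates, whereas the Bernoulli scheme may occasionally select none or all of them.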

Author

Mark Seligman at Suiji.

Examples

if (FALSE) {
  # Regression example:
  nRow <- 5000
  x <- data.frame(replicate(6, rnorm(nRow)))
  y <- with(x, X1^2 + sin(X2) + X3 * X4) # courtesy of S. Welling.

  # Classification example:
  data(iris)

  # Generic invocation:
  rb <- Rborist(x, y)


  # Causes 300 trees to be trained:
  rb <- Rborist(x, y, nTree = 300)


  # Causes rows to be sampled without replacement:
  rb <- Rborist(x, y, withRepl=FALSE)


  # Causes validation census to report class probabilities:
  rb <- Rborist(iris[-5], iris[5], ctgCensus="prob")


  # Applies table-weighting to classification categories:
  rb <- Rborist(iris[-5], iris[5], classWeight = "balance")


  # Weights first category twice as heavily as remaining two:
  rb <- Rborist(iris[-5], iris[5], classWeight = c(2.0, 1.0, 1.0))


  # Does not split nodes when doing so yields less than a 2% gain in
  # information over the parent node:
  rb <- Rborist(x, y, minInfo=0.02)


  # Does not split nodes representing fewer than 10 unique samples:
  rb <- Rborist(x, y, minNode=10)


  # Trains a maximum of 20 levels:
  rb <- Rborist(x, y, nLevel = 20)


  # Trains, but does not perform subsequent validation:
  rb <- Rborist(x, y, noValidate=TRUE)


  # Chooses 500 rows (with replacement) to root each tree.
  rb <- Rborist(x, y, nSamp=500)


  # Chooses 2 predictors as splitting candidates at each node (or
  # fewer, when choices exhausted):
  rb <- Rborist(x, y, predFixed = 2)  


  # Causes each predictor to be selected as a splitting candidate with
  # distribution Bernoulli(0.3):
  rb <- Rborist(x, y, predProb = 0.3) 


  # Causes first three predictors to be selected as splitting candidates
  # twice as often as the other three:
  rb <- Rborist(x, y, predWeight=c(2.0, 2.0, 2.0, 1.0, 1.0, 1.0))


  # Causes (default) quantiles to be computed at validation:
  rb <- Rborist(x, y, quantiles=TRUE)
  qPred <- rb$validation$qPred


  # Causes specified quantiles (deciles) to be computed at validation:
  rb <- Rborist(x, y, quantVec = seq(0.1, 1.0, by = 0.10))
  qPred <- rb$validation$qPred


  # Constrains modelled response to be increasing with respect to X1
  # and decreasing with respect to X5.
  rb <- Rborist(x, y, regMono=c(1.0, 0, 0, 0, -1.0, 0))


  # Causes rows to be sampled with random weighting:
  rb <- Rborist(x, y, rowWeight=runif(nRow))


  # Suppresses creation of detailed leaf information needed for
  # quantile prediction and external tools.
  rb <- Rborist(x, y, thinLeaves = TRUE)


  # Sets splitting position for predictor 1 to far left and predictor
  # 2 to far right, others to default (median) position.
  # Note that R vectors are indexed from 1; assigning to spq[0] has no effect.
  spq <- rep(0.5, ncol(x))
  spq[1] <- 0.0
  spq[2] <- 1.0
  rb <- Rborist(x, y, splitQuant = spq)
  }
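The row-sampling options exercised in the examples (nSamp, withRepl, rowWeight) can be pictured with base R's sample(). This is a minimal sketch of the sampling idea only; the variable names are illustrative and nothing here reflects the package's packed internal representation.

```r
set.seed(7)
nRow  <- 5000
nSamp <- 500

# With replacement: a row may be drawn more than once for a tree.
withReplIdx <- sample(nRow, nSamp, replace = TRUE)

# Without replacement: every drawn row is distinct.
noReplIdx <- sample(nRow, nSamp, replace = FALSE)

# Weighted sampling, in the manner of rowWeight.
rowWeight   <- runif(nRow)
weightedIdx <- sample(nRow, nSamp, replace = TRUE, prob = rowWeight)

# Row index and multiplicity for each unique sampled row, in the spirit
# of the bagSample description.
bagCount <- table(withReplIdx)

# Rows never drawn for this tree are out-of-bag.
oob <- setdiff(seq_len(nRow), withReplIdx)
```

Out-of-bag rows are the ones available for validating the tree, which is why the bag item records, per row and tree, whether the row was sampled.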
