Rborist (version 0.2-3)

Rborist: Rapid Decision Tree Construction and Evaluation

Description

Accelerated implementation of the Random Forest (trademarked name) algorithm. Tuned for multicore and GPU hardware. Bindable with most numerical front-end languages in addition to R. Invocation is similar to that provided by the "randomForest" package.

Usage

# S3 method for default
Rborist (x,
                y,
                autoCompress = 0.25,
                ctgCensus = "votes",
                classWeight = NULL,
                maxLeaf = 0,
                minInfo = 0.01,
                minNode = ifelse(is.factor(y), 2, 3),
                nLevel = 0,
                nSamp = 0,
                nThread = 0,
                nTree = 500,
                noValidate = FALSE,
                predFixed = 0,
                predProb = 0.0,
                predWeight = NULL, 
                quantVec = NULL,
                quantiles = !is.null(quantVec),
                regMono = NULL,
                rowWeight = NULL,
                splitQuant = NULL,
                thinLeaves = ifelse(is.factor(y), TRUE, FALSE),
                treeBlock = 1,
                verbose = FALSE,
                withRepl = TRUE,
                ...)

Value

an object of class Rborist, a list containing the following items:

forest

a list containing

forestNode a vector of packed structures expressing splitting predictors, splitting values, successor node deltas and leaf indices.

height a vector of accumulated tree heights within forestNode.

facSplit a vector of splitting factor values.

facHeight a vector of accumulated tree heights within the splitting factor values.

leaf

a list containing either of:

LeafReg a list consisting of regression leaf data:

node a packed structure expressing leaf scores and node counts.

nodeHeight a vector of accumulated tree heights within node.

bagHeight a vector of accumulated bag counts, per tree.

bagSample a vector of packed data structures, one per unique row sample, containing the row index and number of times sampled.

yTrain the training response.

or

LeafCtg a list consisting of classification leaf data:

node a packed structure expressing leaf scores and node counts.

nodeHeight a vector of accumulated tree heights within node.

bagHeight a vector of accumulated bag counts, per tree.

bagSample a vector of packed data structures, one per unique row sample, containing the row index and number of times sampled.

weight a vector of per-category probabilities, one set for each sampled row.

levels a vector of strings containing the training response levels.

bag

a list consisting of bagged row information:

raw a packed bit matrix indicating whether a given row, tree pair is bagged.

nRow the number of rows employed in training.

nTree the number of trained trees.

rowBytes the row stride, in bytes.

training

a list containing information gleaned during training:

call a string containing the original invocation.

info the information contribution of each predictor.

version the version of the Rborist package.

diag strings containing unspecified diagnostic notes and observations.

validation

a list containing the results of validation, if requested:

ValidReg a list of validation results for regression:

yPred vector containing the predicted response.

mae the mean absolute error of prediction.

mse the mean-square error of prediction.

rsq the r-squared statistic.

qPred matrix containing the prediction quantiles, if requested.

ValidCtg list of validation results for classification:

yPred vector containing the predicted response.

misprediction vector containing the classwise misprediction rates.

confusion the confusion matrix.

census matrix of predictions, by category.

oobError the out-of-bag error.

prob matrix of prediction probabilities by category, if requested.
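
The regression and classification summaries listed above can be reproduced by hand from a vector of predictions. The base-R sketch below is illustrative only: the helper names regMetrics and ctgMetrics are hypothetical, and the formulas (in particular r-squared as one minus the ratio of residual to total sum of squares, and the row/column orientation of the confusion matrix) are assumed conventional definitions, not a statement of how the package computes them.

```r
# Hypothetical helpers for illustration; not part of the Rborist package.
regMetrics <- function(y, yPred) {
  resid <- y - yPred
  list(mae = mean(abs(resid)),
       mse = mean(resid^2),
       # Assumed definition: 1 - (residual sum of squares / total sum of squares).
       rsq = 1 - sum(resid^2) / sum((y - mean(y))^2))
}

ctgMetrics <- function(y, yPred) {
  # Align prediction levels with the response levels before tabulating.
  yPred <- factor(yPred, levels = levels(y))
  confusion <- table(actual = y, predicted = yPred)
  # Classwise misprediction: share of each actual class predicted wrongly.
  misprediction <- 1 - diag(confusion) / rowSums(confusion)
  list(confusion = confusion,
       misprediction = misprediction,
       error = mean(y != yPred))
}

# Toy regression data: three exact predictions and one miss.
reg <- regMetrics(c(1, 2, 3, 4), c(1, 2, 3, 6))

# Toy classification data: one of the two "a" rows is mispredicted.
cls <- factor(c("a", "a", "b", "b"))
ctg <- ctgMetrics(cls, c("a", "b", "b", "b"))
```

Comparing such hand-computed values against the fields of a validation list is one way to check which definitions a given package version uses.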

Arguments

x

the design matrix expressed as a PreFormat object, as a data.frame object with numeric and/or factor columns or as a numeric matrix.

y

the response (outcome) vector, either numerical or categorical. Row count must conform with x.
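
The constraints on x and y can be checked before training. The sketch below uses a hypothetical helper, checkDesign, that is not part of the package; it covers only the data.frame and numeric matrix forms of x described above (PreFormat objects are not handled) and assumes that "conform" means one response entry per row of x.

```r
# Hypothetical pre-flight check; not part of the Rborist package.
checkDesign <- function(x, y) {
  okFrame <- is.data.frame(x) &&
    all(vapply(x, function(col) is.numeric(col) || is.factor(col), logical(1)))
  okMatrix <- is.matrix(x) && is.numeric(x)
  if (!okFrame && !okMatrix)
    stop("x must be a numeric matrix or a data.frame of numeric/factor columns")
  if (!is.numeric(y) && !is.factor(y))
    stop("y must be a numeric vector or a factor")
  if (length(y) != nrow(x))
    stop("y must have one entry per row of x")
  invisible(TRUE)
}

# A factor response extracted as a vector passes the check.
okIris <- checkDesign(iris[-5], iris$Species)

# A one-column data.frame is not a vector and is rejected by this sketch.
badIris <- try(checkDesign(iris[-5], iris[5]), silent = TRUE)
```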

autoCompress

plurality above which to compress predictor values.

ctgCensus

report categorical validation by vote or by probability.

classWeight

proportional weighting of classification categories.

maxLeaf

maximum number of leaves in a tree. Zero denotes no limit.

minInfo

information ratio with parent below which node does not split.

minNode

minimum number of distinct row references to split a node.

nLevel

maximum number of tree levels to train. Zero denotes no limit.

nSamp

number of rows to sample, per tree.

nThread

suggests an OpenMP-style thread count. Zero denotes the default processor setting.

nTree

the number of trees to train.

noValidate

whether to train without validation.

predFixed

number of trial predictors for a split (mtry).

predProb

probability of selecting individual predictor as trial splitter.

predWeight

relative weighting of individual predictors as trial splitters.
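
The three predictor-sampling arguments describe different ways of choosing trial splitters. The base-R sketch below only illustrates the distinction between them; it is not the package's internal sampling code, and how predWeight combines with predFixed or predProb inside the package is not specified here, so the weighted draw is an assumption.

```r
set.seed(1)   # for reproducibility of the sketch only
nPred <- 6

# predFixed-style: a fixed number of distinct candidates per split (mtry).
fixedCand <- sample(nPred, size = 2)

# predProb-style: each predictor is included independently with
# probability 0.3, so the candidate count varies from split to split.
probCand <- which(runif(nPred) < 0.3)

# predWeight-style (assumed form): relative weights bias which
# predictors are drawn as candidates.
w <- c(2, 2, 2, 1, 1, 1)
weightedCand <- sample(nPred, size = 2, prob = w / sum(w))
```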

quantVec

quantile levels to validate.

quantiles

whether to report quantiles at validation.

regMono

signed probability constraint for monotonic regression.

rowWeight

row weighting for initial sampling of tree.

splitQuant

(sub)quantile at which to place cut point for numerical splits.
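
One plausible reading of splitQuant is that the cut point is interpolated between the two adjacent predictor values that bound a split. The sketch below assumes that reading; the helper cutPoint and its arguments lower and upper are hypothetical names, and the package may place cut points differently.

```r
# Hypothetical helper: interpolate a cut point between the two adjacent
# predictor values that bound a numerical split.
cutPoint <- function(lower, upper, splitQuant = 0.5) {
  stopifnot(lower <= upper, splitQuant >= 0, splitQuant <= 1)
  lower + splitQuant * (upper - lower)
}

cutMid   <- cutPoint(2, 4)        # default: midway between the bounds
cutLeft  <- cutPoint(2, 4, 0.0)   # far left: coincides with the lower bound
cutRight <- cutPoint(2, 4, 1.0)   # far right: coincides with the upper bound
```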

thinLeaves

bypasses creation of export and quantile state in order to reduce memory footprint.

treeBlock

maximum number of trees to train during a single level (e.g., coprocessor computing).

verbose

indicates whether to output progress of training.

withRepl

whether row sampling is by replacement.
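
The row-sampling arguments nSamp, withRepl and rowWeight correspond closely to the arguments of base R's sample(). The sketch below illustrates that correspondence for a single tree; it is not the package's sampler, and the per-row counts are only an analogy to the (row index, times sampled) pairs described for bagSample.

```r
set.seed(2)   # for reproducibility of the sketch only
nRow <- 10
nSamp <- 10

# With replacement (withRepl = TRUE): a row may be drawn several times.
rowsRepl <- sample(nRow, size = nSamp, replace = TRUE)

# Without replacement (withRepl = FALSE): each row is drawn at most once.
rowsNoRepl <- sample(nRow, size = 5, replace = FALSE)

# Weighted draw, in the spirit of rowWeight.
rowsWeighted <- sample(nRow, size = nSamp, replace = TRUE, prob = runif(nRow))

# Number of times each row was drawn; rows with a count of zero are
# out-of-bag for this tree.
sampleCount <- tabulate(rowsRepl, nbins = nRow)
outOfBag <- which(sampleCount == 0)
```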

...

not currently used.

Author

Mark Seligman at Suiji.

Examples

if (FALSE) {
  # Regression example:
  nRow <- 5000
  x <- data.frame(replicate(6, rnorm(nRow)))
  y <- with(x, X1^2 + sin(X2) + X3 * X4) # courtesy of S. Welling.

  # Classification example:
  data(iris)

  # Generic invocation:
  rb <- Rborist(x, y)


  # Causes 300 trees to be trained:
  rb <- Rborist(x, y, nTree = 300)


  # Causes rows to be sampled without replacement:
  rb <- Rborist(x, y, withRepl=FALSE)


  # Causes validation census to report class probabilities:
  rb <- Rborist(iris[-5], iris$Species, ctgCensus="prob")


  # Applies table-weighting to classification categories:
  rb <- Rborist(iris[-5], iris$Species, classWeight = "balance")


  # Weights first category twice as heavily as remaining two:
  rb <- Rborist(iris[-5], iris$Species, classWeight = c(2.0, 1.0, 1.0))


  # Does not split nodes when doing so yields less than a 2% gain in
  # information over the parent node:
  rb <- Rborist(x, y, minInfo=0.02)


  # Does not split nodes representing fewer than 10 unique samples:
  rb <- Rborist(x, y, minNode=10)


  # Trains a maximum of 20 levels:
  rb <- Rborist(x, y, nLevel = 20)


  # Trains, but does not perform subsequent validation:
  rb <- Rborist(x, y, noValidate=TRUE)


  # Chooses 500 rows (with replacement) to root each tree.
  rb <- Rborist(x, y, nSamp=500)


  # Chooses 2 predictors as splitting candidates at each node (or
  # fewer, when choices exhausted):
  rb <- Rborist(x, y, predFixed = 2)  


  # Causes each predictor to be selected as a splitting candidate with
  # distribution Bernoulli(0.3):
  rb <- Rborist(x, y, predProb = 0.3) 


  # Causes first three predictors to be selected as splitting candidates
  # twice as often as the other three (one weight per column of x):
  rb <- Rborist(x, y, predWeight=c(2.0, 2.0, 2.0, 1.0, 1.0, 1.0))


  # Causes (default) quantiles to be computed at validation:
  rb <- Rborist(x, y, quantiles=TRUE)
  qPred <- rb$validation$qPred


  # Causes specified quantiles (deciles) to be computed at validation:
  rb <- Rborist(x, y, quantVec = seq(0.1, 1.0, by = 0.10))
  qPred <- rb$validation$qPred


  # Constrains modelled response to be increasing with respect to X1
  # and decreasing with respect to X5.
  rb <- Rborist(x, y, regMono=c(1.0, 0, 0, 0, -1.0, 0))


  # Causes rows to be sampled with random weighting:
  rb <- Rborist(x, y, rowWeight=runif(nRow))


  # Suppresses creation of detailed leaf information needed for
  # quantile prediction and external tools.
  rb <- Rborist(x, y, thinLeaves = TRUE)


  # Sets splitting position for the first predictor to far left and the
  # second predictor to far right, others to default (median) position.
  # R vectors are indexed from 1, so spq[0] would assign nothing.

  spq <- rep(0.5, ncol(x))
  spq[1] <- 0.0
  spq[2] <- 1.0
  rb <- Rborist(x, y, splitQuant = spq)
  }
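
After training, the items described under Value can be read directly from the returned list. The sketch below assumes the Rborist package is installed (it does nothing otherwise) and that the field names match those documented above for version 0.2-3; other versions may name or nest these fields differently, in which case the accessors simply return NULL.

```r
# Runs only when the Rborist package is available; otherwise fit stays NULL.
fit <- NULL
if (requireNamespace("Rborist", quietly = TRUE)) {
  set.seed(3)
  x <- data.frame(replicate(6, rnorm(500)))
  y <- with(x, X1^2 + sin(X2) + X3 * X4)
  fit <- Rborist::Rborist(x, y, nTree = 50)

  # Field names follow the Value section for version 0.2-3 (assumed).
  print(fit$validation$mse)   # out-of-bag mean-square error
  print(fit$validation$rsq)   # out-of-bag r-squared
  print(fit$training$info)    # information contribution per predictor
}
```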
