SuperLearner (version 2.0-26)

SL.extraTrees: extraTrees SuperLearner wrapper

Description

Wrapper for the extraTrees package, which implements Extremely Randomized Trees, a variant of random forest, for use as a SuperLearner library algorithm.

Usage

SL.extraTrees(Y, X, newX, family, obsWeights, id, ntree = 500, mtry = if
  (family$family == "gaussian") max(floor(ncol(X)/3), 1) else
  floor(sqrt(ncol(X))), nodesize = if (family$family == "gaussian") 5 else 1,
  numRandomCuts = 1, evenCuts = FALSE, numThreads = 1, quantile = FALSE,
  subsetSizes = NULL, subsetGroups = NULL, tasks = NULL,
  probOfTaskCuts = mtry/ncol(X), numRandomTaskCuts = 1, verbose = FALSE,
  ...)

Arguments

Y

Outcome variable

X

Covariate dataframe

newX

Optional dataframe of new observations for which to predict the outcome

family

"gaussian" for regression, "binomial" for binary classification.

obsWeights

Optional observation-level weights (supported but not tested)

id

Optional id to group observations from the same unit (not currently used).

ntree

Number of trees (default 500).

mtry

Number of features tested at each node. Default is ncol(X) / 3 for regression and sqrt(ncol(X)) for classification.

nodesize

The size of tree leaves (minimum node size). Default is 5 for regression and 1 for classification.

numRandomCuts

The number of random cuts for each (randomly chosen) feature (default 1, which corresponds to the official ExtraTrees method). The higher the number of cuts, the higher the chance of a good cut.

evenCuts

If FALSE (default), cutting thresholds are uniformly sampled. If TRUE, the range is split into even intervals (the number of intervals is numRandomCuts) and a cut is uniformly sampled from each interval.

numThreads

The number of CPU threads to use (default 1).

quantile

If TRUE, quantile regression is performed (default FALSE); available only for regression data. Then use predict(et, newdata, quantile = k) to predict the k-th quantile.

subsetSizes

Subset size (a single integer) or subset sizes (a vector of integers; requires subsetGroups). If supplied, every tree is built from a random subset of size subsetSizes. NULL (default) means no subsetting, i.e. all samples are used.

subsetGroups

List specifying the subset group for each sample: from the samples in group g, each tree will randomly select subsetSizes[g] samples.

tasks

Vector of tasks, integers from 1 and up; NULL if no multi-task learning. (untested)

probOfTaskCuts

Probability of performing a task cut at a node (default mtry / ncol(X)). Used only if tasks is specified. (untested)

numRandomTaskCuts

Number of times a task cut is performed at a node (default 1). Used only if tasks is specified. (untested)

verbose

Verbosity of model fitting.

...

Any additional arguments (not currently supported).

Details

If Java runs out of memory (java.lang.OutOfMemoryError: Java heap space) and you have free memory available, you can increase the Java heap size by setting options(java.parameters = "-Xmx2g") before calling library(extraTrees).
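For instance, a minimal sketch (the 2 GB value "-Xmx2g" is just an example; adjust the size to your machine):

```r
# Raise the Java heap limit *before* extraTrees (and its rJava
# dependency) is loaded; changing this option afterwards has no effect.
options(java.parameters = "-Xmx2g")  # example size; adjust as needed
library(extraTrees)
```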

References

Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine learning, 63(1), 3-42.

Simm, J., de Abril, I. M., & Sugiyama, M. (2014). Tree-based ensemble multi-task learning method for classification and regression. IEICE TRANSACTIONS on Information and Systems, 97(6), 1677-1681.

See Also

extraTrees, predict.SL.extraTrees, predict.extraTrees

Examples

# NOT RUN {
data(Boston, package = "MASS")
Y = Boston$medv
# Remove outcome from covariate dataframe.
X = Boston[, -14]

set.seed(1)

# Sample rows to speed up example.
row_subset = sample(nrow(X), 30)

sl = SuperLearner(Y[row_subset], X[row_subset, ], family = gaussian(),
  cvControl = list(V = 2), SL.library = c("SL.mean", "SL.extraTrees"))

print(sl)

# }
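As a follow-up sketch (assuming the sl, X, and row_subset objects from the example above), predictions for held-out rows can be obtained with predict():

```r
# Predict on rows that were not used for fitting.
# Assumes `sl`, `X`, and `row_subset` from the example above.
holdout = setdiff(seq_len(nrow(X)), row_subset)
pred = predict(sl, newdata = X[holdout, ])
str(pred$pred)  # ensemble (SuperLearner-weighted) predictions
```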
