divfor: Construct a basic diversity forest prediction rule that uses univariable, binary splitting.

Description

Implements the most basic form of diversity forests that uses univariable, binary splitting. Currently, categorical, metric, and survival outcomes are supported.

Usage

divfor(
  formula = NULL,
  data = NULL,
  num.trees = 500,
  mtry = NULL,
  importance = "none",
  write.forest = TRUE,
  probability = FALSE,
  min.node.size = NULL,
  max.depth = NULL,
  replace = TRUE,
  sample.fraction = ifelse(replace, 1, 0.632),
  case.weights = NULL,
  class.weights = NULL,
  splitrule = NULL,
  num.random.splits = 1,
  alpha = 0.5,
  minprop = 0.1,
  split.select.weights = NULL,
  always.split.variables = NULL,
  respect.unordered.factors = NULL,
  scale.permutation.importance = FALSE,
  keep.inbag = FALSE,
  inbag = NULL,
  holdout = FALSE,
  quantreg = FALSE,
  oob.error = TRUE,
  num.threads = NULL,
  save.memory = FALSE,
  verbose = TRUE,
  seed = NULL,
  dependent.variable.name = NULL,
  status.variable.name = NULL,
  classification = NULL,
  nsplits = 30,
  proptry = 1
)

Value

Object of class divfor with elements

forest: Saved forest (if write.forest is set to TRUE). Note that the variable IDs in the split.varIDs object do not necessarily represent the column number in R.
predictions: Predicted classes/values, based on out-of-bag samples (classification and regression only).
variable.importance: Variable importance for each independent variable.
prediction.error: Overall out-of-bag prediction error. For classification this is the fraction of misclassified samples, for probability estimation the Brier score, for regression the mean squared error, and for survival one minus Harrell's C-index.
r.squared: R squared. Also called explained variance or coefficient of determination (regression only). Computed on out-of-bag data.
confusion.matrix: Contingency table for classes and predictions based on out-of-bag samples (classification only).
unique.death.times: Unique death times (survival only).
chf: Estimated cumulative hazard function for each sample (survival only).
survival: Estimated survival function for each sample (survival only).
call: Function call.
num.trees: Number of trees.
num.independent.variables: Number of independent variables.
min.node.size: Value of minimal node size used.
treetype: Type of forest/tree. classification, regression or survival.
importance.mode: Importance mode used.
num.samples: Number of samples.
splitrule: Splitting rule.
replace: Sample with replacement.
nsplits: Value of nsplits used.
proptry: Value of proptry used.
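
As a rough illustration of how the classification entries above relate to each other, the following base R sketch computes a contingency table and the fraction of misclassified samples from made-up labels. The vectors truth and pred and the names conf_mat and misclass are hypothetical and are not produced by divfor; the row and column orientation of the table is also an assumption made only for this sketch.

```r
# Made-up labels for illustration only; not output of divfor()
truth <- factor(c("a", "a", "b", "b", "b", "c"))
pred  <- factor(c("a", "b", "b", "b", "c", "c"), levels = levels(truth))

# Contingency table of true versus predicted classes
conf_mat <- table(true = truth, predicted = pred)

# Fraction of misclassified samples: 1 minus the share on the diagonal
misclass <- 1 - sum(diag(conf_mat)) / sum(conf_mat)
misclass
```

In a fitted object, the corresponding quantities would be based on out-of-bag predictions rather than on a fixed prediction vector as used here.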

Arguments

formula: Object of class formula or character describing the model to fit. Interaction terms supported only for numerical variables.
data: Training data of class data.frame, matrix, dgCMatrix (Matrix) or gwaa.data (GenABEL).
num.trees: Number of trees. Default is 500.
mtry: Artefact from 'ranger'. NOT needed for diversity forests.
importance: Variable importance mode, one of 'none', 'impurity', 'impurity_corrected', 'permutation'. The 'impurity' measure is the Gini index for classification, the variance of the responses for regression and the sum of test statistics (see splitrule) for survival. NOTE: Currently, only "permutation" (and "none") work for diversity forests.
write.forest: Save divfor.forest object, required for prediction. Set to FALSE to reduce memory usage if no prediction intended.
probability: Grow a probability forest as in Malley et al. (2012). NOTE: Not yet implemented for diversity forests!
min.node.size: Minimal node size. Default 1 for classification, 5 for regression, 3 for survival, and 5 for probability.
max.depth: Maximal tree depth. A value of NULL or 0 (the default) corresponds to unlimited depth, 1 to tree stumps (1 split per tree).
replace: Sample with replacement.
sample.fraction: Fraction of observations to sample. Default is 1 for sampling with replacement and 0.632 for sampling without replacement. For classification, this can be a vector of class-specific values.
case.weights: Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees.
class.weights: Weights for the outcome classes (in order of the factor levels) in the splitting rule (cost sensitive learning). Classification and probability prediction only. For classification the weights are also applied in the majority vote in terminal nodes.
splitrule: Splitting rule. For classification and probability estimation "gini" or "extratrees" with default "gini". For regression "variance", "extratrees" or "maxstat" with default "variance". For survival "logrank", "extratrees", "C" or "maxstat" with default "logrank". NOTE: For diversity forests currently only the default splitting rules are supported.
num.random.splits: Artefact from 'ranger'. NOT needed for diversity forests.
alpha: For "maxstat" splitrule: Significance threshold to allow splitting. NOT needed for diversity forests.
minprop: For "maxstat" splitrule: Lower quantile of covariate distribution to be considered for splitting. NOT needed for diversity forests.
split.select.weights: Numeric vector with weights between 0 and 1, representing the probability to select variables for splitting. Alternatively, a list of size num.trees, containing split select weight vectors for each tree can be used.
always.split.variables: Currently not usable. Character vector with variable names to be always selected.
respect.unordered.factors: Handling of unordered factor covariates. One of 'ignore' and 'order' (the option 'partition' possible in 'ranger' is not (yet) possible with diversity forests). Default is 'ignore'. Alternatively TRUE (='order') or FALSE (='ignore') can be used.
scale.permutation.importance: Scale permutation importance by standard error as in Breiman (2001). Only applicable if permutation variable importance mode is selected.
keep.inbag: Save how often observations are in-bag in each tree.
inbag: Manually set observations per tree. List of size num.trees, containing inbag counts for each observation. Can be used for stratified sampling.
holdout: Hold-out mode. Hold out all samples with case weight 0 and use these for variable importance and prediction error.
quantreg: Prepare quantile prediction as in quantile regression forests (Meinshausen 2006). Regression only. Set keep.inbag = TRUE to prepare out-of-bag quantile prediction.
oob.error: Compute OOB prediction error. Set to FALSE to save computation time, e.g. for large survival forests.
num.threads: Number of threads. Default is number of CPUs available.
save.memory: Use memory saving (but slower) splitting mode. No effect for survival and GWAS data. Warning: This option slows down tree growing; use it only if you encounter memory problems. NOT needed for diversity forests.
verbose: Show computation status and estimated runtime.
seed: Random seed. Default is NULL, which generates the seed from R. Set to 0 to ignore the R seed.
dependent.variable.name: Name of outcome variable, needed if no formula given. For survival forests this is the time variable.
status.variable.name: Name of status variable, only applicable to survival data and needed if no formula given. Use 1 for event and 0 for censoring.
classification: Only needed if data is a matrix. Set to TRUE to grow a classification forest.
nsplits: Number of candidate splits to sample for each split. Default is 30.
proptry: Parameter that restricts the number of candidate splits considered for small nodes. If nsplits is larger than proptry times the number of all possible splits, the number of candidate splits to draw is reduced to the largest integer smaller than proptry times the number of all possible splits. Default is 1, which corresponds to always using nsplits candidate splits.
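
The interaction between nsplits and proptry described above can be sketched in base R. The helper n_candidate_splits and its argument n_possible (the number of all possible splits in a node) are hypothetical names used only for illustration and are not part of the package. The sketch also assumes that "largest integer smaller than" can be approximated by floor(), which differs from the literal wording when proptry times the number of possible splits is itself an integer.

```r
# Hypothetical helper, not part of diversityForest: illustrates how
# nsplits and proptry interact as described for the proptry argument.
n_candidate_splits <- function(nsplits, proptry, n_possible) {
  cap <- proptry * n_possible
  if (nsplits > cap) {
    # Assumption: the reduced number of candidate splits is floor(cap)
    floor(cap)
  } else {
    # nsplits does not exceed the cap, so it is used unchanged
    nsplits
  }
}

# Large node: 30 does not exceed 1 * 100, so nsplits is kept
n_candidate_splits(nsplits = 30, proptry = 1, n_possible = 100)

# Small node: 30 exceeds 0.5 * 40 = 20, so the number is reduced
n_candidate_splits(nsplits = 30, proptry = 0.5, n_possible = 40)

# Same node with a smaller nsplits: 10 does not exceed 20, so it is kept
n_candidate_splits(nsplits = 10, proptry = 0.5, n_possible = 40)
```

Under these assumptions, smaller values of proptry only affect nodes in which few splits are possible; in large nodes the number of candidate splits stays at nsplits.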

Author

Roman Hornung, Marvin N. Wright

References

Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, doi:10.1007/s42979-021-00920-1.
Wright, M. N., Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77:1-17, doi:10.18637/jss.v077.i01.
Breiman, L. (2001). Random forests. Machine Learning 45:5-32, doi:10.1023/A:1010933404324.
Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G., & Ziegler, A. (2012). Probability machines: consistent probability estimation using nonparametric learning machines. Methods of Information in Medicine 51:74-81, doi:10.3414/ME00-01-0052.
Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research 7:983-999.

Examples

if (FALSE) {

## Load package:
library("diversityForest")

## Set seed to obtain reproducible results:
set.seed(1234)

## Diversity forest with default settings (NOT recommended)
# Classification:
divfor(Species ~ ., data = iris, num.trees = 20)
# Regression:
iris2 <- iris; iris2$Species <- NULL; iris2$Y <- rnorm(nrow(iris2))
divfor(Y ~ ., data = iris2, num.trees = 20)
# Survival:
library("survival")
divfor(Surv(time, status) ~ ., data = veteran, num.trees = 20, respect.unordered.factors = "order")
# NOTE: num.trees = 20 is too small for practical
# purposes - the prediction performance of the resulting
# forest will be suboptimal!
# In practice, num.trees = 500 (default value) or a
# larger number should be used.

## Diversity forest with specified values for nsplits and proptry (NOT recommended)
divfor(Species ~ ., data = iris, nsplits = 10, proptry = 0.4, num.trees = 20)
# NOTE again: num.trees = 20 is too small for practical purposes.

## Applying diversity forest after optimizing the values of nsplits and proptry (recommended)
tuneres <- tunedivfor(formula = Species ~ ., data = iris, num.trees.pre = 20)
# NOTE: num.trees.pre = 20 is too small for practical
# purposes - the out-of-bag error estimates of the forests
# constructed during optimization will be much too variable!
# In practice, num.trees.pre = 500 (default value) or a
# larger number should be used.
divfor(Species ~ ., data = iris, nsplits = tuneres$nsplitsopt, 
  proptry = tuneres$proptryopt, num.trees = 20)
# NOTE again: num.trees = 20 is too small for practical purposes.

## Prediction
train.idx <- sample(nrow(iris), 2/3 * nrow(iris))
iris.train <- iris[train.idx, ]
iris.test <- iris[-train.idx, ]
tuneres <- tunedivfor(formula = Species ~ ., data = iris.train, num.trees.pre = 20)
# NOTE again: num.trees.pre = 20 is too small for practical purposes.
rg.iris <- divfor(Species ~ ., data = iris.train, nsplits = tuneres$nsplitsopt, 
  proptry = tuneres$proptryopt, num.trees = 20)
# NOTE again: num.trees = 20 is too small for practical purposes.
pred.iris <- predict(rg.iris, data = iris.test)
table(iris.test$Species, pred.iris$predictions)

## Variable importance
rg.iris <- divfor(Species ~ ., data = iris, importance = "permutation", num.trees = 20)
# NOTE again: num.trees = 20 is too small for practical purposes.
rg.iris$variable.importance
}