
ranger (version 0.4.0)

ranger: Ranger

Description

Ranger is a fast implementation of Random Forests (Breiman 2001) or recursive partitioning, particularly suited for high-dimensional data. Classification, regression, and survival forests are supported. Classification and regression forests are implemented as in the original Random Forest (Breiman 2001), survival forests as in Random Survival Forests (Ishwaran et al. 2008).

Usage

ranger(formula = NULL, data = NULL, num.trees = 500, mtry = NULL,
  importance = "none", write.forest = FALSE, probability = FALSE,
  min.node.size = NULL, replace = TRUE, sample.fraction = ifelse(replace,
  1, 0.632), splitrule = NULL, case.weights = NULL,
  split.select.weights = NULL, always.split.variables = NULL,
  respect.unordered.factors = FALSE, scale.permutation.importance = FALSE,
  keep.inbag = FALSE, num.threads = NULL, save.memory = FALSE,
  verbose = TRUE, seed = NULL, dependent.variable.name = NULL,
  status.variable.name = NULL, classification = NULL)

Arguments

formula
Object of class formula or character describing the model to fit.
data
Training data of class data.frame, matrix or gwaa.data (GenABEL).
num.trees
Number of trees.
mtry
Number of variables to possibly split at in each node. Default is the (rounded down) square root of the number of variables (see the combined example after this argument list).
importance
Variable importance mode, one of 'none', 'impurity', 'permutation'. The 'impurity' measure is the Gini index for classification and the variance of the responses for regression. For survival, only 'permutation' is available.
write.forest
Save ranger.forest object, needed for prediction.
probability
Grow a probability forest as in Malley et al. (2012).
min.node.size
Minimal node size. Default 1 for classification, 5 for regression, 3 for survival, and 10 for probability.
replace
Sample with replacement.
sample.fraction
Fraction of observations to sample. Default is 1 for sampling with replacement and 0.632 for sampling without replacement.
splitrule
Splitting rule, survival only. The splitting rule can be either "logrank" or "C", with default "logrank".
case.weights
Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees.
split.select.weights
Numeric vector with weights between 0 and 1, representing the probability to select variables for splitting. Alternatively, a list of size num.trees, containing split select weight vectors for each tree can be used.
always.split.variables
Character vector with variable names to be always tried for splitting.
respect.unordered.factors
Regard unordered factor covariates as unordered categorical variables. If FALSE, all factors are regarded as ordered.
scale.permutation.importance
Scale permutation importance by standard error as in (Breiman 2001). Only applicable if the permutation variable importance mode is selected.
keep.inbag
Save how often observations are in-bag in each tree.
num.threads
Number of threads. Default is the number of CPUs available.
save.memory
Use memory saving (but slower) splitting mode. No effect for GWAS data.
verbose
Verbose output on or off.
seed
Random seed. Default is NULL, which generates the seed from R.
dependent.variable.name
Name of dependent variable, needed if no formula given. For survival forests this is the time variable.
status.variable.name
Name of status variable, only applicable to survival data and needed if no formula given. Use 1 for event and 0 for censoring.
classification
Only needed if data is a matrix. Set to TRUE to grow a classification forest.
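
As an illustration of how several of these arguments combine, a minimal sketch on the built-in iris data; the specific values are chosen for demonstration only, not as recommendations.

library(ranger)

## Illustrative call: explicit mtry and node size, permutation importance
## scaled by its standard error, fixed seed, two threads.
rf <- ranger(Species ~ ., data = iris,
             num.trees = 200, mtry = 2, min.node.size = 1,
             importance = "permutation",
             scale.permutation.importance = TRUE,
             write.forest = TRUE, num.threads = 2, seed = 42)
rf$variable.importance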

Value

Object of class ranger with the following elements (a short access example follows the list):

  • forest: Saved forest (if write.forest set to TRUE). Note that the variable IDs in the split.varIDs object do not necessarily represent the column number in R.
  • predictions: Predicted classes/values, based on out-of-bag samples (classification and regression only).
  • variable.importance: Variable importance for each independent variable.
  • prediction.error: Overall out-of-bag prediction error. For classification this is the fraction of misclassified samples, for regression the mean squared error and for survival one minus Harrell's C-index.
  • r.squared: R squared. Also called explained variance or coefficient of determination (regression only).
  • confusion.matrix: Contingency table for classes and predictions based on out-of-bag samples (classification only).
  • unique.death.times: Unique death times (survival only).
  • chf: Estimated cumulative hazard function for each sample (survival only).
  • survival: Estimated survival function for each sample (survival only).
  • call: Function call.
  • num.trees: Number of trees.
  • num.independent.variables: Number of independent variables.
  • mtry: Value of mtry used.
  • min.node.size: Value of minimal node size used.
  • treetype: Type of forest/tree, i.e. classification, regression or survival.
  • importance.mode: Importance mode used.
  • num.samples: Number of samples.
  • inbag.counts: Number of times the observations are in-bag in the trees.
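
For illustration, a short sketch of accessing some of these elements after fitting a classification forest on iris:

library(ranger)

rf <- ranger(Species ~ ., data = iris)
rf$prediction.error   ## overall out-of-bag misclassification rate
rf$confusion.matrix   ## out-of-bag confusion matrix (classification only)
rf$num.trees          ## number of trees grown
head(rf$predictions)  ## out-of-bag predicted classes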

Details

The tree type is determined by the type of the dependent variable. For factors, classification trees are grown; for numeric values, regression trees; and for survival objects, survival trees. The Gini index is used as the splitting rule for classification, the estimated response variances for regression, and the log-rank test for survival; for survival, an AUC-based splitting rule is also available (see splitrule).
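
As a brief illustration on the built-in iris data, the same call grows different forest types depending on the class of the dependent variable:

library(ranger)

## Factor response: classification forest (Gini index splitting)
ranger(Species ~ ., data = iris)$treetype

## Numeric response: regression forest (variance splitting)
ranger(Sepal.Length ~ ., data = iris)$treetype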

With the probability option and a factor dependent variable, a probability forest is grown. Here, the estimated response variances are used for splitting, as in regression forests. Predictions are class probabilities for each sample. For details see Malley et al. (2012).
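
A minimal sketch of a probability forest on iris; the predictions are per-class probabilities rather than class labels:

library(ranger)

rf.prob <- ranger(Species ~ ., data = iris, probability = TRUE,
                  write.forest = TRUE)
pred <- predict(rf.prob, data = iris)
head(pred$predictions)  ## one column of estimated probabilities per class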

Note that for classification and regression, nodes smaller than min.node.size can occur, as in the original Random Forest. For survival, all nodes contain at least min.node.size samples. Variables selected with always.split.variables are tried in addition to the mtry variables randomly selected. In split.select.weights, variables weighted with 0 are never selected and variables with 1 are always selected. Weights do not need to sum up to 1; they will be normalized later. Using split.select.weights can increase the computation time for large forests.
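
For illustration on iris (four independent variables; the weight values below are arbitrary), split select weights and always-split variables could be passed as follows:

library(ranger)

## One weight per independent variable, in the column order of the data.
## Variables with weight 0 are never selected, weight 1 always selected.
ranger(Species ~ ., data = iris,
       split.select.weights = c(0.1, 0.2, 0.3, 0.4))

## Always try Petal.Length in addition to the mtry randomly selected variables.
ranger(Species ~ ., data = iris,
       always.split.variables = "Petal.Length")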

For data frames with a large number of variables, the formula interface can be slow or impossible to use. Alternatively, dependent.variable.name (and status.variable.name for survival) can be used. Consider setting save.memory = TRUE if you encounter memory problems with very large datasets.
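
A sketch of the formula-free interface combined with the memory-saving splitting mode; on a small data set like iris this is only for illustration:

library(ranger)

ranger(dependent.variable.name = "Species", data = iris,
       save.memory = TRUE)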

For GWAS data consider combining ranger with the GenABEL package. See the Examples section below for a demonstration using Plink data. All SNPs in the GenABEL object will be used for splitting. To use only the SNPs without sex or other covariates from the phenotype file, use 0 on the right hand side of the formula. Note that missing values are treated as an extra category while splitting.
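
As a sketch of the 0-on-the-right-hand-side formula, assuming a gwaa.data object dat.gwaa with a factor phenotype trait as constructed in the Examples below (the required files are not included there either):

## Use only the SNPs, excluding sex and other phenotype covariates
ranger(trait ~ 0, data = dat.gwaa)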

See https://github.com/mnwright/ranger for the development version.

Notes:

  • Multithreading is currently not supported for Microsoft Windows platforms.

References

Wright, M. N. & Ziegler, A. (2016). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, in press. http://arxiv.org/abs/1508.04409.

Breiman, L. (2001). Random forests. Mach Learn, 45(1), 5-32.

Ishwaran, H., Kogalur, U. B., Blackstone, E. H., & Lauer, M. S. (2008). Random survival forests. Ann Appl Stat, 841-860.

Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G., & Ziegler, A. (2012). Probability machines: consistent probability estimation using nonparametric learning machines. Methods Inf Med, 51(1), 74.

See Also

predict.ranger

Examples

require(ranger)

## Classification forest with default settings
ranger(Species ~ ., data = iris)

## Prediction
train.idx <- sample(nrow(iris), 2/3 * nrow(iris))
iris.train <- iris[train.idx, ]
iris.test <- iris[-train.idx, ]
rg.iris <- ranger(Species ~ ., data = iris.train, write.forest = TRUE)
pred.iris <- predict(rg.iris, data = iris.test)
table(iris.test$Species, pred.iris$predictions)

## Variable importance
rg.iris <- ranger(Species ~ ., data = iris, importance = "impurity")
rg.iris$variable.importance

## Survival forest
require(survival)
rg.veteran <- ranger(Surv(time, status) ~ ., data = veteran)
plot(rg.veteran$unique.death.times, rg.veteran$survival[1,])

## Alternative interface
ranger(dependent.variable.name = "Species", data = iris)

## Use GenABEL interface to read Plink data into R and grow a classification forest
## The ped and map files are not included
library(GenABEL)
convert.snp.ped("data.ped", "data.map", "data.raw")
dat.gwaa <- load.gwaa.data("data.pheno", "data.raw")
phdata(dat.gwaa)$trait <- factor(phdata(dat.gwaa)$trait)
ranger(trait ~ ., data = dat.gwaa)
