ranger(formula = NULL, data = NULL, num.trees = 500, mtry = NULL, importance = "none", write.forest = FALSE, probability = FALSE, min.node.size = NULL, replace = TRUE, sample.fraction = ifelse(replace, 1, 0.632), case.weights = NULL, splitrule = NULL, alpha = 0.5, minprop = 0.1, split.select.weights = NULL, always.split.variables = NULL, respect.unordered.factors = "ignore", scale.permutation.importance = FALSE, keep.inbag = FALSE, holdout = FALSE, num.threads = NULL, save.memory = FALSE, verbose = TRUE, seed = NULL, dependent.variable.name = NULL, status.variable.name = NULL, classification = NULL)
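formula: Object of class formula or character describing the model to fit.
data: Training data of class data.frame, matrix or gwaa.data (GenABEL).
write.forest: Save the ranger.forest object, needed for prediction.
seed: Random seed. The default is NULL, which generates the seed from R.
classification: Set to TRUE to grow a classification forest.
Object of class ranger with elements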
forest: note that the variable IDs in the split.varIDs object do not necessarily represent the column number in R.
predictions
variable.importance
prediction.error
r.squared
confusion.matrix
unique.death.times
chf
survival
call
num.trees
num.independent.variables
mtry
min.node.size
treetype
importance.mode
num.samples
inbag.counts
With the probability option and a factor dependent variable, a probability forest is grown.
Here, the node impurity is used for splitting, as in classification forests.
Predictions are class probabilities for each sample.
In contrast to other implementations, each tree returns a probability estimate and these estimates are averaged for the forest probability estimate.
For details see Malley et al. (2012).
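A minimal sketch of the probability forest described above, using the built-in iris data. The object names rg.prob and pred.prob and the seed value are illustrative choices, not part of the package documentation; write.forest = TRUE is set only so that the fitted forest can be passed to predict().

```r
library(ranger)

# Grow a probability forest: factor response plus probability = TRUE.
rg.prob <- ranger(Species ~ ., data = iris, probability = TRUE,
                  write.forest = TRUE, seed = 42)

# Out-of-bag predictions: one row per sample, one column per class.
head(rg.prob$predictions)

# Class probabilities for new data (here simply the first five rows).
pred.prob <- predict(rg.prob, data = iris[1:5, ])
pred.prob$predictions
```

Each row of pred.prob$predictions is the average of the per-tree probability estimates, so the entries in a row should sum to 1.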
Note that for classification and regression, nodes with fewer than min.node.size samples can occur, as in the original Random Forest.
For survival all nodes contain at least min.node.size samples.
Variables selected with always.split.variables are tried in addition to the mtry variables selected at random.
In split.select.weights, variables weighted with 0 are never selected and variables with 1 are always selected.
Weights do not need to sum to 1; they are normalized later.
Using split.select.weights can increase the computation time for large forests.
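The two options above could be used roughly as follows. This is a sketch on the iris data; the weight values, the object names w, rg.w and rg.always, and the seed are assumptions made for illustration. The weights are given in the column order of the independent variables.

```r
library(ranger)

# One weight per independent variable, in column order:
# Sepal.Length, Sepal.Width, Petal.Length, Petal.Width.
# A weight of 0 excludes Sepal.Length from all splits.
w <- c(0, 0.2, 0.4, 0.4)
rg.w <- ranger(Species ~ ., data = iris, split.select.weights = w,
               importance = "impurity", seed = 1)
rg.w$variable.importance

# Always offer Petal.Width as a split candidate, in addition to the
# mtry variables drawn at random.
rg.always <- ranger(Species ~ ., data = iris, mtry = 2,
                    always.split.variables = "Petal.Width", seed = 1)
rg.always$prediction.error
```

Because Sepal.Length is never selected for splitting, its impurity importance in rg.w should be zero.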
Unordered factor covariates can be handled in 3 different ways by using respect.unordered.factors:
For 'ignore' all factors are regarded ordered, for 'partition' all possible 2-partitions are considered for splitting.
For 'order' and 2-class classification the factor levels are ordered by their proportion falling in the second class, for regression by their mean response, as described in Hastie et al. (2009), chapter 9.2.4.
For multiclass classification and survival outcomes, 'order' is experimental and should be used with care.
The use of 'order' is recommended for 2-class classification and regression, as it is computationally fast and can handle an unlimited number of factor levels.
Note that the factors are only reordered once and not again in each split.
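To illustrate the factor handling options, the sketch below grows a regression forest on iris, where Species acts as an unordered factor covariate, once with 'order' and once with 'partition'. The object names and the seed are illustrative assumptions.

```r
library(ranger)

# Regression forest in which Species is an unordered factor covariate.
rg.order <- ranger(Sepal.Length ~ ., data = iris,
                   respect.unordered.factors = "order", seed = 1)
rg.part <- ranger(Sepal.Length ~ ., data = iris,
                  respect.unordered.factors = "partition", seed = 1)

# Compare the out-of-bag prediction error of the two settings.
c(order = rg.order$prediction.error, partition = rg.part$prediction.error)
```

With only three factor levels the two settings are expected to behave similarly; the speed advantage of 'order' matters for factors with many levels.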
With a large number of variables and a data frame as input, the formula interface can be slow or impossible to use.
Alternatively, dependent.variable.name (and status.variable.name for survival) can be used.
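For a survival outcome, the alternative interface might look like the sketch below. It assumes the veteran data set from the survival package, whose columns time and status hold the survival time and the event indicator; num.trees = 100 and the seed are arbitrary choices to keep the example quick.

```r
library(ranger)
library(survival)  # provides the veteran data set

# Name the time and status columns instead of writing a formula.
rg.surv <- ranger(dependent.variable.name = "time",
                  status.variable.name = "status",
                  data = veteran, num.trees = 100, seed = 1)
rg.surv$treetype
head(rg.surv$unique.death.times)
```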
Consider setting save.memory = TRUE if you encounter memory problems for very large datasets.
For GWAS data consider combining ranger with the GenABEL package.
See the Examples section below for a demonstration using Plink data.
All SNPs in the GenABEL object will be used for splitting.
To use only the SNPs without sex or other covariates from the phenotype file, use 0 on the right-hand side of the formula.
Note that missing values are treated as an extra category while splitting.
See https://github.com/imbs-hl/ranger for the development version.
To use multithreading on Microsoft Windows platforms, there are currently two options:
predict.ranger
require(ranger)
## Classification forest with default settings
ranger(Species ~ ., data = iris)
## Prediction
train.idx <- sample(nrow(iris), 2/3 * nrow(iris))
iris.train <- iris[train.idx, ]
iris.test <- iris[-train.idx, ]
rg.iris <- ranger(Species ~ ., data = iris.train, write.forest = TRUE)
pred.iris <- predict(rg.iris, data = iris.test)
table(iris.test$Species, pred.iris$predictions)
## Variable importance
rg.iris <- ranger(Species ~ ., data = iris, importance = "impurity")
rg.iris$variable.importance
## Survival forest
require(survival)
rg.veteran <- ranger(Surv(time, status) ~ ., data = veteran)
plot(rg.veteran$unique.death.times, rg.veteran$survival[1,])
## Alternative interface
ranger(dependent.variable.name = "Species", data = iris)
## Not run:
# ## Use GenABEL interface to read Plink data into R and grow a classification forest
# ## The ped and map files are not included
# library(GenABEL)
# convert.snp.ped("data.ped", "data.map", "data.raw")
# dat.gwaa <- load.gwaa.data("data.pheno", "data.raw")
# phdata(dat.gwaa)$trait <- factor(phdata(dat.gwaa)$trait)
# ranger(trait ~ ., data = dat.gwaa)
# ## End(Not run)