Control Forest Hyper Parameters: Control for Conditional Tree Forests

Description

Various parameters that control aspects of the cforest fit via its control argument.

Usage

cforest_unbiased(…)
cforest_classical(…)
cforest_control(teststat = "max",
                testtype = "Teststatistic",
                mincriterion = qnorm(0.9),
                savesplitstats = FALSE,
                ntree = 500, mtry = 5, replace = TRUE,
                fraction = 0.632, trace = FALSE, …)

Arguments

teststat

a character specifying the type of the test statistic to be applied.

testtype

a character specifying how to compute the distribution of the test statistic.

mincriterion

the value of the test statistic (for testtype == "Teststatistic"), or 1 - p-value (for other values of testtype) that must be exceeded in order to implement a split.

mtry

number of input variables randomly sampled as candidates at each node for random forest like algorithms. Bagging, as a special case of a random forest without random input variable sampling, can be performed by setting mtry either to NULL or manually to the number of input variables.

savesplitstats

a logical determining whether the standardized two-sample statistics used for split point estimation are saved for each primary split.

ntree

number of trees to grow in a forest.

replace

a logical indicating whether sampling of observations is done with or without replacement.

fraction

fraction of number of observations to draw without replacement (only relevant if replace = FALSE).

trace

a logical indicating whether a progress bar is printed while the forest grows.

…

additional arguments to be passed to ctree_control.

Value

An object of class ForestControl-class.

Details

All three functions return an object of class ForestControl-class defining hyper parameters to be specified via the control argument of cforest.

The arguments teststat, testtype and mincriterion determine how the global null hypothesis of independence between all input variables and the response is tested (see ctree). The argument nresample, which is passed on to ctree_control via …, is the number of Monte Carlo replications used when testtype = "MonteCarlo".
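Because mincriterion is read on two different scales depending on testtype, it can help to look at the numbers involved. The following is a minimal base R sketch; the variable names crit_stat, crit_pval and alpha are illustrative and not part of the package.

```r
# Default threshold from the usage section, on the test-statistic scale
# (used when testtype = "Teststatistic"):
crit_stat <- qnorm(0.9)
round(crit_stat, 4)   # approximately 1.2816

# For the other test types, mincriterion is compared with 1 - p-value,
# so a value of 0.9 means a split requires a p-value below 0.1:
crit_pval <- 0.9
alpha <- 1 - crit_pval
```

The two thresholds live on different scales and are not interchangeable, so mincriterion should be chosen together with testtype.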

A split is established when the sum of the weights in both daughter nodes is larger than minsplit; this avoids pathological splits at the borders. When stump = TRUE, a tree with at most two terminal nodes is computed. Both minsplit and stump are arguments of ctree_control and are supplied via ….

The mtry argument regulates a random selection of mtry input variables in each node. Note that here mtry is fixed to the value 5 by default for merely technical reasons, while in randomForest the default values for classification and regression vary with the number of input variables. Make sure that mtry is defined properly before using cforest.
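One way to set mtry explicitly is to mimic the randomForest defaults, which are floor(sqrt(p)) for classification and max(floor(p / 3), 1) for regression, with p the number of input variables. The sketch below assumes this convention; the helper rf_style_mtry is hypothetical and belongs to neither package.

```r
# Hypothetical helper: randomForest-style default for mtry,
# given p input variables.
rf_style_mtry <- function(p, classification = TRUE) {
  if (classification) floor(sqrt(p)) else max(floor(p / 3), 1)
}

p <- ncol(iris) - 1                                    # 4 inputs for Species
mtry_cls <- rf_style_mtry(p, classification = TRUE)    # 2
mtry_reg <- rf_style_mtry(10, classification = FALSE)  # 3

# Pass the value to a control constructor (only if party is installed).
if (requireNamespace("party", quietly = TRUE)) {
  ctrl <- party::cforest_unbiased(ntree = 50, mtry = mtry_cls)
}
```

Setting mtry to the number of input variables instead would correspond to bagging, as described for the mtry argument.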

It might be informative to look at scatterplots of input variables against the standardized two-sample split statistics; these are available when savesplitstats = TRUE. Each node is then associated with a vector whose length is determined by the number of observations in the learning sample, and thus much more memory is required.

The number of trees ntree can be increased for large numbers of input variables.

Function cforest_unbiased returns the settings suggested for the construction of unbiased random forests (teststat = "quad", testtype = "Univ", replace = FALSE) by Strobl et al. (2007) and is the default since version 0.9-90. Hyper parameter settings mimicking the behaviour of randomForest are available in cforest_classical, which was the default up to version 0.9-14.
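The two presets can be tried side by side. The following is a rough sketch on the iris data, assuming the party package is installed; the values ntree = 50 and mtry = 2 are chosen only to keep the run short, and the forest is passed through the argument spelled controls, as in party's cforest.

```r
has_party <- requireNamespace("party", quietly = TRUE)

if (has_party) {
  set.seed(290875)

  # Settings following Strobl et al. (2007)
  ctrl_unbiased  <- party::cforest_unbiased(ntree = 50, mtry = 2)
  # Settings mimicking randomForest
  ctrl_classical <- party::cforest_classical(ntree = 50, mtry = 2)

  fit_unbiased  <- party::cforest(Species ~ ., data = iris,
                                  controls = ctrl_unbiased)
  fit_classical <- party::cforest(Species ~ ., data = iris,
                                  controls = ctrl_classical)

  # Out-of-bag predictions against the observed classes
  print(table(predict(fit_unbiased, OOB = TRUE), iris$Species))
  print(table(predict(fit_classical, OOB = TRUE), iris$Species))
}
```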

Please note that cforest, in contrast to randomForest, doesn't grow trees of maximal depth. To grow large trees, set mincriterion = 0.
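A control object for large trees could be built as sketched below, assuming the party package is installed. The call uses cforest_control directly with the unbiased settings named above and mincriterion = 0; ntree and mtry are illustrative values only.

```r
has_party <- requireNamespace("party", quietly = TRUE)

if (has_party) {
  set.seed(1)

  # mincriterion = 0 removes the significance threshold for splitting,
  # so tree size is limited only by the remaining stopping rules
  # of ctree_control (for example minsplit).
  ctrl_deep <- party::cforest_control(teststat = "quad", testtype = "Univ",
                                      mincriterion = 0, replace = FALSE,
                                      ntree = 50, mtry = 2)

  fit_deep <- party::cforest(Species ~ ., data = iris, controls = ctrl_deep)
}
```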

References

Carolin Strobl, Anne-Laure Boulesteix, Achim Zeileis and Torsten Hothorn (2007). Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution. BMC Bioinformatics, 8, 25. DOI: 10.1186/1471-2105-8-25