Configure the fitting process of individual decision trees.
tree.control(
  nodesize = 10,
  split_criterion = "gini",
  alpha = 0.05,
  cp = 0.001,
  smoothing = "none",
  mtry = "none",
  covariable = "final_4pl"
)
An object of class tree.control which is a list of all necessary tree parameters.
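As a minimal sketch (assuming the package providing tree.control is attached), the control object can be created with only the parameters of interest overridden, since all arguments carry the defaults shown in the usage above:

```r
library(logicDT)  # package assumed to provide tree.control

# Override selected parameters; the rest keep their defaults
ctrl <- tree.control(nodesize = 5, cp = 0.01)

# The result is a plain list of all tree parameters
str(ctrl)
```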
nodesize: Minimum number of samples contained in a terminal node. This parameter ensures that enough samples are available for performing predictions, which includes fitting regression models such as 4pL models.
split_criterion: Splitting criterion for deciding when and how to split. The default is "gini"/"mse", which utilizes the Gini splitting criterion in binary risk estimation tasks and the mean squared error as the impurity measure in regression tasks. Alternatively, "4pl" can be used if a quantitative covariable is supplied and the parameter covariable is chosen such that 4pL model fitting is enabled, i.e., covariable = "final_4pl" or covariable = "full_4pl". A fast modeling alternative is given by "linear", which also requires the parameter covariable to be properly chosen, i.e., covariable = "final_linear" or covariable = "full_linear".
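For instance, enabling model-based splitting requires pairing the criterion with a compatible covariable mode; a hedged sketch using the parameter values described above:

```r
# 4pL-based splitting: the covariable setting must enable 4pL model fitting
ctrl_4pl <- tree.control(split_criterion = "4pl", covariable = "full_4pl")

# Faster alternative: linear models for splitting and leaf modeling
ctrl_lin <- tree.control(split_criterion = "linear", covariable = "final_linear")
```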
alpha: Significance threshold for the likelihood ratio tests when using split_criterion = "4pl" or "linear". Only splits that achieve a p-value smaller than alpha are eligible.
cp: Complexity parameter. This parameter determines by how much the impurity has to be reduced to further split a node. Here, the total tree impurity is considered. See the details for the specific formula. Only used if split_criterion = "gini" or "mse".
smoothing: Shall the leaf predictions for risk estimation be smoothed? "laplace" yields Laplace smoothing. The default is "none", which does not employ smoothing.
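Laplace smoothing conventionally pulls leaf risk estimates away from the extremes 0 and 1 by adding pseudo-counts; a sketch of the standard add-one formula (this illustrates the general technique, not necessarily the exact implementation used here):

```r
# Standard Laplace (add-one) smoothing for a leaf with n1 cases out of n samples:
# the unsmoothed estimate n1/n becomes (n1 + 1)/(n + 2)
laplace_risk <- function(n1, n) (n1 + 1) / (n + 2)

laplace_risk(0, 10)   # 1/12 instead of a hard 0
laplace_risk(10, 10)  # 11/12 instead of a hard 1
```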
mtry: Shall the tree fitting process be randomized as in random forests? Currently, only "sqrt" for using \(\sqrt{p}\) random predictors at each node for splitting and "none" (default) for fitting conventional decision trees are supported.
covariable: How shall optional quantitative covariables be handled? "constant" ignores them. Alternatively, they can be considered as splitting variables ("_split"), used for fitting 4pL models in each leaf ("_4pl"), or used for fitting linear models in each leaf ("_linear"). If either splitting or model fitting is chosen, one should state whether this should be handled over the whole search ("full_", computationally expensive) or just in the final trees ("final_"). Thus, "final_4pl" leads to fitting 4pL models in each leaf, but only when fitting the final tree.
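Putting the naming scheme together, a covariable mode other than "constant" combines a scope prefix with a handling suffix; a sketch enumerating the values implied by the description above (the combined strings are assumptions derived from that scheme):

```r
# Scope: "full_" = during the whole search, "final_" = only for the final trees
# Handling: "_split", "_4pl", "_linear"; "constant" ignores the covariable
modes <- c("constant",
           "full_split",  "final_split",
           "full_4pl",    "final_4pl",
           "full_linear", "final_linear")

# E.g., 4pL leaf models fitted only in the final trees:
ctrl <- tree.control(covariable = "final_4pl")
```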
For the Gini or MSE splitting criterion, if any considered split \(s\) leads to
$$P(t) \cdot \Delta I(s,t) > \texttt{cp}$$
for a node \(t\), with the empirical node probability \(P(t)\) and the impurity reduction \(\Delta I(s,t)\), then the node is split further. If not, the node is declared a leaf.
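The splitting rule above can be illustrated numerically; a hedged sketch with made-up node statistics, not package code:

```r
# Hypothetical node t containing 40% of the samples
P_t <- 0.4
# Best split s reduces the node impurity by 0.005
delta_I <- 0.005
cp <- 0.001

# Total-tree impurity reduction exceeds cp, so the node is split further
P_t * delta_I > cp  # 0.002 > 0.001 yields TRUE
```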
For continuous outcomes, cp will be scaled by the empirical variance of y to ensure the right scaling, i.e., cp <- cp * var(y). Since the impurity measure for continuous outcomes is the mean squared error, this can be interpreted as controlling the minimum reduction of the normalized mean squared error (NRMSE to the power of two).
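A short sketch of this rescaling step for a continuous outcome (the outcome y here is simulated purely for illustration):

```r
# For a continuous outcome y, cp is rescaled by the empirical variance,
# so the threshold effectively acts on the normalized MSE (NRMSE^2)
y <- rnorm(100, mean = 5, sd = 2)
cp <- 0.001
cp <- cp * var(y)  # the internal rescaling described above
```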
If one chooses the 4pL or linear splitting criterion, likelihood ratio tests are employed that test the alternative hypothesis of better-fitting individual models. The corresponding test statistic asymptotically follows a \(\chi^2\) distribution, where the degrees of freedom are given by the difference in the number of model parameters, i.e., \(2 \cdot 4 - 4 = 4\) degrees of freedom in the case of 4pL models and \(2 \cdot 2 - 2 = 2\) degrees of freedom in the case of linear models.
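The degrees of freedom follow from the parameter counts: two separate models on the child nodes versus one joint model on the parent. The resulting chi-squared critical values can be checked with base R's qchisq (alpha here is the default from the usage above):

```r
# 4pL models: 2 * 4 - 4 = 4 degrees of freedom
df_4pl <- 2 * 4 - 4
# Linear models: 2 * 2 - 2 = 2 degrees of freedom
df_lin <- 2 * 2 - 2

alpha <- 0.05
# A split is eligible if its LR statistic exceeds the critical value
qchisq(1 - alpha, df = df_4pl)  # approx. 9.49
qchisq(1 - alpha, df = df_lin)  # approx. 5.99
```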
For binary outcomes, choosing to fit linear models for evaluating the splits or for modeling the leaves actually leads to fitting LDA (linear discriminant analysis) models.