Configure the fitting process of individual decision trees.
tree.control(
  nodesize = 10,
  split_criterion = "gini",
  alpha = 0.05,
  cp = 0.001,
  smoothing = "none",
  mtry = "none",
  covariable = "final_4pl"
)
An object of class tree.control which is a list of all necessary tree parameters.
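As a minimal sketch (assuming the package providing tree.control is attached), the control object can be created with only the parameters of interest overridden, since all arguments carry the defaults shown in the usage above:

```r
library(logicDT)  # package assumed to provide tree.control

# Override selected parameters; the rest keep their defaults
ctrl <- tree.control(nodesize = 5, cp = 0.01)

# The result is a plain list of all tree parameters
str(ctrl)
```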
nodesize: Minimum number of samples contained in a terminal node. This parameter ensures that enough samples are available for performing predictions, which includes fitting regression models such as 4pL models.
split_criterion: Splitting criterion for deciding when and how to split. The default is "gini"/"mse", which utilizes the Gini splitting criterion in binary risk estimation tasks and the mean squared error as the impurity measure in regression tasks. Alternatively, "4pl" can be used if a quantitative covariable is supplied and the parameter covariable is chosen such that 4pL model fitting is enabled, i.e., covariable = "final_4pl" or covariable = "full_4pl". A fast modeling alternative is given by "linear", which also requires the parameter covariable to be properly chosen, i.e., covariable = "final_linear" or covariable = "full_linear".
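For instance, enabling model-based splitting requires pairing the criterion with a compatible covariable mode; a hedged sketch using the parameter values described above:

```r
# 4pL-based splitting: the covariable setting must enable 4pL model fitting
ctrl_4pl <- tree.control(split_criterion = "4pl", covariable = "full_4pl")

# Faster alternative: linear models for splitting and leaf modeling
ctrl_lin <- tree.control(split_criterion = "linear", covariable = "final_linear")
```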
alpha: Significance threshold for the likelihood ratio tests when using split_criterion = "4pl" or "linear". Only splits that achieve a p-value smaller than alpha are eligible.
cp: Complexity parameter. This parameter determines by how much the impurity has to be reduced to further split a node. Here, the total tree impurity is considered. See the details for the specific formula. Only used if split_criterion = "gini" or "mse".
smoothing: Shall the leaf predictions for risk estimation be smoothed? "laplace" yields Laplace smoothing. The default is "none", which does not employ smoothing.
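Laplace smoothing conventionally pulls leaf risk estimates away from the extremes 0 and 1 by adding pseudo-counts; a sketch of the standard add-one formula (this illustrates the general technique, not necessarily the exact implementation used here):

```r
# Standard Laplace (add-one) smoothing for a leaf with n1 cases out of n samples:
# the unsmoothed estimate n1/n becomes (n1 + 1)/(n + 2)
laplace_risk <- function(n1, n) (n1 + 1) / (n + 2)

laplace_risk(0, 10)   # 1/12 instead of a hard 0
laplace_risk(10, 10)  # 11/12 instead of a hard 1
```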
mtry: Shall the tree fitting process be randomized as in random forests? Currently, only "sqrt" for using \(\sqrt{p}\) random predictors at each node for splitting and "none" (default) for fitting conventional decision trees are supported.
covariable: How shall optional quantitative covariables be handled? "constant" ignores them. Alternatively, they can be considered as splitting variables ("_split"), used for fitting 4pL models in each leaf ("_4pl"), or used for fitting linear models in each leaf ("_linear"). If either splitting or model fitting is chosen, one should state whether this should be handled over the whole search ("full_", computationally expensive) or just in the final trees ("final_"). Thus, "final_4pl" leads to fitting 4pL models in each leaf, but only when fitting the final tree.
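Putting the naming scheme together, a covariable mode other than "constant" combines a scope prefix with a handling suffix; a sketch enumerating the values implied by the description above (the combined strings are assumptions derived from that scheme):

```r
# Scope: "full_" = during the whole search, "final_" = only for the final trees
# Handling: "_split", "_4pl", "_linear"; "constant" ignores the covariable
modes <- c("constant",
           "full_split",  "final_split",
           "full_4pl",    "final_4pl",
           "full_linear", "final_linear")

# E.g., 4pL leaf models fitted only in the final trees:
ctrl <- tree.control(covariable = "final_4pl")
```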
For the Gini or MSE splitting criterion, if any considered split \(s\) leads to
$$P(t) \cdot \Delta I(s,t) > \texttt{cp}$$
for a node \(t\), with the empirical node probability \(P(t)\) and the impurity reduction \(\Delta I(s,t)\), then the node is split further. If not, the node is declared a leaf.
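The splitting rule above can be illustrated numerically; a hedged sketch with made-up node statistics, not package code:

```r
# Hypothetical node t containing 40% of the samples
P_t <- 0.4
# Best split s reduces the node impurity by 0.005
delta_I <- 0.005
cp <- 0.001

# Total-tree impurity reduction exceeds cp, so the node is split further
P_t * delta_I > cp  # 0.002 > 0.001 yields TRUE
```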
For continuous outcomes, cp will be scaled by the empirical variance of y to ensure the right scaling, i.e., cp <- cp * var(y). Since the impurity measure for continuous outcomes is the mean squared error, this can be interpreted as controlling the minimum reduction of the normalized mean squared error (NRMSE to the power of two).
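A short sketch of this rescaling step for a continuous outcome (the outcome y here is simulated purely for illustration):

```r
# For a continuous outcome y, cp is rescaled by the empirical variance,
# so the threshold effectively acts on the normalized MSE (NRMSE^2)
y <- rnorm(100, mean = 5, sd = 2)
cp <- 0.001
cp <- cp * var(y)  # the internal rescaling described above
```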
If one chooses the 4pL or linear splitting criterion, likelihood ratio tests are employed that test the alternative hypothesis of better-fitting individual models. The corresponding test statistic asymptotically follows a \(\chi^2\) distribution, where the degrees of freedom are given by the difference in the number of model parameters, i.e., \(2 \cdot 4 - 4 = 4\) degrees of freedom in the case of 4pL models and \(2 \cdot 2 - 2 = 2\) degrees of freedom in the case of linear models.
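The degrees of freedom follow from the parameter counts: two separate models on the child nodes versus one joint model on the parent. The resulting chi-squared critical values can be checked with base R's qchisq (alpha here is the default from the usage above):

```r
# 4pL models: 2 * 4 - 4 = 4 degrees of freedom
df_4pl <- 2 * 4 - 4
# Linear models: 2 * 2 - 2 = 2 degrees of freedom
df_lin <- 2 * 2 - 2

alpha <- 0.05
# A split is eligible if its LR statistic exceeds the critical value
qchisq(1 - alpha, df = df_4pl)  # approx. 9.49
qchisq(1 - alpha, df = df_lin)  # approx. 5.99
```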
For binary outcomes, choosing to fit linear models for evaluating the splits or for modeling the leaves actually leads to fitting LDA (linear discriminant analysis) models.