reprTrees: Select and visualize covariate-representative tree roots (CRTRs)

Description

Implements the algorithm for selecting and visualizing covariate-representative tree roots (CRTRs) as described in Hornung & Hapfelmeier (2026).
CRTRs are tree roots extracted from a unity forest that characterize the conditions under which a given variable exhibits its strongest effect on the outcome. The function selects one representative tree root for each variable and visualizes its structure to facilitate interpretation. CRTRs can be used to visualize and interpret the effects identified by the unity VIM (unityfor). See the 'Details' section below.

Usage

reprTrees(
  object,
  vars = NULL,
  numvars = 5,
  indvars = NULL,
  num.threads = NULL,
  plotit = TRUE,
  highlight_relevant = TRUE,
  box_plots = TRUE,
  density_plots = TRUE,
  add_split_line = TRUE,
  verbose = TRUE
)

Value

Object of class unityfor.reprTrees with elements

rules: List. In-bag statistics on the outcome at each node in the CRTRs. For classification, this provides the class frequencies and the numbers of observations representing each class.
plots: List. Generated ggplot2 plots.
var.names: Labels of the variables for which CRTRs were selected.
var.names.all: Names of all independent variables in the dataset.
num.independent.variables: Number of independent variables in the dataset.
num.samples: Number of observations in the dataset.
treetype: Tree type.
forest: Sub-forest that contains only the CRTRs.

Arguments

object: Object of class unityfor.
vars: Optional vector of variable names for which CRTRs should be obtained.
numvars: The number of the variables with the largest unity VIM values for which CRTRs should be obtained.
indvars: The indices of the variables with the largest unity VIM values for which CRTRs should be obtained. For example, if indvars = c(1, 3), the CRTRs for the variables with the largest and third-largest unity VIM values are obtained.
num.threads: Number of threads. Default is number of CPUs available.
plotit: Whether or not the CRTRs should be plotted or merely returned (invisibly). Default is TRUE.
highlight_relevant: Whether all nodes other than those containing the top-scoring splits for the variable of interest and their ancestor nodes should be shaded out. Default is TRUE. See the 'Details' section below for explanation.
box_plots: Whether boxplots should be used to show the outcome class-specific distributions of the variable values in the nodes with top-scoring splits (see 'Details' section for explanation). For classification only. Default is TRUE.
density_plots: Whether kernel density plots should be used to show the outcome class-specific distributions of the variable values in the nodes with top-scoring splits (see 'Details' section for explanation). For classification only. Default is TRUE.
add_split_line: Whether in the boxplots and/or density plots a line at the split point of the corresponding node should be drawn. Default is TRUE.
verbose: Verbose output on or off. Default is TRUE.
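
The rank-based selection described for numvars and indvars can be sketched in base R. This is only an illustration of the ranking logic as described above, not the package's internal code; the vector vim and its values are hypothetical and are not output of unityfor.

```r
# Hypothetical unity VIM values, for illustration only (not real output).
vim <- c(x1 = 0.02, x2 = 0.31, x3 = 0.11, x4 = 0.25, x5 = 0.00)

# Rank the variable names by decreasing importance.
ranked <- names(sort(vim, decreasing = TRUE))

# Analogue of numvars = 3: the three variables with the largest values.
top_numvars <- head(ranked, 3)

# Analogue of indvars = c(1, 3): the largest and the third-largest.
top_indvars <- ranked[c(1, 3)]

print(top_numvars)
print(top_indvars)
```

With these made-up values, the ranking is x2, x4, x3, x1, x5, so the indvars analogue picks x2 and x3.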

Author

Roman Hornung

Details

Further details on the descriptions below are provided in Hornung & Hapfelmeier (2026).

Covariate-representative tree roots (CRTRs). CRTRs (Hornung & Hapfelmeier, 2026) are tree fragments (or 'tree roots', i.e., the first few splits in the trees) extracted from a fitted unity forest (unityfor) that characterize for given variables the conditions under which each variable exerts its strongest influence on the prediction.

Technically, for a given variable, the algorithm identifies tree roots in which this variable attains particularly high split scores (top-scoring splits). From these tree roots, a representative root is extracted (Laabs et al., 2024) that best reflects the conditions under which this variable has its strongest effect.

Interpretation and subgroup effects. If a variable has a strong marginal effect, the corresponding CRTR typically contains a split on this variable at the root node (first split in the tree). In contrast, if a variable has little marginal effect but interacts with another variable, the CRTR may first split on that other variable, thereby defining a subgroup in which the variable of interest exhibits a strong conditional effect.

From a substantive perspective, CRTRs enable the exploration of variable effects that are generally not detectable by conventional methods focusing on marginal associations. In particular, CRTRs can reveal variables that have weak marginal effects but act strongly within specific subgroups defined by interactions with other variables.
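
The kind of effect meant here can be illustrated with a small base R simulation. It is a sketch under simplifying assumptions (a continuous outcome, one binary subgroup variable g, and effects of x that cancel exactly across subgroups); all names are hypothetical, and no unity forest is fitted.

```r
set.seed(1)
n <- 2000

g <- rbinom(n, 1, 0.5)  # variable defining the subgroups
x <- rnorm(n)           # variable of interest

# x increases y in subgroup g == 1 and decreases y in subgroup g == 0,
# so the two conditional effects cancel in the marginal association.
y <- ifelse(g == 1, 1, -1) * x + rnorm(n, sd = 0.5)

marginal <- cor(x, y)                  # close to zero
cond1 <- cor(x[g == 1], y[g == 1])     # strongly positive
cond0 <- cor(x[g == 0], y[g == 0])     # strongly negative

print(round(c(marginal = marginal, g1 = cond1, g0 = cond0), 2))
```

In such a setting, a method that looks only at marginal associations would miss x, whereas a tree that first splits on g and then on x captures the conditional effect.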

Relation to unity VIM. CRTRs are closely related to the unity variable importance measure (unity VIM) (unityfor). The unity VIM quantifies the strength of variable effects under the conditions in which they are strongest. Analogously, CRTRs visualize these conditions by displaying the tree structures that give rise to the respective unity VIM values.

Accordingly, the CRTR algorithm can be used to visualize and interpret the effects identified by the unity VIM. By default, CRTRs are constructed and visualized for the five variables with the largest unity VIM values.

Scope of applicability. CRTRs should primarily be examined for variables with sufficiently large unity VIM values. Constructing CRTRs for variables with negligible importance may lead to overinterpretation, as apparent patterns may reflect random structure rather than meaningful effects.

Shaded regions in the visualization. For improved interpretability, parts of the CRTRs are shaded out by default. Specifically, only the nodes containing the top-scoring splits for the variable of interest and their ancestor nodes are shown prominently.

This design is motivated by two considerations. First, the purpose of CRTRs is to depict the conditions under which a variable exhibits its strongest effects - conditions that are defined by the ancestors of the nodes with top-scoring splits. Second, the remaining regions of the tree are of limited interpretive value. Since each CRTR is derived from tree roots selected for strong effects of a specific variable, the splitting patterns along the highlighted paths are specific for that variable. In contrast, shaded regions reflect arbitrary aspects of the overall association structure in the data and may include splits on non-informative variables, as each tree root is grown from a (small) random subset of all available variables.

Note that additional splits on the variable of interest may occur within shaded regions and can still be relevant. However, these splits do not represent the conditions under which the variable attains its strongest effects.

In-bag data for top-scoring split visualizations. The boxplots and density plots illustrating the discriminatory power of the top-scoring splits are computed exclusively based on the in-bag observations of the corresponding trees. This is consistent with the construction of the CRTRs themselves, which are derived from in-bag data only.
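
As a rough illustration of what "in-bag only" means for such summaries, the following base R sketch computes class-specific statistics at a split using only resampled observations. It assumes the in-bag sample is a bootstrap sample drawn with replacement, which may differ from how the package samples; the data, the split point, and all names are hypothetical.

```r
set.seed(42)
n <- 200

# Simulated two-class data with one variable x.
dat <- data.frame(
  class = factor(rep(c("A", "B"), each = n / 2)),
  x = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 2))
)

# Assumption: in-bag observations form a bootstrap sample.
inbag_idx <- sample(n, size = n, replace = TRUE)
inbag <- dat[inbag_idx, ]

split_point <- 1  # hypothetical split point at the node

# Class-specific five-number summaries (the quantities a boxplot displays).
summ <- tapply(inbag$x, inbag$class, fivenum)

# Class frequencies on each side of the split, in-bag observations only.
side <- ifelse(inbag$x <= split_point, "left", "right")
tab <- table(side, inbag$class)
print(tab)
```

Observations never drawn into inbag_idx (the out-of-bag observations) do not enter these summaries.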

References

Hornung, R., & Hapfelmeier, A. (2026). Unity Forests: Improving Interaction Modelling and Interpretability in Random Forests. arXiv:2601.07003, doi:10.48550/arXiv.2601.07003.
Laabs, B.-H., Westenberger, A., & König, I. R. (2024). Identification of representative trees in random forests based on a new tree-based distance measure. Advances in Data Analysis and Classification 18(2):363-380, doi:10.1007/s11634-023-00537-7.

Examples

# \donttest{

## Load package:

library("unityForest")


## Set seed to make results reproducible:

set.seed(1234)


## Load wine dataset:

data(wine)


## Construct unity forest and calculate unity VIM values:

model <- unityfor(dependent.variable.name = "C", data = wine,
                  importance = "unity", num.trees = 2000)

# NOTE: num.trees = 2000 (in the above) would be too small for practical 
# purposes. This quite small number of trees was simply used to keep the
# runtime of the example short.
# The default number of trees is num.trees = 20000.


## Visualize the CRTRs for the five variables with the largest unity VIM
## values:

reprTrees(model, box_plots = FALSE, density_plots = FALSE)


## Visualize the CRTRs for the variables with the largest and third-largest 
## unity VIM values:

reprTrees(model, indvars = c(1, 3), box_plots = FALSE, density_plots = FALSE)


## Visualize the CRTRs for the variables with the largest and third-largest
## unity VIM values, where density plots are shown to visualize the
## outcome class-specific distributions of the variable values in the
## nodes with top-scoring splits:

reprTrees(model, indvars = c(1, 3), box_plots = FALSE, density_plots = TRUE)


## Visualize the CRTRs for the variables with the largest and third-largest
## unity VIM values, where both density plots and boxplots are shown to
## visualize the outcome class-specific distributions of the variable values
## in the nodes with top-scoring splits; the split points are not indicated
## in these plots:

ps <- reprTrees(model, indvars = c(1, 3), add_split_line = FALSE)


## Save one of the CRTRs with the corresponding density plot:

library("patchwork")
library("ggplot2")

p <- ps$plots[[1]]$tree_plot / ps$plots[[1]]$density_plot +
     patchwork::plot_layout(heights = c(2, 1))
p

# outfile <- file.path(tempdir(), "figure_xy.pdf")
# ggsave(outfile, device = cairo_pdf, plot = p, width = 18, 
#        height = 14)


# Note: The plots can be manipulated with the usual ggplot2 syntax, e.g.:

ps$plots[[1]]$density_plot + xlab("Proline") + labs(title = NULL, y = NULL) +
  theme(
    legend.position = c(0.95, 0.95),
    legend.justification = c(1, 1)
  )

# }
