
unityForest (version 0.1.0)

unityfor: Construct a unity forest prediction rule and compute the unity VIM.

Description

Constructs a unity forest and computes the unity variable importance measure (VIM), as described in Hornung & Hapfelmeier (2026). Currently, only categorical outcomes are supported.
The unity forest algorithm is a tree construction approach for random forests in which the first few splits are optimized jointly in order to more effectively capture interaction effects beyond marginal effects. The unity VIM quantifies the influence of each variable under the conditions in which that influence is strongest, thereby placing a stronger emphasis on interaction effects than conventional variable importance measures.
To explore the nature of the effects identified by the unity VIM, it is essential to examine covariate-representative tree roots (CRTRs), which are implemented in reprTrees.

Usage

unityfor(
  formula = NULL,
  dependent.variable.name = NULL,
  data = NULL,
  num.trees = 20000,
  num.cand.trees = 500,
  probability = TRUE,
  importance = "none",
  prop.best.splits = NULL,
  min.node.size.root = NULL,
  min.node.size = NULL,
  max.depth.root = NULL,
  max.depth = NULL,
  prop.var.root = NULL,
  mtry.sprout = NULL,
  replace = FALSE,
  sample.fraction = ifelse(replace, 1, 0.7),
  case.weights = NULL,
  class.weights = NULL,
  inbag = NULL,
  oob.error = TRUE,
  num.threads = NULL,
  write.forest = TRUE,
  verbose = TRUE
)

Value

Object of class unityfor with elements

predictions

Predicted classes/values, based on out-of-bag samples.

forest

Saved forest (if write.forest is set to TRUE). Note that the variable IDs in the split.varIDs object do not necessarily represent the column numbers in R.

data

Training data.

variable.importance

Variable importance for each independent variable. Only available if importance is not "none".

importance.mode

Importance mode used.

prediction.error

Overall out-of-bag prediction error. For classification, this is the fraction of misclassified samples; for probability estimation, the Brier score; and for regression, the mean squared error.
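
As a rough illustration, one common form of the multiclass Brier score can be computed by hand as follows. This is a sketch only: probs and y stand for the OOB class-probability matrix and the observed factor outcome, and the package's exact convention may differ by a constant factor.

## Sketch: multiclass Brier score from an OOB class-probability
## matrix 'probs' (rows = samples, columns = classes) and the
## observed factor outcome 'y':
y01 <- model.matrix(~ y - 1)      # one-hot indicator matrix of the classes
mean(rowSums((probs - y01)^2))    # average squared deviation per sample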

confusion.matrix

Contingency table for classes and predictions based on out-of-bag samples (classification only).

call

Function call.

num.trees

Number of trees.

num.cand.trees

Number of candidate trees generated for each tree root.

num.independent.variables

Number of independent variables.

num.samples

Number of samples.

prop.var.root

Proportion of variables randomly sampled for each tree root.

mtry

Value of mtry used (in the tree sprouts).

max.depth.root

Maximal depth of the tree roots.

min.node.size.root

Minimal node size in the tree roots.

min.node.size

Value of minimal node size used.

splitrule

Splitting rule (used only in the tree sprouts).

replace

Sample with replacement.

treetype

Type of forest/tree. Classification or regression.

Arguments

formula

Object of class formula or character describing the model to fit. Interaction terms supported only for numerical variables.

dependent.variable.name

Name of outcome variable, needed if no formula given.

data

Training data of class data.frame, matrix, dgCMatrix (Matrix) or gwaa.data (GenABEL).

num.trees

Number of trees. Default is 20000.

num.cand.trees

Number of random candidate trees to generate for each tree root. Default is 500.

probability

Grow a probability forest as in Malley et al. (2012). (NOTE: currently only probability forests are implemented; this will change in a future version.)

importance

Variable importance mode, either 'unity' (unity VIM) or 'none'.

prop.best.splits

Related to the unity VIM. The default value should generally not be modified by the user. When calculating the unity VIM, only the top prop.best.splits × 100% of the splits (those with the highest split criterion values, weighted by node size) are considered for each variable. The default value is 0.01, meaning that only the top 1% of splits are used. While small values are recommended, they should not be set too low, so that each variable retains a sufficient number of splits for a reliable unity VIM computation.
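
Schematically, the selection rule for a single variable can be sketched as follows (illustrative values only, not the package's internal code):

## Sketch of the selection rule, using made-up criterion values:
crit <- runif(1000)                                # split criterion values
node.size <- sample(10:200, 1000, replace = TRUE)  # sizes of the split nodes
prop.best.splits <- 0.01
n.keep <- ceiling(prop.best.splits * length(crit))
best <- order(crit * node.size, decreasing = TRUE)[seq_len(n.keep)]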

min.node.size.root

Minimal node size in the tree roots. Default is 10 irrespective of the outcome type.

min.node.size

Minimal node size. Default is 1 for classification and 5 for probability estimation.

max.depth.root

Maximal depth of the tree roots. Default value is 3 and should generally not be modified by the user. Larger values can be associated with worse predictive performance for some datasets.

max.depth

Maximal tree depth. A value of NULL (the default) or 0 corresponds to unlimited depth, and 1 to tree stumps (one split per tree). If set, max.depth must be at least as large as max.depth.root.

prop.var.root

Proportion of variables randomly sampled for constructing each tree root. The default is the square root of the number of variables divided by the total number of variables, i.e., sqrt(p)/p for p variables. Consequently, by default, each tree root considers a random subset of variables whose size equals the (rounded up) square root of the total number of variables. An exception is made for datasets with more than 100 variables, where the default for prop.var.root is set to 0.1; see the 'Details' section below for the rationale.

mtry.sprout

Number of randomly sampled variables to possibly split at in each node of the tree sprouts (i.e., the branches of the trees beyond the tree roots). Default is the (rounded down) square root of the number of variables.
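
The default rules for prop.var.root and mtry.sprout described above can be written out explicitly. The following sketch assumes, based on the wording above, that the root subset size is obtained by rounding up; p denotes the number of independent variables:

p <- 50                                    # example: number of independent variables
prop.var.root <- if (p > 100) 0.1 else sqrt(p) / p
n.root.vars <- ceiling(prop.var.root * p)  # (rounded up) subset size per tree root
mtry.sprout <- floor(sqrt(p))              # (rounded down) mtry in the tree sprouts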

replace

Sample with replacement. Default is FALSE.

sample.fraction

Fraction of observations to sample for each tree. Default is 1 for sampling with replacement and 0.7 for sampling without replacement.

case.weights

Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees.

class.weights

Weights for the outcome classes (in order of the factor levels) in the splitting rule (cost sensitive learning). Classification and probability prediction only. For classification the weights are also applied in the majority vote in terminal nodes.

inbag

Manually set observations per tree. List of size num.trees, containing inbag counts for each observation. Can be used for stratified sampling.
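
For example, class-stratified subsampling can be implemented along the following lines. This is a sketch that assumes a manually supplied inbag replaces the internal sampling; the 70% fraction and variable names are illustrative.

## Sketch: class-stratified subsampling via 'inbag', drawing 70%
## of each outcome class, without replacement, for every tree:
n <- nrow(wine)
num.trees <- 20
class.idx <- split(seq_len(n), wine$C)
inbag <- lapply(seq_len(num.trees), function(i) {
  counts <- integer(n)
  drawn <- unlist(lapply(class.idx, function(ix)
    ix[sample.int(length(ix), floor(0.7 * length(ix)))]))
  counts[drawn] <- 1
  counts
})
model <- unityfor(dependent.variable.name = "C", data = wine,
                  num.trees = num.trees, inbag = inbag)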

oob.error

Compute OOB prediction error. Set to FALSE to save computation time.

num.threads

Number of threads. Default is number of CPUs available.

write.forest

Save the unityfor.forest object, required for prediction. Set to FALSE to reduce memory usage if no prediction is intended.

verbose

Show computation status and estimated runtime.

Author

Roman Hornung, Marvin N. Wright

Details

There are two reasons why, for datasets with more than 100 variables, the default value of prop.var.root is set to 0.1 rather than to the square root of the number of variables divided by the total number of variables.

First, as the total number of variables increases, the square-root-based proportion decreases. This makes it less likely that the same pairs of variables are selected together in multiple trees. This can be problematic for the unity VIM, particularly for variables that do not have marginal effects on their own but act only through interactions with one or a few other variables. Such variables are informative in tree roots only when they are used jointly with the covariates they interact with. Setting prop.var.root = 0.1 ensures that interacting covariates are selected together sufficiently often in tree roots.
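
A quick numerical illustration of how the square-root-based proportion shrinks with the number of variables p:

p <- c(25, 100, 400, 2500)
sqrt(p) / p   # 0.20, 0.10, 0.05, 0.02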

Second, this choice reflects the fact that in high-dimensional datasets, typically only a small proportion of variables are informative. Applying the square-root rule in such settings may result in too few informative variables being selected, thereby reducing the likelihood of constructing predictive tree roots.

However, note that results obtained from applications of the unity forest framework to high-dimensional datasets should be interpreted with caution. For high-dimensional data, the curse of dimensionality makes the identification of individual interaction effects challenging and increases the risk of false positives. Moreover, the split points identified in the CRTRs (reprTrees) may become less precise as the number of covariates considered per tree root increases.

References

  • Hornung, R., Hapfelmeier, A. (2026). Unity Forests: Improving Interaction Modelling and Interpretability in Random Forests. arXiv:2601.07003, doi:10.48550/arXiv.2601.07003.

  • Wright, M. N., Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77:1-17, doi:10.18637/jss.v077.i01.

  • Breiman, L. (2001). Random forests. Machine Learning 45:5-32, doi:10.1023/A:1010933404324.

  • Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G., Ziegler, A. (2012). Probability machines: consistent probability estimation using nonparametric learning machines. Methods of Information in Medicine 51:74-81, doi:10.3414/ME00-01-0052.

See Also

predict.unityfor

Examples

## Load package:

library("unityForest")


## Set seed to make results reproducible:

set.seed(1234)


## Load wine dataset:

data(wine)


## Construct unity forest and calculate unity VIM values:

model <- unityfor(dependent.variable.name = "C", data = wine,
                  importance = "unity", num.trees = 20)

# NOTE: num.trees = 20 is far too small for practical purposes; this
# small number of trees is used only to keep the runtime of the
# example short. The default is num.trees = 20000.


## Inspect the ranking of the variables with respect to the unity VIM:

sort(model$variable.importance, decreasing = TRUE)


## Prediction:

# Randomly split the 'wine' dataset into training
# and test data:
train.idx <- sample(nrow(wine), 2/3 * nrow(wine))
wine_train <- wine[train.idx, ]
wine_test <- wine[-train.idx, ]

# Construct unity forest on training data:
# NOTE again: num.trees = 20 is far too small for practical purposes.
model_train <- unityfor(dependent.variable.name = "C", data = wine_train, 
                        importance = "none", num.trees = 20)
# NOTE: Because we are only interested in prediction here, we do not
# calculate unity VIM values (importance = "none"), which speeds up
# the computation.

# Predict class values of the test data:
pred_wine <- predict(model_train, data = wine_test)

# Compare predicted and true class values of the test data:
table(wine_test$C, levels(wine_train$C)[apply(pred_wine$predictions, 1, which.max)])
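
# Compute the test-set misclassification rate, reusing the
# class-probability matrix returned in pred_wine$predictions:
pred_classes <- levels(wine_train$C)[apply(pred_wine$predictions, 1, which.max)]
mean(pred_classes != wine_test$C)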
