
unityForest (version 0.1.0)

unityfor: Construct a unity forest prediction rule and compute the unity VIM.

Description

Constructs a unity forest and computes the unity variable importance measure (VIM), as described in Hornung & Hapfelmeier (2026). Currently, only categorical outcomes are supported.
The unity forest algorithm is a tree construction approach for random forests in which the first few splits are optimized jointly in order to more effectively capture interaction effects beyond marginal effects. The unity VIM quantifies the influence of each variable under the conditions in which that influence is strongest, thereby placing a stronger emphasis on interaction effects than conventional variable importance measures.
To explore the nature of the effects identified by the unity VIM, it is essential to examine covariate-representative tree roots (CRTRs), which are implemented in reprTrees.

Usage

unityfor(
  formula = NULL,
  dependent.variable.name = NULL,
  data = NULL,
  num.trees = 20000,
  num.cand.trees = 500,
  probability = TRUE,
  importance = "none",
  prop.best.splits = NULL,
  min.node.size.root = NULL,
  min.node.size = NULL,
  max.depth.root = NULL,
  max.depth = NULL,
  prop.var.root = NULL,
  mtry.sprout = NULL,
  replace = FALSE,
  sample.fraction = ifelse(replace, 1, 0.7),
  case.weights = NULL,
  class.weights = NULL,
  inbag = NULL,
  oob.error = TRUE,
  num.threads = NULL,
  write.forest = TRUE,
  verbose = TRUE
)

Value

Object of class unityfor with elements

predictions

Predicted classes/values, based on out-of-bag samples.

forest

Saved forest (if write.forest is set to TRUE). Note that the variable IDs in the split.varIDs object do not necessarily represent the column numbers in R.

data

Training data.

variable.importance

Variable importance for each independent variable. Only available if importance is not "none".

importance.mode

Importance mode used.

prediction.error

Overall out-of-bag prediction error. For classification, this is the fraction of misclassified samples; for probability estimation, the Brier score; and for regression, the mean squared error.
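
As a rough illustration, one common form of the multiclass Brier score can be computed by hand as follows. This is a sketch only: probs and y stand for the OOB class-probability matrix and the observed factor outcome, and the package's exact convention may differ by a constant factor.

## Sketch: multiclass Brier score from an OOB class-probability
## matrix 'probs' (rows = samples, columns = classes) and the
## observed factor outcome 'y':
y01 <- model.matrix(~ y - 1)      # one-hot indicator matrix of the classes
mean(rowSums((probs - y01)^2))    # average squared deviation per sample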

confusion.matrix

Contingency table for classes and predictions based on out-of-bag samples (classification only).

call

Function call.

num.trees

Number of trees.

num.cand.trees

Number of candidate trees generated for each tree root.

num.independent.variables

Number of independent variables.

num.samples

Number of samples.

prop.var.root

Proportion of variables randomly sampled for each tree root.

mtry

Value of mtry used (in the tree sprouts).

max.depth.root

Maximal depth of the tree roots.

min.node.size.root

Minimal node size in the tree roots.

min.node.size

Value of minimal node size used.

splitrule

Splitting rule (used only in the tree sprouts).

replace

Sample with replacement.

treetype

Type of forest/tree. Classification or regression.

Arguments

formula

Object of class formula or character describing the model to fit. Interaction terms supported only for numerical variables.

dependent.variable.name

Name of outcome variable, needed if no formula given.

data

Training data of class data.frame, matrix, dgCMatrix (Matrix) or gwaa.data (GenABEL).

num.trees

Number of trees. Default is 20000.

num.cand.trees

Number of random candidate trees to generate for each tree root. Default is 500.

probability

Grow a probability forest as in Malley et al. (2012). (NOTE: currently only probability forests are implemented; this will change in a future version.)

importance

Variable importance mode, either 'unity' (unity VIM) or 'none'.

prop.best.splits

Related to the unity VIM. The default value should generally not be modified by the user. When calculating the unity VIM, only the top prop.best.splits × 100% of the splits (those with the highest split criterion values, weighted by node size) are considered for each variable. The default value is 0.01, meaning that only the top 1% of splits are used. While small values are recommended, they should not be set too low, so that each variable retains a sufficient number of splits for a reliable unity VIM computation.
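
Schematically, the selection rule for a single variable can be sketched as follows (illustrative values only, not the package's internal code):

## Sketch of the selection rule, using made-up criterion values:
crit <- runif(1000)                                # split criterion values
node.size <- sample(10:200, 1000, replace = TRUE)  # sizes of the split nodes
prop.best.splits <- 0.01
n.keep <- ceiling(prop.best.splits * length(crit))
best <- order(crit * node.size, decreasing = TRUE)[seq_len(n.keep)]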

min.node.size.root

Minimal node size in the tree roots. Default is 10 irrespective of the outcome type.

min.node.size

Minimal node size. Default is 1 for classification and 5 for probability estimation.

max.depth.root

Maximal depth of the tree roots. Default value is 3 and should generally not be modified by the user. Larger values can be associated with worse predictive performance for some datasets.

max.depth

Maximal tree depth. A value of NULL (the default) or 0 corresponds to unlimited depth, and 1 to tree stumps (one split per tree). If set, max.depth must be at least as large as max.depth.root.

prop.var.root

Proportion of variables randomly sampled for constructing each tree root. The default is the square root of the number of variables divided by the total number of variables, i.e., sqrt(p)/p for p variables. Consequently, by default, each tree root considers a random subset of variables whose size equals the (rounded up) square root of the total number of variables. An exception is made for datasets with more than 100 variables, where the default for prop.var.root is set to 0.1; see the 'Details' section below for the rationale.

mtry.sprout

Number of randomly sampled variables to possibly split at in each node of the tree sprouts (i.e., the branches of the trees beyond the tree roots). Default is the (rounded down) square root of the number of variables.
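
The default rules for prop.var.root and mtry.sprout described above can be written out explicitly. The following sketch assumes, based on the wording above, that the root subset size is obtained by rounding up; p denotes the number of independent variables:

p <- 50                                    # example: number of independent variables
prop.var.root <- if (p > 100) 0.1 else sqrt(p) / p
n.root.vars <- ceiling(prop.var.root * p)  # (rounded up) subset size per tree root
mtry.sprout <- floor(sqrt(p))              # (rounded down) mtry in the tree sprouts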

replace

Sample with replacement. Default is FALSE.

sample.fraction

Fraction of observations to sample for each tree. Default is 1 for sampling with replacement and 0.7 for sampling without replacement.

case.weights

Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees.

class.weights

Weights for the outcome classes (in order of the factor levels) in the splitting rule (cost sensitive learning). Classification and probability prediction only. For classification the weights are also applied in the majority vote in terminal nodes.

inbag

Manually set observations per tree. List of size num.trees, containing inbag counts for each observation. Can be used for stratified sampling.
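
For example, class-stratified subsampling can be implemented along the following lines. This is a sketch that assumes a manually supplied inbag replaces the internal sampling; the 70% fraction and variable names are illustrative.

## Sketch: class-stratified subsampling via 'inbag', drawing 70%
## of each outcome class, without replacement, for every tree:
n <- nrow(wine)
num.trees <- 20
class.idx <- split(seq_len(n), wine$C)
inbag <- lapply(seq_len(num.trees), function(i) {
  counts <- integer(n)
  drawn <- unlist(lapply(class.idx, function(ix)
    ix[sample.int(length(ix), floor(0.7 * length(ix)))]))
  counts[drawn] <- 1
  counts
})
model <- unityfor(dependent.variable.name = "C", data = wine,
                  num.trees = num.trees, inbag = inbag)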

oob.error

Compute OOB prediction error. Set to FALSE to save computation time.

num.threads

Number of threads. Default is number of CPUs available.

write.forest

Save the unityfor.forest object, required for prediction. Set to FALSE to reduce memory usage if no prediction is intended.

verbose

Show computation status and estimated runtime.

Author

Roman Hornung, Marvin N. Wright

Details

There are two reasons why, for datasets with more than 100 variables, the default value of prop.var.root is set to 0.1 rather than to the square root of the number of variables divided by the total number of variables.

First, as the total number of variables increases, the square-root-based proportion decreases. This makes it less likely that the same pairs of variables are selected together in multiple trees. This can be problematic for the unity VIM, particularly for variables that do not have marginal effects on their own but act only through interactions with one or a few other variables. Such variables are informative in tree roots only when they are used jointly with the covariates they interact with. Setting prop.var.root = 0.1 ensures that interacting covariates are selected together sufficiently often in tree roots.
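
A quick numerical illustration of how the square-root-based proportion shrinks with the number of variables p:

p <- c(25, 100, 400, 2500)
sqrt(p) / p   # 0.20, 0.10, 0.05, 0.02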

Second, this choice reflects the fact that in high-dimensional datasets, typically only a small proportion of variables are informative. Applying the square-root rule in such settings may result in too few informative variables being selected, thereby reducing the likelihood of constructing predictive tree roots.

However, note that results obtained from applications of the unity forest framework to high-dimensional datasets should be interpreted with caution. For high-dimensional data, the curse of dimensionality makes the identification of individual interaction effects challenging and increases the risk of false positives. Moreover, the split points identified in the CRTRs (reprTrees) may become less precise as the number of covariates considered per tree root increases.

References

  • Hornung, R., Hapfelmeier, A. (2026). Unity Forests: Improving Interaction Modelling and Interpretability in Random Forests. arXiv:2601.07003, doi:10.48550/arXiv.2601.07003.

  • Wright, M. N., Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77:1-17, doi:10.18637/jss.v077.i01.

  • Breiman, L. (2001). Random forests. Machine Learning 45:5-32, doi:10.1023/A:1010933404324.

  • Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G., Ziegler, A. (2012). Probability machines: consistent probability estimation using nonparametric learning machines. Methods of Information in Medicine 51:74-81, doi:10.3414/ME00-01-0052.

See Also

predict.unityfor

Examples

## Load package:

library("unityForest")


## Set seed to make results reproducible:

set.seed(1234)


## Load wine dataset:

data(wine)


## Construct unity forest and calculate unity VIM values:

model <- unityfor(dependent.variable.name = "C", data = wine,
                  importance = "unity", num.trees = 20)

# NOTE: num.trees = 20 is far too small for practical purposes; this
# small number of trees is used only to keep the runtime of the
# example short. The default is num.trees = 20000.


## Inspect the ranking of the variables with respect to the unity VIM:

sort(model$variable.importance, decreasing = TRUE)


## Prediction:

# Randomly split the 'wine' dataset into training
# and test data:
train.idx <- sample(nrow(wine), 2/3 * nrow(wine))
wine_train <- wine[train.idx, ]
wine_test <- wine[-train.idx, ]

# Construct unity forest on training data:
# NOTE again: num.trees = 20 is far too small for practical purposes.
model_train <- unityfor(dependent.variable.name = "C", data = wine_train, 
                        importance = "none", num.trees = 20)
# NOTE: Because we are only interested in prediction here, we do not
# calculate unity VIM values (importance = "none"), which speeds up
# the computation.

# Predict class values of the test data:
pred_wine <- predict(model_train, data = wine_test)

# Compare predicted and true class values of the test data:
table(wine_test$C, levels(wine_train$C)[apply(pred_wine$predictions, 1, which.max)])
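
# Compute the test-set misclassification rate, reusing the
# class-probability matrix returned in pred_wine$predictions:
pred_classes <- levels(wine_train$C)[apply(pred_wine$predictions, 1, which.max)]
mean(pred_classes != wine_test$C)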
