PPtreeExt_split: Projection Pursuit Classification Tree with Random Variable Selection

Description

Constructs a projection pursuit classification tree using one of several projection pursuit indices. Optionally performs random variable selection at each split, which allows the tree to serve as a base learner in a random forest methodology. When size.p = 1, all variables are used at each split and the algorithm reduces to PPtree.

Usage

PPtreeExt_split(
  formula,
  data,
  PPmethod = "LDA",
  size.p = 1,
  lambda = 0.1,
  entro = FALSE,
  entroindiv = FALSE,
  ...
)

Value

An object of class "PPtreeclass", which is a list containing:

Tree.Struct: A matrix defining the tree structure of the projection pursuit classification tree. Each row represents a node with columns: node ID, left child node ID, right child node ID (or final class if terminal), coefficient ID, and index value.
projbest.node: A matrix in which each row contains the optimal 1-dimensional projection coefficients for one split node. The number of columns equals the number of predictor variables.
splitCutoff.node: A data frame containing the cutoff values and splitting rules for each split node. Contains 8 rule columns defining the classification boundaries.
origclass: Factor vector of the original class labels from the input data.
origdata: Matrix of the original predictor variables (without the class variable).

Arguments

formula: A formula of the form class ~ x1 + x2 + ... where class is the factor variable containing class labels and x1, x2, ... are the predictor variables.
data: Data frame containing both the class variable and predictor variables.
PPmethod: Character string specifying the projection pursuit index to use. Either "LDA" (Linear Discriminant Analysis, default) or "PDA" (Penalized Discriminant Analysis).
size.p: Numeric value between 0 and 1 specifying the proportion of variables to randomly sample at each split. Default is 1, which uses all variables at each split (standard PPtree). Values less than 1 introduce randomness similar to random forests, which can improve robustness and reduce overfitting.
lambda: Numeric penalty parameter for the PDA index, ranging from 0 to 1. When lambda = 0, no penalty is applied and PDA equals LDA. When lambda = 1, all variables are treated as uncorrelated. Default is 0.1. Only used when PPmethod = "PDA".
entro: Logical indicating whether to use entropy-based stopping rules for tree construction. Default is FALSE.
entroindiv: Logical indicating whether to compute entropy for each individual observation in the 1D projection. Default is FALSE.
...: Additional arguments to be passed to internal tree construction methods.
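
The effect of lambda can be illustrated with a short base-R sketch. It assumes the PDA penalty shrinks the off-diagonal entries of the within-class sum-of-squares matrix by a factor of (1 - lambda), a form chosen because it reproduces the two limiting cases described for lambda above. The helper names within_ss and penalize_within are hypothetical and are not part of the package.

```r
# Hypothetical helpers for illustration only; not part of PPtreeExt.

# Within-class sum-of-squares matrix W for predictors x and class labels cls.
within_ss <- function(x, cls) {
  x <- as.matrix(x)
  W <- matrix(0, ncol(x), ncol(x), dimnames = list(colnames(x), colnames(x)))
  for (g in unique(cls)) {
    xg <- x[cls == g, , drop = FALSE]
    centered <- sweep(xg, 2, colMeans(xg))  # subtract the class mean
    W <- W + crossprod(centered)
  }
  W
}

# Assumed form of the penalty: keep the diagonal of W and
# multiply the off-diagonal entries by (1 - lambda).
penalize_within <- function(W, lambda) {
  D <- diag(diag(W), nrow = nrow(W))
  dimnames(D) <- dimnames(W)
  D + (1 - lambda) * (W - D)
}

W  <- within_ss(iris[, 1:4], iris$Species)
W0 <- penalize_within(W, 0)  # lambda = 0: W is unchanged (the LDA case)
W1 <- penalize_within(W, 1)  # lambda = 1: only the diagonal remains
```

Intermediate values of lambda interpolate between these two matrices, so larger values discount the estimated correlations between predictors more heavily.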

Details

This function extends the standard PPtree algorithm by incorporating random variable selection at each split and by defining each split in terms of subsets of the class groups. The algorithm:

At each node, randomly samples size.p * 100% of the predictor variables
Finds the optimal projection using the selected variables and specified index (LDA or PDA)
Determines a cutpoint based on entropy splitting if entropy parameters are set
Recursively splits the data until stopping criteria are met
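
The first two steps above can be sketched in base R for a node that contains two classes. This is a minimal sketch under stated assumptions, not the package's implementation: it assumes the number of sampled variables is round(size.p * p) with a minimum of one, and it uses the classical two-class LDA direction in place of the package's index optimization. The names sample_vars and lda_direction are hypothetical.

```r
# Hypothetical helper: choose a random subset of predictor columns.
# Assumption: the subset size is round(size.p * p), never below 1.
sample_vars <- function(p, size.p) {
  k <- max(1, round(size.p * p))
  sort(sample.int(p, k))
}

# Hypothetical helper: two-class LDA direction, proportional to
# solve(W, m1 - m2), where W is the within-class sum-of-squares matrix.
lda_direction <- function(x, cls) {
  x <- as.matrix(x)
  cls <- droplevels(factor(cls))
  m1 <- colMeans(x[cls == levels(cls)[1], , drop = FALSE])
  m2 <- colMeans(x[cls == levels(cls)[2], , drop = FALSE])
  # Subtract each observation's own class mean
  centered <- x - rbind(m1, m2)[as.integer(cls), , drop = FALSE]
  W <- crossprod(centered)
  a <- solve(W, m1 - m2)
  a / sqrt(sum(a^2))  # scale to unit length
}

set.seed(1)
node <- iris[iris$Species != "setosa", ]       # a two-class node
vars <- sample_vars(p = 4, size.p = 0.5)       # two of the four predictors
a    <- lda_direction(node[, vars], node$Species)
proj <- as.matrix(node[, vars]) %*% a          # 1D projection to be split
```

The cutpoint in the third step would then be chosen along proj. How the package selects it is not shown here.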

The entro parameter enables entropy-based stopping rules that halt splitting when nodes become sufficiently pure or small. The entroindiv parameter computes entropy at the individual observation level in the projected space, which can provide more refined splitting decisions.
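
As a rough illustration of the purity measure behind such a rule, the sketch below computes the Shannon entropy of the class labels in a node. The function name node_entropy is hypothetical, and the thresholds the package actually applies are not stated in this documentation.

```r
# Hypothetical helper: Shannon entropy (in bits) of the class labels in a node.
node_entropy <- function(cls) {
  p <- table(cls) / length(cls)
  p <- p[p > 0]  # drop empty classes so 0 * log(0) never occurs
  -sum(p * log2(p))
}

pure_node  <- node_entropy(c("a", "a", "a", "a"))  # a single class
mixed_node <- node_entropy(c("a", "a", "b", "b"))  # two classes, evenly mixed
```

A pure node has entropy 0 and an evenly mixed two-class node has entropy 1, so a stopping rule of this kind would halt splitting once the entropy of a node falls below some threshold.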

When size.p = 1, all variables are used at each split and the function behaves as a standard PPtree. Values of size.p < 1 introduce randomness that can improve model robustness, especially for high-dimensional data or when building ensemble models.

References

Lee, Y.D., Cook, D., Park, J.W., and Lee, E.K. (2013). PPtree: Projection pursuit classification tree. Electronic Journal of Statistics, 7:1369-1386.

Examples

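library(PPtreeExt)
library(rsample)

# Load the data, drop columns 2, 7 and 8, and remove rows with missing values
data(penguins)
penguins <- na.omit(penguins[, -c(2, 7, 8)])

# Split into training and test sets, stratified by species
penguins_spl <- rsample::initial_split(penguins, strata = species)
penguins_train <- rsample::training(penguins_spl)
penguins_test <- rsample::testing(penguins_spl)

# Fit the tree; tot and tol are not named arguments and are passed on through ...
penguins_ppt2 <- PPtreeExt_split(
  species ~ bill_len + bill_dep + flipper_len + body_mass,
  data = penguins_train, PPmethod = "LDA",
  tot = nrow(penguins_train), tol = 0.5, entro = TRUE
)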
