PPtreeExtclass: Projection pursuit classification tree

Description

Projection Pursuit Classification Tree with Extensions

Usage

PPtreeExtclass(formula, data, PPmethod = "LDA", weight = TRUE,
               lambda = 0.1, srule, tot = nrow(data), tol = 0.5, ...)

Value

An object of class c("PPtreeExtclass", "PPtreeclass"), which is a list containing:

Tree.Struct: A matrix defining the tree structure. Each row represents a node with 5 columns: node ID, left child node ID, right/final node ID (class label if terminal node), coefficient ID (projection index), and optimization index value.
projbest.node: A matrix where each row contains the optimal 1-dimensional projection coefficients for each split node. Each row has length equal to ncol(origdata), defining the projection direction used at that node.
splitCutoff.node: A numeric vector or matrix containing the cutoff values (thresholds) used at each split node for classification decisions.
origclass: Factor vector of the original class labels from the input data.
origdata: Matrix of the original predictor variables (without the class variable).
terms: The terms object from the model frame, preserving the formula structure.
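
As a rough illustration of how these components might be inspected, the sketch below fits a tree on the built-in iris data and looks at each documented element. It assumes the function is loaded from the PPtreeExt package and that iris is an acceptable input; neither is stated on this page, and the printed values are not shown because they depend on the fit.

```r
library(PPtreeExt)  # assumed package providing PPtreeExtclass()

# Illustrative fit on iris (not the data used in the Examples section)
fit <- PPtreeExtclass(
  Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iris, PPmethod = "LDA", srule = TRUE, tol = 0.5
)

class(fit)              # documented as c("PPtreeExtclass", "PPtreeclass")
fit$Tree.Struct         # one row per node, 5 columns
fit$projbest.node       # one projection vector per split node
fit$splitCutoff.node    # cutoff value(s) per split node
head(fit$origclass)     # original class labels
dim(fit$origdata)       # predictors only, class column excluded
```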

Arguments

formula: An object of class "formula" of the form class ~ x1 + x2 + ... where class is the factor variable containing class labels and x1, x2, ... are the predictor variables. Interaction terms (using *) are not supported.
data: Data frame containing both the class variable and predictor variables specified in the formula.
PPmethod: Character string specifying the projection pursuit index to use. Either "LDA" (Linear Discriminant Analysis, default) or "PDA" (Penalized Discriminant Analysis).
weight: Logical indicating whether to use weighted index calculation in LDA and PDA. When TRUE (default), class proportions are accounted for in the optimization.
lambda: Numeric penalty parameter for the PDA index, ranging from 0 to 1. Default is 0.1. Only used when PPmethod = "PDA".
srule: Logical flag for stopping rule. If TRUE (default), uses entropy-based and size-based stopping criteria. If FALSE, stops only when nodes are pure (single class) or empty.
tot: Integer specifying the total number of observations in the original dataset. Default is nrow(data). Used in conjunction with stopping rules to determine minimum node sizes.
tol: Numeric tolerance value for the entropy-based stopping rule. Nodes with entropy below this threshold will not be split further. Default is 0.5. Lower values create deeper trees.
...: Additional arguments to be passed to internal tree construction methods.

Details

Constructs a projection pursuit classification tree, using the LDA or PDA projection pursuit index at each split. This extended version adds customizable stopping rules based on node entropy and node size.

This function builds a binary classification tree where each split is determined by finding an optimal projection of the data onto a one-dimensional space using either LDA or PDA indices. The algorithm works as follows:

Tree Construction Process

1. At each node, find the optimal 1D projection that best separates the classes.
2. Project the data onto this direction and find an optimal cutpoint.
3. Split the observations at the cutpoint into left and right child nodes.
4. Repeat recursively until the stopping criteria are met.
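
A single split can be sketched in base R to make the steps concrete. This is a simplified two-class illustration, not the package's implementation: it assumes the closed-form Fisher (LDA) direction and a cutpoint halfway between the projected class means, and it uses the built-in iris data purely as an example.

```r
# Two-class subset of the built-in iris data, for illustration only
d <- iris[iris$Species != "setosa", ]
X <- as.matrix(d[, 1:4])
y <- droplevels(d$Species)
is_vers <- y == "versicolor"

# Step 1: Fisher/LDA direction a = W^{-1} (m1 - m2), where W is the
# pooled within-class scatter matrix
m1 <- colMeans(X[is_vers, ])
m2 <- colMeans(X[!is_vers, ])
W <- cov(X[is_vers, ]) * (sum(is_vers) - 1) +
  cov(X[!is_vers, ]) * (sum(!is_vers) - 1)
a <- solve(W, m1 - m2)
a <- a / sqrt(sum(a^2))  # normalise to unit length

# Step 2: project onto the direction; cut halfway between the projected
# class means (an assumed, deliberately simple cutpoint rule)
z <- drop(X %*% a)
cutoff <- (sum(m1 * a) + sum(m2 * a)) / 2

# Step 3: assign each observation to a child node
side <- ifelse(z > cutoff, "left", "right")
table(side, y)
```

In a full tree, the same procedure would then be applied again within each child node until a stopping rule is triggered.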

Stopping Rules

When srule = TRUE, a node stops splitting if any of the following conditions hold:

The node is pure (contains only one class)
The node contains 5% or less of the total observations (n/tot <= 0.05)
The node entropy is below the tolerance threshold (entropy < tol)

When srule = FALSE, splitting only stops for pure or empty nodes, potentially creating deeper, more complex trees.
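
The stopping logic described above can be written as a small predicate. The sketch below is an illustration under stated assumptions rather than the package's internal code: `node_entropy` and `should_stop` are hypothetical helper names, and the entropy is assumed to be Shannon entropy with the natural logarithm.

```r
# Shannon entropy of the class labels in a node
# (natural logarithm is an assumption; the page does not state the base)
node_entropy <- function(classes) {
  p <- table(classes) / length(classes)
  p <- p[p > 0]  # drop empty classes so 0 * log(0) never appears
  -sum(p * log(p))
}

# Hypothetical helper: TRUE when the node should become a leaf
should_stop <- function(classes, tot, tol = 0.5, srule = TRUE) {
  n <- length(classes)
  pure_or_empty <- n == 0 || length(unique(classes)) == 1
  if (!srule) return(pure_or_empty)
  pure_or_empty || n / tot <= 0.05 || node_entropy(classes) < tol
}

should_stop(c("a", "a", "a"), tot = 100)       # pure node
should_stop(rep(c("a", "b"), 2), tot = 100)    # 4 of 100 observations
should_stop(rep(c("a", "b"), 25), tot = 100)   # large, evenly mixed node
should_stop(rep(c("a", "b"), 2), tot = 100, srule = FALSE)
```

With these assumptions, the first two calls flag the node as a leaf (pure, then too small), while the last two allow a further split: the evenly mixed node has entropy log(2), which exceeds the default tol of 0.5, and with srule = FALSE only purity or emptiness stops the recursion.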

Projection Methods

LDA: Suitable for most classification problems with moderate dimensionality
PDA: Recommended for high-dimensional data (p > n) or data with multicollinearity

The tol parameter controls tree complexity: smaller values allow more splits (deeper trees with potentially better training accuracy but higher risk of overfitting), while larger values create simpler trees (better generalization but potentially underfitting).

References

Lee, Y. D., Cook, D., Park, J. W., and Lee, E. K. (2013). PPtree: Projection pursuit classification tree. Electronic Journal of Statistics, 7:1369-1386.

Examples

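library(PPtreeExt)  # provides PPtreeExtclass()
library(rsample)    # provides initial_split(), training(), testing()

set.seed(234)
data(penguins)
# Drop columns 2, 7 and 8, then remove rows with missing values
penguins <- na.omit(penguins[, -c(2, 7, 8)])

# Train/test split stratified by species
penguins_spl <- rsample::initial_split(penguins, strata = species)
penguins_train <- training(penguins_spl)
penguins_test <- testing(penguins_spl)

# Fit the tree with the LDA index and the entropy/size stopping rules
penguins_ppt <- PPtreeExtclass(
  species ~ bill_len + bill_dep + flipper_len + body_mass,
  data = penguins_train, PPmethod = "LDA",
  tot = nrow(penguins_train), tol = 0.2, srule = TRUE
)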
