Projection Pursuit Classification Tree with Extensions
PPtreeExtclass(formula, data, PPmethod = "LDA", weight = TRUE,
  lambda = 0.1, srule, tot = nrow(data), tol = 0.5, ...)
An object of class c("PPtreeExtclass", "PPtreeclass"), which is a list containing:
A matrix defining the tree structure. Each row represents a node with 5 columns: node ID, left child node ID, right/final node ID (class label if terminal node), coefficient ID (projection index), and optimization index value.
A matrix where each row contains the optimal 1-dimensional
projection coefficients for each split node. Each row has length equal to
ncol(origdata), defining the projection direction used at that node.
A numeric vector or matrix containing the cutoff values (thresholds) used at each split node for classification decisions.
Factor vector of the original class labels from the input data.
Matrix of the original predictor variables (without the class variable).
The terms object from the model frame, preserving the formula structure.
formula: An object of class "formula" of the form class ~ x1 + x2 + ...
where class is the factor variable containing class labels and
x1, x2, ... are the predictor variables. Interaction terms (using *)
are not supported.
data: Data frame containing both the class variable and the predictor variables specified in the formula.
PPmethod: Character string specifying the projection pursuit index to use.
Either "LDA" (Linear Discriminant Analysis, default) or "PDA"
(Penalized Discriminant Analysis).
weight: Logical indicating whether to use the weighted index calculation in LDA
and PDA. When TRUE (default), class proportions are accounted for in the
optimization.
lambda: Numeric penalty parameter for the PDA index, ranging from 0 to 1.
Default is 0.1. Only used when PPmethod = "PDA".
srule: Logical flag for the stopping rule. If TRUE (default), uses entropy-based
and size-based stopping criteria. If FALSE, stops only when nodes are pure
(single class) or empty.
tot: Integer specifying the total number of observations in the original dataset.
Default is nrow(data). Used in conjunction with the stopping rules to determine
minimum node sizes.
tol: Numeric tolerance value for the entropy-based stopping rule. Nodes with entropy below this threshold are not split further. Default is 0.5. Lower values create deeper trees.
...: Additional arguments passed to the internal tree construction methods.
Constructs a projection pursuit classification tree using a projection pursuit index (LDA or PDA) at each split. This extended version adds customizable stopping rules based on entropy and node size criteria.
This function builds a binary classification tree where each split is determined by finding an optimal projection of the data onto a one-dimensional space using either LDA or PDA indices. The algorithm works as follows:
At each node, find the optimal 1D projection that best separates classes
Project the data onto this direction and find an optimal cutpoint
Split observations based on the cutpoint into left and right child nodes
Recursively repeat until stopping criteria are met
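The first three steps can be sketched for a single node in base R. This is only an illustration under stated assumptions, not the package's implementation: it uses the classical two-group LDA direction on two iris species, and it places the cutpoint at the midpoint of the projected class means, which is an assumed rule chosen for simplicity. The recursion in the last step would repeat the same procedure on each child node.

```r
# Illustrative sketch of a single projection-and-cut step on two classes.
# Not the PPtreeExt implementation: the direction is the classical two-group
# LDA direction and the cutpoint rule (midpoint of projected means) is assumed.
data(iris)
two <- droplevels(iris[iris$Species != "setosa", ])
X <- as.matrix(two[, 1:4])
y <- two$Species
g1 <- y == levels(y)[1]
g2 <- y == levels(y)[2]

# Class means and pooled within-class covariance
m1 <- colMeans(X[g1, ])
m2 <- colMeans(X[g2, ])
Sw <- ((sum(g1) - 1) * cov(X[g1, ]) +
       (sum(g2) - 1) * cov(X[g2, ])) / (sum(g1) + sum(g2) - 2)

# Step 1: 1D projection direction, normalised to unit length
a <- solve(Sw, m2 - m1)
a <- a / sqrt(sum(a^2))

# Step 2: project the data and choose a cutpoint
z <- as.vector(X %*% a)
cutpoint <- (mean(z[g1]) + mean(z[g2])) / 2

# Step 3: send observations to the left or right child
left <- z < cutpoint
table(side = ifelse(left, "left", "right"), class = y)
```

Because the direction points from the first class mean towards the second, observations projected below the cutpoint are mostly from the first class.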
When srule = TRUE, a node stops splitting if any of the following conditions hold:
The node is pure (contains only one class)
The node contains at most 5% of the total observations (n/tot <= 0.05)
The node entropy is below the tolerance threshold (entropy < tol)
When srule = FALSE, splitting only stops for pure or empty nodes, potentially
creating deeper, more complex trees.
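The stopping conditions described above can be written as a small check. The helper names node_entropy and should_stop are hypothetical and not part of the package, and the entropy here is assumed to be Shannon entropy with natural logarithms; the package may define or scale it differently.

```r
# Hypothetical helpers mirroring the stopping conditions described above.
# Assumption: Shannon entropy with natural logarithms.
node_entropy <- function(classes) {
  p <- table(classes) / length(classes)
  p <- p[p > 0]                      # drop empty classes so 0 * log(0) never occurs
  -sum(p * log(p))
}

should_stop <- function(classes, tot, tol = 0.5, srule = TRUE) {
  n <- length(classes)
  # Pure or empty nodes always stop, regardless of srule
  if (n == 0 || length(unique(classes)) == 1) return(TRUE)
  # With srule = FALSE, nothing else stops the splitting
  if (!srule) return(FALSE)
  # Size-based and entropy-based criteria
  (n / tot <= 0.05) || (node_entropy(classes) < tol)
}

should_stop(rep("a", 10), tot = 100)           # pure node: TRUE
should_stop(c("a", "b", "a", "b"), tot = 100)  # 4% of the data: TRUE
should_stop(rep(c("a", "b"), 20), tot = 100)   # balanced, 40% of the data: FALSE
```

In the last call the node entropy is log(2), roughly 0.69, which exceeds the default tol of 0.5, so the node would still be split.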
LDA: Suitable for most classification problems with moderate dimensionality
PDA: Recommended for high-dimensional data (p > n) or data with multicollinearity
The tol parameter controls tree complexity: smaller values allow more splits
(deeper trees with potentially better training accuracy but higher risk of overfitting),
while larger values create simpler trees (better generalization but potentially
underfitting).
Lee, Y. D., Cook, D., Park, J. W., and Lee, E. K. (2013). PPtree: Projection Pursuit Classification Tree. Electronic Journal of Statistics, 7, 1369-1386.
TreeExt.construct, PPtreeExt_split,
findproj_Ext, predict.PPtreeExtclass
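library(PPtreeExt)
library(rsample)

set.seed(234)
data(penguins)
# Drop island, sex, and year (columns 2, 7, 8), then remove incomplete rows
penguins <- na.omit(penguins[, -c(2, 7, 8)])

# Stratified train/test split on species
penguins_spl <- rsample::initial_split(penguins, strata = species)
penguins_train <- training(penguins_spl)
penguins_test <- testing(penguins_spl)

# Fit an LDA-based tree with the entropy and node-size stopping rules enabled
penguins_ppt <- PPtreeExtclass(
  species ~ bill_len + bill_dep + flipper_len + body_mass,
  data = penguins_train, PPmethod = "LDA",
  tot = nrow(penguins_train), tol = 0.2, srule = TRUE
)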