Constructs a projection pursuit classification tree using various projection pursuit
indices. Optionally performs random variable selection at each split, which can be used as a building block in a random forest methodology. When size.p = 1, this reduces to the standard PPtree algorithm.
PPtreeExt_split(
formula,
data,
PPmethod = "LDA",
size.p = 1,
lambda = 0.1,
entro = FALSE,
entroindiv = FALSE,
...
)

An object of class "PPtreeclass", which is a list containing:
A matrix defining the tree structure of the projection pursuit classification tree. Each row represents a node with columns: node ID, left child node ID, right child node ID (or final class if terminal), coefficient ID, and index value.
A matrix where each row contains the optimal 1-dimensional projection coefficients for each split node. The number of columns equals the number of predictor variables.
A data frame containing the cutoff values and splitting rules for each split node. Contains 8 rule columns defining the classification boundaries.
Factor vector of the original class labels from the input data.
Matrix of the original predictor variables (without the class variable).
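The component names used below (Tree.Struct, projbest.node, splitCutoff.node, origclass, origdata) are assumed from the PPtreeViz "PPtreeclass" convention rather than stated on this page; a minimal sketch of inspecting a fitted object, with str() to confirm the actual names:
# Minimal sketch: inspect the returned "PPtreeclass" list.
# Component names are assumed; str(fit) shows the real ones.
fit <- PPtreeExt_split(Species ~ Sepal.Length + Sepal.Width +
    Petal.Length + Petal.Width, data = iris, PPmethod = "LDA")
str(fit)                # actual component names and dimensions
fit$Tree.Struct         # tree structure matrix (assumed name)
fit$projbest.node       # per-node projection coefficients (assumed name)
fit$splitCutoff.node    # cutoff values and splitting rules (assumed name)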
A formula of the form class ~ x1 + x2 + ... where class
is the factor variable containing class labels and x1, x2, ... are the
predictor variables.
Data frame containing both the class variable and predictor variables.
Character string specifying the projection pursuit index to use.
Either "LDA" (Linear Discriminant Analysis, default) or "PDA"
(Penalized Discriminant Analysis).
Numeric value between 0 and 1 specifying the proportion of variables to randomly sample at each split. Default is 1, which uses all variables at each split (standard PPtree). Values less than 1 introduce randomness similar to random forests, which can improve robustness and reduce overfitting.
Numeric penalty parameter for the PDA index, ranging from 0 to 1.
When lambda = 0, no penalty is applied and PDA equals LDA. When
lambda = 1, all variables are treated as uncorrelated. Default is 0.1.
Only used when PPmethod = "PDA".
Logical indicating whether to use entropy-based stopping rules for
tree construction. Default is FALSE.
Logical indicating whether to compute entropy for each individual
observation in the 1D projection. Default is FALSE.
Additional arguments to be passed to internal tree construction methods.
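For illustration, a hedged example of a PDA fit; the iris data and lambda = 0.3 are arbitrary choices for demonstration, not recommendations:
# Sketch: penalized discriminant index with a mild penalty.
fit_pda <- PPtreeExt_split(Species ~ Sepal.Length + Sepal.Width +
    Petal.Length + Petal.Width, data = iris,
  PPmethod = "PDA", lambda = 0.3)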
This function extends the standard PPtree algorithm by incorporating random variable selection at each split and by defining each split based on subsetting the class groups. The algorithm proceeds as follows (the variable-sampling step is sketched in code after the list):
At each node, randomly samples size.p * 100% of the predictor variables
Finds the optimal projection using the selected variables and specified index (LDA or PDA)
Determines a cutpoint based on entropy splitting if entropy parameters are set
Recursively splits the data until stopping criteria are met
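A minimal sketch of the per-node variable-sampling step; this is not the package's internal code and the helper name is hypothetical:
# Sketch only: draw size.p * 100% of the predictors as split candidates.
sample_split_vars <- function(p, size.p) {
  m <- max(1, ceiling(size.p * p))  # number of candidate predictors at this node
  sort(sample(seq_len(p), m))       # indices of predictors tried at this split
}
sample_split_vars(p = 10, size.p = 0.4)  # e.g. 4 of 10 predictors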
The entro parameter enables entropy-based stopping rules that halt splitting
when nodes become sufficiently pure or small. The entroindiv parameter computes
entropy at the individual observation level in the projected space, which can provide
more refined splitting decisions.
When size.p = 1, all variables are used at each split and the function
behaves as a standard PPtree. Values of size.p < 1 introduce randomness
that can improve model robustness, especially for high-dimensional data or when
building ensemble models.
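As a sketch of the ensemble use case, the loop below grows a few randomized trees on bootstrap samples with size.p = 0.5; it is illustrative only and uses no aggregation helper from the package:
# Sketch: a small ensemble of randomized projection pursuit trees.
set.seed(1)
trees <- lapply(1:5, function(i) {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]  # bootstrap sample
  PPtreeExt_split(Species ~ Sepal.Length + Sepal.Width +
      Petal.Length + Petal.Width, data = boot,
    PPmethod = "LDA", size.p = 0.5)  # half the predictors per split
})
length(trees)  # five randomized fits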
Lee, Y. D., Cook, D., Park, J. W., and Lee, E. K. (2013). PPtree: Projection pursuit classification tree. Electronic Journal of Statistics, 7:1369-1386.
TreeExt.construct, findproj_Ext,
LDAopt_Ext, PDAopt_Ext
data(penguins)
# Keep the species label and the four body measurements; drop rows with missing values
penguins <- na.omit(penguins[, -c(2, 7, 8)])
require(rsample)
# Stratified train/test split by species
penguins_spl <- rsample::initial_split(penguins, strata = species)
penguins_train <- training(penguins_spl)
penguins_test <- testing(penguins_spl)
# Fit a projection pursuit tree with the LDA index and entropy-based stopping;
# tot and tol are passed through ... to the internal construction routines
penguins_ppt2 <- PPtreeExt_split(species ~ bill_len + bill_dep +
    flipper_len + body_mass,
  data = penguins_train, PPmethod = "LDA",
  tot = nrow(penguins_train), tol = 0.5, entro = TRUE)