Projection pursuit classification random forest

require(PPforest) require(dplyr) require(RColorBrewer) require(GGally) require(gridExtra) require(PPtreeViz) library(ggplot2) library(knitr) set.seed(310756) #reproducibility
knitr::opts_chunk$set(message = FALSE, warning = FALSE, cache = TRUE, autodep=TRUE, cache.lazy=FALSE ) opts_knit$set(eval.after = 'fig.cap') theme_set(theme_bw(base_family="serif"))

Introduction

The PPforest package (projection pursuit random forest) contains functions to run a projection pursuit random forest for classification problems. This method utilize combinations of variables in each tree construction. In a random forest each split is based on a single variable, chosen from a subset of predictors. In the PPforest, each split is based on a linear combination of randomly chosen variables. The linear combination is computed by optimizing a projection pursuit index, to get a projection of the variables that best separates the classes. The PPforest uses the PPtree algorithm, which fits a single tree to the data. Utilizing linear combinations of variables to separate classes takes the correlation between variables into account, and can outperform the basic forest when separations between groups occurs on combinations of variables. Two projection pursuit indexes, LDA and PDA, are used for PPforest.

To improve the speed performance PPforest package, PPtree algorithm was translated to Rcpp. PPforest package utilizes a number of R packages some of them included in "suggests" not to load them all at package start-up.

You can install the package from CRAN:

install.package(PPforest) library(PPforest)

Or the development version of PPforest can be installed from github using:

library(devtools) install_github("natydasilva/PPforest") library(PPforest)

##Projection pursuit classification forest

In PPforest, projection pursuit classification trees are used as the individual model to be combined in the forest. The original algorithm is in PPtreeViz package, we translate the original tree algorithm into Rcpp to improve the speed performance to run the forest.

One important characteristic of PPtree is that treats the data always as a two-class system, when the classes are more than two the algorithm uses a two step projection pursuits optimization in every node split. Let $(X_i,y_i)$ the data set, $X_i$ is a p-dimensional vector of explanatory variables and $y_i\in {1,2,\ldots G}$ represents class information with $i=1,\ldots n$.

In the first step optimize a projection pursuit index to find an optimal one-dimension projection $\alpha^$ for separating all classes in the current data. With the projected data redefine the problem in a two class problem by comparing means, and assign a new label $G1$ or $G2$ to each observation, a new variable $y_i^$ is created. The new groups $G1$ and $G2$ can contain more than one original classes. Next step is to find an optimal one-dimensional projection $\alpha$, using $(X_i,yi^*)$ to separate the two class problem $G1$ and $G2$. The best separation of $G1$ and $G2$ is determine in this step and the decision rule is defined for the current node, if $\sum{i=1}^p \alpha_i M1< c$ then assign $G1$ to the left node else assign $G2$ to the right node, where $M1$ is the mean of $G1$. For each groups we can repeat all the previous steps until $G1$ and $G2$ have only one class from the original classes. Base on this process to grow the tree, the depth of PPtree is at most the number of classes because one class is assigned only to one final node.

Trees from PPtree algorithm are simple, they use the association between variables to find separation. If a linear boundary exists, PPtree produces a tree without misclassification.

Projection pursuit random forest algorithm description

  1. Let N the number of cases in the training set $\Theta=(X,Y)$, $B$ bootstrap samples from the training set are taking (samples of size N with replacement).

  2. For each bootstrap sample a \verb PPtree is grown to the largest extent possible $h(x, {\Theta_k})$. No pruning. This tree is grown using step 3 modification.

  3. Let M the number of input variables, a number of $m<<M$ variables are selected at random at each node and the best split based on a linear combination of these randomly chosen variables. The linear combination is computed by optimizing a projection pursuit index, to get a projection of the variables that best separates the classes.

  4. Predict the classes of each case not included in the bootstrap sample and compute oob error.

  5. Based on majority vote predict the class for new data.

###Overview PPforest package

PPforest package implements a classification random forest using projection pursuit classification trees. The following table present all the functions in PPforest package.

Function Description
baggtree For each bootstrap sample grow a projection pursuit tree (PPtree object).
node_data Data structure with the projected and boundary by node and class
permute_importance Obtain the permuted importance variable measure
ppf_avg_imp Computes a global importance measure for a PPforest object, average importance measure for a pptree over all the trees.
PPclassify Predict class for the test set and calculate prediction error after finding the PPtree structure
ppf_global_imp Computes a global importance measure for a PPforest object
PPforest Runs a Projection pursuit random forest
PPtree_split Projection pursuit classification tree with random variable selection in each split
print.PPforest Print PPforest object
predict.PPforest Predict class for the test set and calculate prediction error
ternary_str Data structure with the projected and boundary by node and class
tree_pred Obtain predicted class for new data from baggtree function or PPforest

Also PPforest package includes some data set that were used to test the predictive performance of our method. The data sets included are: crab, fishcatch, glass, image, leukemia, lymphoma NCI60, parkinson and wine.

Example

Australian crab data set will be used as example. This data contains measurements on rock crabs of the genus Leptograpsus. There are 200 observations from two species (blue and orange) and for each specie (50 in each one) there are 50 males and 50 females. Class variable has 4 classes with the combinations of specie and sex (BlueMale, BlueFemale, OrangeMale and OrangeFemale). The data were collected on site at Fremantle, Western Australia. For each specimen, five measurements were made, using vernier calipers.

  1. FL the size of the frontal lobe length, in mm
  2. RW rear width, in mm
  3. CL length of mid line of the carapace, in mm
  4. CW maximum width of carapace, in mm
  5. BD depth of the body; for females, measured after displacement of the abdomen, in mm

To visualize this data set we use a scatterplot matrix from the package GGally

 

 

a <- GGally::ggpairs(PPforest::crab, columns = 2:6, ggplot2::aes(colour = Type, alpha=.1), lower = list(continuous = 'points'), axisLabels='none', upper=list(continuous='blank') , legend = NULL) capmatrix<-"Scatter plot matrix of crab data " a

 

 

In this figure we can see a strong, positive and linear association between the different variables. Also look like the classes can be separated by linear combinations.

The main function of the package is PPforest which implements a projection pursuit random forest.

PPtree_split this function implements a projection pursuit classification tree with random variable selection in each split, based on the original PPtreeViz algorithm. This function returns a PPtreeclass object. To use this function we need to specify a formula describing the model to be fitted response~predictors (form), data is a data frame with the complete data set. Also we need to specify the method PPmethod, it is the index to use for projection pursuit: 'LDA' or 'PDA', size.p is the proportion of variables randomly sampled in each split. If size.p = 1 a classic PPtreeclass object will be fitted using all the variables in each node partition instead of a subset of them. lambda penalty parameter in PDA index and is between 0 to 1 . he following example fits a projection pursuit classification tree constructed using 0.6 of the variables (3 out of 5) in each node split. We selected LDA method.

Tree.crab <- PPforest::PPtree_split("Type~.", data = crab, PPmethod = "LDA", size.p = 0.6) Tree.crab

PPforest function runs a projection pursuit random forest. The arguments are a data frame with the data information, class with the name of the class variable argument. size.tr to specify the proportion of observations using in the training. Using this function we have the option to split the data in training and test using size.tr directly. size.tr is the proportion of data used in the training and the test proportion will be 1- size.tr. The number of trees in the forest is specified using the argument m. The argument size.p is the sample proportion of the variables used in each node split, PPmethod is the projection pursuit index to be optimized, two options LDA and PDA are available.

pprf.crab <- PPforest::PPforest(data = crab, class = "Type", size.tr = .8, m = 200, size.p = .5, PPmethod = 'LDA', parallel =FALSE, cores = 2) pprf.crab

PPforest print a summary result from the model with the confusion matrix information and the oob-error rate in a similar way randomForest packages does.

This function returns the predicted values of the training data, training error, test error and predicted test values. Also there is the information about out of bag error for the forest and also for each tree in the forest. Bootstrap samples, output of all the trees in the forest from , proximity matrix and vote matrix, number of trees grown in the forest, number of predictor variables selected to use for splitting at each node. Confusion matrix of the prediction (based on OOb data), the training data and test data and vote matrix are also returned.

The printed version of a PPforest object follows the randomForest printed version to make them comparable. Based on confusion matrix, we can observe that the biggest error is for BlueMale class. Most of the wrong classified values are between BlueFemale and BlueMale.

The output from a PPforest object contains a lot of information as we can see in the next output.

str(pprf.crab, max.level = 1 )

For example to get the predicted values for the test data we can use the PPforest output:

pprf.crab$prediction.test

If new data are available you can use the function trees_pred to get the predicted classes by PPforest object.

trees_pred(pprf.crab, xnew = newdata, parallel = TRUE)

The PPforest algorithm calculates variable importance in two ways: (1) permuted importance using accuracy, and (2) importance based on projection coefficients on standardized variables.

The permuted variable importance is comparable with the measure defined in the classical random forest algorithm. It is computed using the out of bag (oob) sample for the tree $k\;\;(B^{(k)})$ for each $X_j$ predictor variable. Then the permuted importance of the variable $X_j$ in the tree $k$ can be defined as:

[ IMP^{(k)}(Xj) = \frac{\sum{i \in B^{(k)} } I(y_i=\hat y_i^{(k)})-I(yi=\hat y{i,P_j}^{(k)})}{|B^{(k)}|} ]

\noindent where $\hat yi^{(k)}$ is the predicted class for the observation $i$ in the tree $k$ and $y{i,P_j}^{(k)}$ is the predicted class for the observation $i$ in the tree $k$ after permuting the values for variable $X_j$. The global permuted importance measure is the average importance over all the trees in the forest. This measure is based on comparing the accuracy of classifying out-of-bag observations, using the true class with permuted (nonsense) class. To compute this measure you should use permute_importance function.  

 

impo1 <- permute_importance(pprf.crab) impo1

 

 

ggplot(impo1, aes(x = imp, y = nm) ) + geom_point() capimp1 <- "Permuted importance variable"

 

 

This function returns a data frame with permuted importance measures, imp is the permuted importance measure defined in Brieman paper, imp2 is the permuted importance measure defined in randomForest package, the standard deviation (sd.im and sd.imp2) for each measure is computed and the also the standardized measure.

For the second importance measure, the coefficients of each projection are examined. The magnitude of these values indicates importance, if the variables have been standardized. The variable importance for a single tree is computed by a weighted sum of the absolute values of the coefficients across nodes. The weights takes the number of classes in each node into account [@lee2013pptree]. Then the importance of the variable $X_j$ in the PPtree $k$ can be defined as:

[ IMP_{pptree}^{(k)}(Xj)=\sum{nd = 1}^{nn}\frac{|\alpha{nd}^{(k)}|}{cl{nd} } ]

Where $\alpha_{nd}^{(k)}$ is the projected coefficient for node $ns$ and variable $k$ and $nn$ the total number of node partitions in the tree $k$.

The global variable importance in a PPforest then can be defined in different ways. The most intuitive is the average variable importance from each PPtree across all the trees in the forest.

[ IMP_{ppforest1}(Xj)=\frac{\sum{k=1}^K IMP_{pptree}^{(k)}(X_j)}{K} ]

Alternatively we have defined a global importance measure for the forest as a weighted mean of the absolute value of the projection coefficients across all nodes in every tree. The weights are based on the projection pursuit indexes in each node ($Ix_{nd}$), and 1-(OOB-error of each tree)($acc_k$).

[IMP_{ppforest2}(Xj)=\frac{\sum{k=1}^K acck \sum{nd = 1}^{nn}\frac{Ix{nd}|\alpha{nd}^{(k)}|}{nn }}{K} ]  

 

impo2 <- ppf_avg_imp(pprf.crab, "Type") impo2

 

 

ggplot(impo2, aes(x = mean, y = variable) ) + geom_point() capimp2<- "Average importance variable"

 

 

Finally you can get the last importance measure we have proposed for the PPforest using `ppf_global_imp' function.

 

 

impo3 <- ppf_global_imp(data = crab, class = "Type", pprf.crab) impo3

 

 

ggplot(impo3, aes(x = mean, y = variable) ) + geom_point() capimp3 <- "Global importance variable"

Using the information available in the PPforest object, some visualization can be done. I will include some useful examples to visualize the data and some of the most important diagnostics in a forest structure.

To describe the data structure a parallel plot can be done, the data were standardized and the color represents the class variable.

 

 

parallel <- function(ppf){ myscale <- function(x) (x - mean(x)) / sd(x) scale.dat <- ppf$train %>% dplyr::mutate_at(dplyr::vars(-matches(ppf$class.var)), dplyr::funs(myscale)) scale.dat.melt <- scale.dat %>% dplyr::mutate(ids = 1:nrow(ppf$train)) %>% tidyr::gather(var,Value,-Type,-ids) scale.dat.melt$Variables <- as.numeric(as.factor(scale.dat.melt$var)) colnames(scale.dat.melt)[1] <- "Class" ggplot2::ggplot(scale.dat.melt, ggplot2::aes(x = Variables, y = Value, group = ids, key = ids, colour = Class, var = var)) + ggplot2::geom_line(alpha = 0.3) + ggplot2::scale_x_discrete(limits = levels(as.factor(scale.dat.melt$var)), expand = c(0.01,0.01)) + ggplot2::ggtitle("Data parallel plot ") + ggplot2::theme(legend.position = "none", axis.text.x = element_text(angle = 90, vjust = 0.5)) + ggplot2::scale_colour_brewer(type = "qual", palette = "Dark2") } capar <-"Parallel coordinate plot of crab data" parallel(pprf.crab)

 

 

ternary_str is an auxiliary functions in PPforest to get the data structure needed to do a ternary plot or a generalized ternary plot if more than 3 classes are available. Because the PPforest is composed of many tree fits on subsets of the data, a lot of statistics can be calculated to analyze as a separate data set, and better understand how the model is working. Some of the diagnostics of interest are: variable importance, OOB error rate, vote matrix and proximity matrix.

With a decision tree we can compute for every pair of observations the proximity matrix. This is a $nxn$ matrix where if two cases $k_i$ and $k_j$ are in the same terminal node increase their proximity by one, at the end normalize the proximities by dividing by the number of trees. To visualize the proximity matrix we use a scatter plot with information from multidimensional scaling method. In this plot color indicates the true species and sex. For this data two dimensions are enough to see the four groups separated quite well. Some crabs are clearly more similar to a different group, though, especially in examining the sex differences.

 

 

mdspl2d <- function(ppf, lege = "bottom", siz = 3, k = 2) { d <- diag(nrow(ppf$train)) d <- as.dist(d + 1 - ppf$proximity) rf.mds <- stats::cmdscale(d, eig = TRUE, k = k) colnames(rf.mds$points) <- paste("MDS", 1:k, sep = "") df <- data.frame(Class = ppf$train[, 1], rf.mds$points) mds <- ggplot2::ggplot(data = df) + ggplot2::geom_point(ggplot2::aes(x = MDS1, y = MDS2, color = Class), size = I(siz), alpha = .5) + ggplot2::scale_colour_brewer(type = "qual", palette = "Dark2", name = "Type") + ggplot2::theme(legend.position = lege, aspect.ratio = 1) mds } capmds<- "Multidimensional scaling plot to examine similarities between cases" mdspl2d(ppf = pprf.crab)

 

 

The vote matrix ($n \times p$) contains the proportion of times each observation was classified to each class, whole oob. Two possible approaches to visualize the vote matrix information are shown, with a side-by-side jittered dot plot or with ternary plots. A side-by-side jittered dotplot is used for the display, where class is displayed on one axis and proportion is displayed on the other. For each dotplot, the ideal arrangement is that points of observations in that class have values bigger than 0.5, and all other observations have less. This data is close to the ideal but not perfect, e.g. there are a few blue male crabs (orange) that are frequently predicted to be blue females (green), and a few blue female crabs predicted to be another class.

 

 

side <- function(ppf, ang = 0, lege = "bottom", siz = 3, ttl = "") { voteinf <- data.frame(ids = 1:length(ppf$train[, 1]), Type = ppf$train[, 1], ppf$votes, pred = ppf$prediction.oob ) %>% tidyr::gather(Class, Probability, -pred, -ids, -Type) ggplot2::ggplot(data = voteinf, ggplot2::aes(Class, Probability, color = Type)) + ggplot2::geom_jitter(height = 0, size = I(siz), alpha = .5) + ggtitle(ttl) + ylab("Proportion") + ggplot2::scale_colour_brewer(type = "qual", palette = "Dark2") + ggplot2::theme(legend.position = lege, legend.text = ggplot2::element_text(angle = ang)) + ggplot2::labs(colour = "Class") } capside <-"Vote matrix representation by a jittered side-by-side dotplot. Each dotplot shows the proportion of times the case was predicted into the group, with 1 indicating that the case was always predicted to the group and 0 being never." side(pprf.crab)

 

 

A ternary plot is a triangular diagram that shows the proportion of three variables that sum to a constant and is done using barycentric coordinates. Compositional data lies in a $(p-1)$-D simplex in $p$-space. One advantage of ternary plot is that are good to visualize compositional data and the proportion of three variables in a two dimensional space can be shown. When we have tree classes a ternary plot are well defined. With more than tree classes the ternary plot idea need to be generalized.@sutherland2000orca suggest the best approach to visualize compositional data will be to project the data into the $(p-1)-$D space (ternary diagram in $2-D$) This will be the approach used to visualize the vote matrix information.

A ternary plot is a triangular diagram used to display compositional data with three components. More generally, compositional data can have any number of components, say $p$, and hence is contrained to a $(p-1)$-D simplex in $p$-space. The vote matrix is an example of compositional data, with $G$ components.  

 

pl_ter <- function(dat, dx, dy ){ p1 <- dat[[1]] %>% dplyr::filter(pair %in% paste(dx, dy, sep = "-") ) %>% dplyr::select(Class, x, y) %>% ggplot2::ggplot(aes(x, y, color = Class)) + ggplot2::geom_segment(data = dat[[2]], aes(x = x1, xend = x2, y = y1, yend = y2), color = "black" ) + ggplot2::geom_point(size = I(3), alpha = .5) + ggplot2::labs(y = " ", x = " ") + ggplot2::theme(legend.position = "none", aspect.ratio = 1) + ggplot2::scale_colour_brewer(type = "qual", palette = "Dark2") + ggplot2::labs(x = paste0("T", dx, ""), y = paste0("T", dy, " ")) + ggplot2::theme(aspect.ratio = 1) p1 } p1 <- pl_ter(ternary_str(pprf.crab, id = c(1, 2, 3), sp = 3, dx = 1, dy = 2), 1, 2 ) p2 <- pl_ter(ternary_str(pprf.crab, id = c(1, 2, 3), sp = 3, dx = 1, dy = 3), 1, 3) p3 <- pl_ter(ternary_str(pprf.crab, id = c(1, 2, 3), sp = 3, dx = 2, dy = 3), 2, 3) gridExtra::grid.arrange(p1, p2, p3, ncol = 3) capter <- "Generalized ternary plot representation of the vote matrix for four classes. The tetrahedron is shown pairwise. Each point corresponds to one observation and color is the true class."

 

 

To see a complete description about how to visualize a PPforest object read Interactive Graphics for Visually Diagnosing Forest Classifiers in R [@da2017interactive].

REFERENCES