plotMcl: Plots of the (estimated) within-class distributions of variables

Description

This function allows to visualise the (estimated) distributions of one or several variables for each of the classes of the outcomes. This allows to study how exactly variables of interest influence the outcome, which is crucial for interpretive purposes. Two types of visualisations are available: density plots and boxplots. See the 'Details' section below for further explanation.

Usage

plotMcl(
  data,
  yvarname,
  varnames,
  plot_type = c("both", "density", "boxplot")[1],
  addtitles = TRUE,
  plotit = TRUE
)

Value

A list of ggplot2 plots returned invisibly.

Arguments

data: Data frame containing the variables.
yvarname: Name of outcome variable.
varnames: Names of the variables for which plots should be created.
plot_type: Plot type, one of the following: "both" (the default), "density", "boxplot". If "density", "density" plot are produced, if "boxplot", "boxplot" plots are produced, and if "both", both "density" plots and "boxplot" plots are produced. See the 'Details' section below for details.
addtitles: Set to TRUE (default) to add headings providing the names of the respective variables to the plots.
plotit: This states whether the plots are actually plotted or merely returned as ggplot objects. Default is TRUE.

Author

Roman Hornung

Details

For the "density" plots, kernel density estimates (obtained using the density() function from base R) of the within-class distributions are plotted in the same plot using different colors and, depending on the number of classes, different line types. To account for the different number of observations per class, each density is multiplied by the proportion of observations from that class. The resulting scaled densities can be interpreted in terms of the local density of the observations from each class relative to those from the other classes. For example, if a scaled density has the largest value in a particular region, this can be interpreted as the respective class being the most frequent in that region. Another example: If the scaled density of class "A" is twice as large as the scaled density of class "B" in a particular region, this can be interpreted to mean that there are twice as many observations of class "A" as of class "B" in that region.

In the "density" plots, only classes represented by at least two observations are considered. If the number of classes is greater than 7, the different classes are distinguished using both colors and line styles. To indicate the absolute numbers of observations in the different regions, the locations of the observations from the different classes are visualized using a rug plot on the x-axis, using the same colors and line types as for the density plots. If the number of observations is greater than 1,000, a random subset of 1,000 observations is shown in the rug plot instead of all observations for visual clarity.

The "boxplot" plots show the (estimated) within-class distributions side by side using boxplots. All classes are considered, even those represented by only a single observation. For the plot_type="both" option, which displays both "density" and "boxplot" plots, the boxplots are displayed using the same colors and ( if applicable) line styles as the kernel density estimates, for clarity. Boxplots of classes for which no kernel density estimates were obtained (i.e., those of the classes represented by single observations) are shown in grey.

Note that plots are only generated for those variables in varnames that have at least as many unique values as there are outcome classes. For categorical variables, the category labels are printed on the x- or y-axis of the "density" or "boxplot" plots, respectively. The rug plots of the "density" plots are produced only for numeric variables.

References

Hornung, R., Hapfelmeier, A. (2024). Multi forests: Variable importance for multi-class outcomes. arXiv:2409.08925, <tools:::Rd_expr_doi("10.48550/arXiv.2409.08925")>.
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <tools:::Rd_expr_doi("10.1007/s42979-021-00920-1")>.

Examples

Run this code

if (FALSE) {

## Load package:

library("diversityForest")



## Plot "density" and "boxplot" plots (default: plot_type = "both") for the 
## first three variables in the "hars" dataset:

data(hars)
plotMcl(data = hars, yvarname = "Activity", varnames = c("tBodyAcc.mean...X", 
                                                         "tBodyAcc.mean...Y", 
                                                         "tBodyAcc.mean...Z"))


## Plot only the "density" plots for these variables:

plotMcl(data = hars, yvarname = "Activity", 
        varnames = c("tBodyAcc.mean...X", "tBodyAcc.mean...Y", 
                     "tBodyAcc.mean...Z"), plot_type = "density")

## Plot the "density" plots for these variables, but without titles of the
## plots:

plotMcl(data = hars, yvarname = "Activity", varnames = 
          c("tBodyAcc.mean...X", "tBodyAcc.mean...Y", "tBodyAcc.mean...Z"), 
        plot_type = "density", addtitles = FALSE)


## Make density plots for these variables, but only save them in a list "ps"
## without plotting them ("plotit = FALSE"):

ps <- plotMcl(data = hars, yvarname = "Activity", varnames = 
                c("tBodyAcc.mean...X", "tBodyAcc.mean...Y", 
                  "tBodyAcc.mean...Z"), plot_type = "density", 
              addtitles = FALSE, plotit = FALSE)


## The plots can be manipulated later by using ggplot2 functionalities:

library("ggplot2")
p1 <- ps[[1]] + ggtitle("First variable in the dataset") + 
  labs(x="Variable values", y="my scaled density")

p2 <- ps[[3]] + ggtitle("Third variable in the dataset") + 
  labs(x="Variable values", y="my scaled density")


## Combine both of the above plots:

library("ggpubr")
p <- ggarrange(p1, p2, ncol = 2)
p

## # Save as PDF:
## ggsave(file="mypathtofolder/FigureXY1.pdf", width=14, height=6)

}

Run the code above in your browser using DataLab