corPlot: Correlation matrices with conditioning

Description

Function to draw and visualise correlation matrices. The primary purpose is as a tool for exploratory data analysis. Hierarchical clustering is used to group similar variables.

Usage

corPlot(
  mydata,
  pollutants = NULL,
  type = "default",
  cluster = TRUE,
  method = "pearson",
  use = "pairwise.complete.obs",
  annotate = c("cor", "signif", "stars", "none"),
  dendrogram = FALSE,
  triangle = c("both", "upper", "lower"),
  diagonal = TRUE,
  cols = "default",
  r.thresh = 0.8,
  text.col = c("black", "black"),
  key.title = NULL,
  key.position = "none",
  strip.position = "top",
  auto.text = TRUE,
  plot = TRUE,
  key = NULL,
  ...
)

Value

An openair object.

Arguments

mydata

A data frame which should consist of some numeric columns.

pollutants

The names of the data series in mydata to be plotted by corPlot. The default option NULL and the alternative "all" both use all available valid (numeric) data.

type

Character string(s) defining how data should be split/conditioned before plotting. "default" produces a single panel using the entire dataset. Any other options will split the plot into different panels - a roughly square grid of panels if one type is given, or a 2D matrix of panels if two types are given. type is always passed to cutData(), and can therefore be any of:

A built-in type defined in cutData() (e.g., "season", "year", "weekday", etc.). For example, type = "season" will split the plot into four panels, one for each season.
The name of a numeric column in mydata, which will be split into n.levels quantiles (defaulting to 4).
The name of a character or factor column in mydata, which will be used as-is. Commonly this could be a variable like "site" to ensure data from different monitoring sites are handled and presented separately. It could equally be any arbitrary column created by the user (e.g., whether a nearby possible pollutant source is active or not).

Most openair plotting functions can take two type arguments. If two are given, the first is used for the columns and the second for the rows.
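
As a rough illustration of the numeric case, the base R sketch below splits a numeric column into four quantile bins and computes one correlation matrix per bin, which is analogous to one panel per level. It is not the cutData() implementation; the data frame df, the column names ws, no2 and o3, and the variable n_levels are invented for illustration.

```r
# Hypothetical data: 'ws' stands in for any numeric column in mydata
set.seed(1)
df <- data.frame(ws = runif(100, 0, 10), no2 = rnorm(100), o3 = rnorm(100))

# Break points at the quartiles (4 levels gives 4 bins)
n_levels <- 4
breaks <- quantile(df$ws, probs = seq(0, 1, length.out = n_levels + 1))

# include.lowest = TRUE keeps the minimum value in the first bin
df$ws_bin <- cut(df$ws, breaks = breaks, include.lowest = TRUE)

# One correlation matrix per bin, analogous to one panel per level
cors <- lapply(split(df[c("no2", "o3")], df$ws_bin), cor)
length(cors)
```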

cluster

Should the data be ordered according to cluster analysis? If TRUE, hierarchical clustering is applied to the correlation matrices using hclust() to group similar variables together. With many variables, clustering can greatly assist interpretation.

method

The correlation method to use. Can be "pearson", "spearman" or "kendall".
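
The three methods can give quite different answers for the same data. The sketch below uses stats::cor() directly on made-up vectors (x and y are not from any openair dataset) with a monotonic but non-linear relationship: the rank-based methods report a perfect association, while Pearson, which measures linear association, does not.

```r
set.seed(42)
x <- rnorm(50)
y <- exp(x)   # strictly increasing in x, but not linear

r_pearson  <- cor(x, y, method = "pearson")   # linear association
r_spearman <- cor(x, y, method = "spearman")  # rank-based
r_kendall  <- cor(x, y, method = "kendall")   # rank-based (concordant pairs)

c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall)
```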

use

How to handle missing values in the cor function. The default is "pairwise.complete.obs". Care should be taken with the choice of how to handle missing data when considering pair-wise correlations.
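
To see why the choice matters, the sketch below calls stats::cor() on a small invented data frame (columns a, b and c are hypothetical). With "pairwise.complete.obs" each pair of columns uses every row where both values are present, so different cells of the matrix can be based on different numbers of observations. With "complete.obs" only rows with no missing values in any column are used.

```r
# Hypothetical data with missing values in different rows
df <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 1, NA, 5, 4, 6),
  c = c(6, 4, 5, NA, 1, 2)
)

pairwise <- cor(df, use = "pairwise.complete.obs")
complete <- cor(df, use = "complete.obs")

# Number of observations available for each pair of columns
ok <- !is.na(df)
storage.mode(ok) <- "numeric"
n_pairs <- crossprod(ok)
n_pairs
```

Here the a-b correlation is based on five rows under pairwise deletion but only four under complete-case deletion, so the two matrices differ.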

annotate

What to annotate each correlation tile with. One of:

"cor", the correlation coefficient to 2 decimal places.
"signif", an X marker if the correlation is significant.
"stars", standard significance stars.
"none", no annotation.

dendrogram

Should a dendrogram be plotted? When TRUE a dendrogram is shown on the plot. Note that this will only work for type = "default". Defaults to FALSE.

triangle

Which 'triangles' of the correlation plot should be shown? Can be "both", "lower" or "upper". Defaults to "both".

diagonal

Should the 'diagonal' of the correlation plot be shown? The diagonal of a correlation matrix is axiomatically always 1 as it represents correlating a variable with itself. Defaults to TRUE.
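
The effect of the triangle and diagonal options can be pictured as masking cells of the correlation matrix. The base R sketch below does this by hand on the built-in mtcars dataset; it illustrates the idea only and is not how corPlot() builds its plot.

```r
# Correlation matrix for a few columns of a built-in dataset
m <- cor(mtcars[, c("mpg", "hp", "wt", "qsec")])

# Keep the lower triangle (and diagonal) by blanking the upper triangle
lower_only <- m
lower_only[upper.tri(lower_only)] <- NA

# Additionally drop the diagonal, which is always 1
no_diag <- lower_only
diag(no_diag) <- NA

round(no_diag, 2)
```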

cols

Colours to use for plotting. Can be a pre-set palette (e.g., "turbo", "viridis", "tol", "Dark2", etc.) or a user-defined vector of R colours (e.g., c("yellow", "green", "blue", "black") - see colours() for a full list) or hex-codes (e.g., c("#30123B", "#9CF649", "#7A0403")). See openColours() for more details.

r.thresh

Correlation values greater than r.thresh will be shown in bold type. This helps to highlight high correlations.

text.col

The colour of the text used to show the correlation values. The first value controls the colour of negative correlations and the second positive.

key.title

Used to set the title of the legend. The legend title is passed to quickText() if auto.text = TRUE.

key.position

Location where the legend is to be placed. Allowed arguments include "top", "right", "bottom", "left" and "none", the last of which removes the legend entirely.

strip.position

Location where the facet 'strips' are located when using type. When one type is provided, can be one of "left", "right", "bottom" or "top". When two types are provided, this argument defines whether the strips are "switched" and can take either "x", "y", or "both". For example, "x" will switch the 'top' strip locations to the bottom of the plot.

auto.text

Either TRUE (default) or FALSE. If TRUE, titles and axis labels will automatically try to format pollutant names and units properly, e.g., by subscripting the "2" in "NO2". Passed to quickText().

plot

When openair plots are created they are automatically printed to the active graphics device. plot = FALSE deactivates this behaviour. This may be useful when the plot data is of more interest, or the plot is required to appear later (e.g., later in a Quarto document, or to be saved to a file).

key

Deprecated; please use key.position. If FALSE, sets key.position to "none".

...

Additional options are passed on to cutData() for type handling. Some further arguments are also available:

xlab, ylab and main override the x-axis label, y-axis label, and plot title.
layout sets the layout of facets - e.g., layout = c(2, 5) will have 2 columns and 5 rows.
fontsize overrides the overall font size of the plot.
border sets the border colour of each ellipse.

Author

David Carslaw

Jack Davison

Adapted from the approach taken by Sarkar (2007)

Details

The corPlot() function plots correlation matrices. The implementation relies heavily on that shown in Sarkar (2007), with a few extensions.

Correlation matrices are a very effective way of understanding relationships between many variables. corPlot() shows the correlation coded in three ways: by shape (ellipses), colour and the numeric value. The ellipses can be thought of as visual representations of a scatter plot. With a perfect positive correlation, a line at 45 degrees positive slope is drawn. For zero correlation the shape becomes a circle. See examples below.
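
The link between the correlation and the ellipse shape can be sketched with one common parametrisation, x = cos(theta + d/2) and y = cos(theta - d/2) with d = acos(r). Whether corPlot() uses exactly this form is an assumption; the helper cor_ellipse() below is hypothetical and only demonstrates the limiting cases described above.

```r
# Hypothetical helper: outline of an ellipse representing correlation r
cor_ellipse <- function(r, n = 100) {
  d <- acos(r)                               # 0 for r = 1, pi/2 for r = 0
  theta <- seq(0, 2 * pi, length.out = n)
  data.frame(x = cos(theta + d / 2), y = cos(theta - d / 2))
}

e_pos  <- cor_ellipse(1)    # collapses onto the line y = x (45 degree slope)
e_zero <- cor_ellipse(0)    # unit circle
e_neg  <- cor_ellipse(-1)   # collapses onto the line y = -x

# Intermediate correlations give ellipses tilted towards one diagonal
plot(cor_ellipse(0.7), type = "l", asp = 1)
```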

With many different variables it can be difficult to see relationships between variables, i.e., which variables tend to behave most like one another. For this reason hierarchical clustering is applied to the correlation matrices to group variables that are most similar to one another (if cluster = TRUE).
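
The reordering idea can be sketched in base R as follows. The dissimilarity used here, 1 - r, is an assumption chosen for illustration, and the built-in mtcars dataset stands in for air quality data; the exact distance measure and linkage used inside corPlot() may differ.

```r
# Correlation matrix for a handful of variables
m <- cor(mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")])

# Assumed dissimilarity: highly correlated variables are "close"
d <- as.dist(1 - m)

# Hierarchical clustering, then reorder rows and columns by the result
hc <- hclust(d)
ordered <- m[hc$order, hc$order]

round(ordered, 2)

# The same object can be drawn as a dendrogram
plot(hc)
```

After reordering, variables that behave alike sit next to one another, so blocks of strong correlation become easier to spot.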

If clustering is chosen it is also possible to add a dendrogram using the option dendrogram = TRUE. Note that dendrograms can only be plotted for type = "default", i.e., when there is only a single panel. The dendrogram can also be recovered from the plot object itself and plotted more clearly; see examples below.

It is also possible to use the openair type option to condition the data in many flexible ways, although this may become difficult to visualise with too many panels.

Examples

# basic plot
corPlot(mydata)

if (FALSE) {
# plot by season
corPlot(mydata, type = "season")

# recover dendrogram when cluster = TRUE and plot it
res <- corPlot(mydata, plot = FALSE)
plot(res$clust)

# a more interesting example uses hydrocarbon measurements
hc <- importAURN(site = "my1", year = 2005, hc = TRUE)

# now it is possible to see the hydrocarbons that behave most
# similarly to one another
corPlot(hc)
}