scatterHex: scatter plot where observations are grouped into hexagonal bins and then summarized

Description

scatter plot where observations are grouped into hexagonal bins and then summarized

Usage

scatterHex(
  data_frame,
  x.by,
  y.by,
  color.by = NULL,
  bins = 30,
  color.method = NULL,
  split.by = NULL,
  rows.use = NULL,
  color.panel = dittoColors(),
  colors = seq_along(color.panel),
  x.adjustment = NULL,
  y.adjustment = NULL,
  color.adjustment = NULL,
  x.adj.fxn = NULL,
  y.adj.fxn = NULL,
  color.adj.fxn = NULL,
  multivar.split.dir = c("col", "row"),
  split.nrow = NULL,
  split.ncol = NULL,
  split.adjust = list(),
  min.density = NA,
  max.density = NA,
  min.color = "#F0E442",
  max.color = "#0072B2",
  min.opacity = 0.2,
  max.opacity = 1,
  min = NA,
  max = NA,
  rename.color.groups = NULL,
  xlab = x.by,
  ylab = y.by,
  main = "make",
  sub = NULL,
  theme = theme_bw(),
  do.contour = FALSE,
  contour.color = "black",
  contour.linetype = 1,
  do.ellipse = FALSE,
  do.label = FALSE,
  labels.size = 5,
  labels.highlight = TRUE,
  labels.repel = TRUE,
  labels.split.by = split.by,
  labels.repel.adjust = list(),
  add.trajectory.by.groups = NULL,
  add.trajectory.curves = NULL,
  trajectory.group.by,
  trajectory.arrow.size = 0.15,
  add.xline = NULL,
  xline.linetype = "dashed",
  xline.color = "black",
  add.yline = NULL,
  yline.linetype = "dashed",
  yline.color = "black",
  legend.show = TRUE,
  legend.color.title = "make",
  legend.color.breaks = waiver(),
  legend.color.breaks.labels = waiver(),
  legend.density.title = "Observations",
  legend.density.breaks = waiver(),
  legend.density.breaks.labels = waiver(),
  show.grid.lines = TRUE,
  data.out = FALSE
)

Value

A ggplot object where colored hexagonal bins are used to summarize observations in a scatter plot.

Alternatively, if data.out = TRUE, a list containing three slots is output: the plot (named 'plot'), a data.table containing the updated underlying data for target rows (named 'data'), and a list providing mappings of final column names in 'data' to given plot aesthetics (named 'cols_used'). The mappings are included because many features plot from newly made, modified columns rather than from the original ones.

Arguments

data_frame

A data.frame where columns are features and rows are observations you might wish to visualize.

x.by, y.by

Single strings denoting the name of a column of data_frame containing numeric data to use for the x- and y-axis of the scatterplot.

color.by

Single string denoting the name of a column of data_frame to use, instead of point density, for setting the color of plotted hexagons. Alternatively, a string vector naming multiple such columns of data to plot at once.

bins

Numeric or numeric vector giving the number of hexagonal bins in the x and y directions. Set to 30 by default.

color.method

Single string that specifies how color.by data should be summarized per each hexagonal bin. Options, and the default, depend on whether the color.by-data is continuous versus discrete:

Continuous: String naming a function for how target data should be summarized for each bin. Can be any function that inputs (summarizes) a numeric vector and outputs a single numeric value. Default is median. Other useful options are sum, mean, sd, or max. You can also use a custom function as long as you give it a name; e.g. first run logsum <- function(x) { log(sum(x)) } externally, then give color.method = "logsum"

Discrete: A string signifying whether the color should be based simply on the "max" grouping of the bin (the default), or on "max.prop", the maximum proportion of observations belonging to any one grouping.
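As a rough illustration of the two modes described above, the following base R sketch summarizes the contents of a single bin. The data (`bin_values`, `bin_groups`) are made up, and looking the function up by name with `get()` is only an assumed stand-in for whatever scatterHex does internally. The reading of "max.prop" follows the description above.

```r
# Hypothetical values falling into one hexagonal bin.
bin_values <- c(1, 2, 3)
bin_groups <- c("A", "A", "B", "A", "C")

# Continuous: a custom summary function must exist under a name,
# so that the name can be passed as a string.
logsum <- function(x) { log(sum(x)) }
summarize <- get("logsum", mode = "function")  # assumed lookup-by-name step
continuous_summary <- summarize(bin_values)

# Discrete: "max" picks the most common grouping in the bin, while
# "max.prop" reports the proportion of the bin that grouping makes up.
counts <- table(bin_groups)
max_group <- names(counts)[which.max(counts)]
max_prop <- max(counts) / sum(counts)

print(continuous_summary)
print(max_group)
print(max_prop)
```

With the function defined this way, it would be requested with color.method = "logsum".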

split.by

1 or 2 strings denoting the name(s) of column(s) of data_frame containing discrete data to use for faceting / separating data points into separate plots.

When 2 columns are named, c(row,col), the first is used as rows and the second is used for columns of the resulting facet grid.

When 1 column is named, shape control can be achieved with split.nrow and split.ncol.

rows.use

String vector of rownames of data_frame OR an integer vector specifying the row-indices of data points which should be plotted.

Alternatively, a Logical vector, the same length as the number of rows in data_frame, where TRUE values indicate which rows to plot.
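The three accepted forms can be compared with plain data.frame subsetting. This is only a sketch on a made-up data frame (`df`), not the function's internal code.

```r
# Hypothetical data frame with rownames, standing in for data_frame.
df <- data.frame(
    PC1 = c(0.1, 0.5, -0.3, 0.9),
    groups = c("A", "B", "C", "A"),
    row.names = paste0("obs", 1:4)
)

by_name    <- df[c("obs1", "obs2", "obs4"), ]  # rownames
by_index   <- df[c(1, 2, 4), ]                 # integer row indices
by_logical <- df[df$groups != "C", ]           # logical, one value per row

# All three approaches select the same rows.
print(rownames(by_name))
print(rownames(by_index))
print(rownames(by_logical))
```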

color.panel

String vector which sets the colors to draw from when color.by indicates discrete data. dittoColors() by default, see dittoColors for contents.

A named vector can be used if names are matched to the distinct values of the color.by data.

colors

Integer vector, the indexes / order, of colors from color.panel to actually use.

Useful for quickly swapping around colors of the default set (when not using names for color matching).
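The relationship between color.panel and colors can be sketched with ordinary vector indexing. The three hex codes below are chosen for illustration only and are not claimed to be the contents of dittoColors().

```r
# Hypothetical three-color panel.
panel <- c("#E69F00", "#56B4E9", "#009E73")

# 'colors' acts as an index into the panel: here the first two swap places.
colors <- c(2, 1, 3)
reordered <- panel[colors]
print(reordered)

# A named panel ties each color to a specific value of the color.by data,
# regardless of the order in which the colors are listed.
named_panel <- c(B = "#56B4E9", A = "#E69F00", C = "#009E73")
print(named_panel[c("A", "B", "C")])
```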

x.adjustment, y.adjustment, color.adjustment

A recognized string indicating whether numeric x.by, y.by, and color.by data should be used directly (default) or should be adjusted to be

"z-score": scaled with the scale() function to produce a relative-to-mean z-score representation
"relative.to.max": divided by the maximum value to give percent of max values between [0,1]

Ignored if the target data is not numeric as these known adjustments target numeric data only.

In order to leave the unedited data available for use in other features, the adjusted data are put in a new column and that new column is used for plotting.
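The two named adjustments correspond to simple base R operations. The sketch below applies them to a made-up vector `x`. Note that the "relative.to.max" result only falls within [0,1] when the values are non-negative.

```r
x <- c(2, 4, 6, 8)

# "z-score": center on the mean and divide by the standard deviation.
# scale() returns a one-column matrix, so it is flattened back to a vector.
z <- as.vector(scale(x))

# "relative.to.max": divide by the maximum value.
rel <- x / max(x)

print(z)
print(rel)
```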

x.adj.fxn, y.adj.fxn, color.adj.fxn

If you wish to apply a function to edit the x.by, y.by, or color.by data before use, in a way not possible with the color.adjustment input, this input can be given a function which takes in a vector of values as input and returns a vector of values of the same length as output.

For example, function(x) {log2(x)} or as.factor.

In order to leave the unedited data available for use in other features, the adjusted data are put in a new column and that new column is used for plotting.
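A function qualifies for these inputs as long as it returns one output element per input element. The helper `log2p1` below is hypothetical and only shown to make that requirement concrete.

```r
# Hypothetical adjustment function: log2 with a pseudocount of 1,
# so that zeros stay finite.
log2p1 <- function(x) { log2(x + 1) }

raw <- c(0, 1, 3, 7)
adjusted <- log2p1(raw)
print(adjusted)
print(length(adjusted) == length(raw))  # output length matches input length

# as.factor also qualifies: one output element per input element.
as_groups <- as.factor(c(1, 2, 2, 3))
print(as_groups)
```

Such a function could then be supplied as, for example, color.adj.fxn = log2p1.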

multivar.split.dir

"row" or "col", sets the direction of faceting used for color.by values when:

color.by is given multiple column names
AND split.by is used to provide an additional feature to facet by

split.nrow, split.ncol

Integers which set the dimensions of faceting/splitting when faceting by a single feature.

split.adjust

A named list which allows extra parameters to be pushed through to the faceting function call. List elements should be valid inputs to the faceting functions, e.g. `list(scales = "free")`.

For options, when giving 1 column to split.by, see facet_wrap, OR when giving 2 columns to split.by, see facet_grid.

min.density, max.density

Number which sets the min/max values used for the density scale. Used no matter whether density is represented through opacity or color.

min.color, max.color

Strings which set the colors for the min/max values of the color scale.

min.opacity, max.opacity

Scalar between [0,1] which sets the minimum or maximum opacity used for the density legend (when color is used for color.by data and density is shown via opacity).

min, max

Number which sets the values associated with the minimum or maximum color for color.by data.

rename.color.groups

String vector which sets new names for the identities of color.by groups.

xlab, ylab

Strings which set the labels for the axes. To remove, set to NULL.

main

String, sets the plot title. The default title is either "Density", color.by, or NULL, depending on the identity of color.by. To remove, set to NULL.

sub

String, sets the plot subtitle.

theme

A ggplot theme which will be applied before internal adjustments. Default = theme_bw(). See https://ggplot2.tidyverse.org/reference/ggtheme.html for other options and ideas.

do.contour

Logical. Whether density-based contours should be displayed.

contour.color

String that sets the color of the do.contour contours.

contour.linetype

String or numeric which sets the type of line used for do.contour contours. Defaults to 1, which is equivalent to "solid", but see linetype for other options.

do.ellipse

Logical. Whether color.by groups should be surrounded by median-centered ellipses.

do.label

Logical. Whether to add text labels near the center (median) of color.by groups.

labels.size

Number which sets the size of labels text when do.label = TRUE.

labels.highlight

Logical. Whether labels should have a box behind them when do.label = TRUE.

labels.repel

Logical which sets whether the labels' placements will be adjusted with ggrepel to avoid intersections between labels and plot bounds when do.label = TRUE. TRUE by default.

labels.split.by

String of one or two column names which controls the facet-split calculations for label placements. Defaults to split.by, so generally there is no need to adjust this unless you plan to apply faceting externally.

labels.repel.adjust

A named list which allows extra parameters to be pushed through to ggrepel function calls. List elements should be valid inputs to geom_label_repel by default, or to geom_text_repel when labels.highlight = FALSE.

add.trajectory.by.groups

List of vectors representing trajectory paths, each from start-group to end-group, where vector contents are the group-names indicated by the trajectory.group.by column of data_frame.

add.trajectory.curves

List of matrices, each representing coordinates for a trajectory path, from start to end, where matrix columns represent x and y coordinates of the paths.

trajectory.group.by

String denoting the name of a column of data_frame to use for generating trajectories from data point groups.

trajectory.arrow.size

Number representing the size of trajectory arrows, in inches. Default = 0.15.

add.xline

Numeric value(s) where one or multiple vertical line(s) should be added.

xline.linetype

String which sets the type of line for add.xline. Defaults to "dashed", but any ggplot linetype will work.

xline.color

String that sets the color(s) of the add.xline line(s).

add.yline

Numeric value(s) where one or multiple horizontal line(s) should be added.

yline.linetype

String which sets the type of line for add.yline. Defaults to "dashed", but any ggplot linetype will work.

yline.color

String that sets the color(s) of the add.yline line(s).

legend.show

Logical. Whether any legend should be displayed. Default = TRUE.

legend.density.title, legend.color.title

Strings which set the title for the legends.

legend.density.breaks, legend.color.breaks

Numeric vector which sets the discrete values to label in the density and color.by legends.

legend.density.breaks.labels, legend.color.breaks.labels

String vector, with same length as legend.*.breaks, which sets the labels for the tick marks or hex icons of the associated legend.

show.grid.lines

Logical which sets whether grid lines should be shown within the plot space.

data.out

Logical. When set to TRUE, changes the output from the plot alone to a list containing the plot ("plot"), a data.frame of the underlying data for target observations ("data"), and the ultimately used mapping of columns to given aesthetic sets ("cols_used"). The mapping is included because many features plot from newly made, modified columns rather than from the original ones.

Many characteristics of the plot can be adjusted using discrete inputs

Colors: min.color and max.color adjust the colors for continuous data.
For discrete color.by plotting with color.method = "max", colors are instead adjusted with color.panel and/or colors & the labels of the groupings can be changed using rename.color.groups.
Titles and axes labels can be adjusted with main, sub, xlab, ylab, and legend.color.title and legend.density.title arguments.
Legends can also be adjusted in other ways, using variables that all start with "legend." for easy tab completion lookup.

Additional Features

Other tweaks and features can be added as well. Each is accessible through 'tab' autocompletion starting with "do.---" or "add.---", and if additional inputs are involved in implementing or tweaking these, the associated inputs will start with "---.":

If do.contour is set to TRUE, density gradient contour lines will be overlaid with color and linetype adjustable via contour.color and contour.linetype.
If add.trajectory.by.groups is given a list of vectors (each vector being group names from start-group-name to end-group-name), and a column name pointing to the relevant grouping information is provided to trajectory.group.by, then median centers of the groups will be calculated and arrows will be overlaid to show trajectory inference paths.
If add.trajectory.curves is given a list of matrices (each matrix containing x, y coordinates from start to end), paths and arrows will be overlaid to show trajectory inference curves. Arrow size is controlled with the trajectory.arrow.size input.
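The shapes expected by the two trajectory inputs can be sketched in base R. The data frame `df` and its column names are made up, and computing the centers with `aggregate()` is only an assumed equivalent of the median-center step described above.

```r
# Hypothetical coordinates and groupings, standing in for data_frame columns.
df <- data.frame(
    PC1 = c(0, 1, 2, 5, 6, 7, 10, 11, 12),
    PC2 = c(0, 2, 1, 5, 4, 6, 9, 8, 10),
    groups = rep(c("A", "B", "C"), each = 3)
)

# Median center of each group.
centers <- aggregate(cbind(PC1, PC2) ~ groups, data = df, FUN = median)

# add.trajectory.by.groups shape: a list of vectors of group names,
# each running from start-group to end-group.
trajectory_groups <- list(c("A", "B", "C"))

# add.trajectory.curves shape: a list of matrices of x, y coordinates,
# one row per step along the path. Here the path follows the group centers.
path <- trajectory_groups[[1]]
curve <- as.matrix(centers[match(path, centers$groups), c("PC1", "PC2")])
trajectory_curves <- list(curve)

print(centers)
print(curve)
```

In this sketch, `trajectory_groups` would be paired with trajectory.group.by = "groups", whereas `trajectory_curves` carries its own coordinates and needs no grouping column.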

Author

Daniel Bunis with some code adapted from Giuseppe D'Agostino

Details

This function first makes any requested adjustments to data in the given data_frame, internally only, such as scaling the color.by-column if color.adjustment was given "z-score".

Next, data_frame is subset to only the target rows based on the rows.use input.

Finally, a hex plot is created using this dataframe:

If color.by is not provided, coloring is based on the density of observations within each hex bin. When color.by is provided, density is represented through opacity while coloring is based on a summarization, chosen with the color.method input, of the target color.by data.

If split.by was used, the plot will be split into a matrix of panels based on the associated groupings.
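The difference between density-based and summary-based coloring can be illustrated without any plotting. The sketch below swaps in simple rectangular bins for hexagonal ones, purely to keep it dependency-free, and all data are made up.

```r
# Hypothetical observations.
x <- c(0.1, 0.2, 0.3, 0.6, 0.7, 0.9)
y <- c(0.1, 0.3, 0.2, 0.8, 0.9, 0.7)
gene <- c(1, 5, 3, 10, 20, 30)

# Rectangular 2 x 2 binning as a simple stand-in for hexagonal binning.
x_bin <- cut(x, breaks = c(0, 0.5, 1), labels = c("x1", "x2"))
y_bin <- cut(y, breaks = c(0, 0.5, 1), labels = c("y1", "y2"))
bin <- interaction(x_bin, y_bin, drop = TRUE)

# Without color.by: the number of observations per bin drives the color.
density <- tapply(gene, bin, length)

# With color.by: a per-bin summary (median here) drives the color,
# while the observation count would instead be shown through opacity.
summary_color <- tapply(gene, bin, median)

print(density)
print(summary_color)
```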

Examples

example("dittoExampleData", echo = FALSE)

# The minimal inputs for scatterHex are the 'data_frame', and 2 column names,
#   given to 'x.by' and 'y.by', indicating which data to use for the x and y
#   axes, respectively.
scatterHex(
    example_df, x.by = "PC1", y.by = "PC2")

# 'color.by' can also be given a column name in order to represent that
#   column's data in the color of the hexes.
# Note: This capability requires the suggested package 'ggplot.multistats'.
if (requireNamespace("ggplot.multistats", quietly = TRUE)) {
    scatterHex(
        example_df, x.by = "PC1", y.by = "PC2",
        color.by = "groups")
}
if (requireNamespace("ggplot.multistats", quietly = TRUE)) {
    scatterHex(
        example_df, x.by = "PC1", y.by = "PC2",
        color.by = "gene1")
}

# Data can be "split" or faceted by a discrete variable as well.
scatterHex(example_df, x.by = "PC1", y.by = "PC2",
    split.by = "timepoint") # single split.by element
scatterHex(example_df, x.by = "PC1", y.by = "PC2",
    split.by = c("groups","SNP")) # row and col split.by elements

# Modify the look with intuitive inputs
scatterHex(example_df, x.by = "PC1", y.by = "PC2",
    show.grid.lines = FALSE,
    ylab = NULL, xlab = "PC2 by PC1",
    main = "Plot Title",
    sub = "subtitle",
    legend.density.title = "Items")
# 'max.density' is one of these intuitively named inputs that can be
#   extremely useful for saying "I only care for opacity to be decreased
#   in regions with exceptionally low observation numbers."
# (A good value for this in "real" data might be 10 or 50 or higher, but for
#   our sparse example data, we need to do a lot to show this off at all!)
if (requireNamespace("ggplot.multistats", quietly = TRUE)) {
    scatterHex(
        example_df, x.by = "PC1", y.by = "PC2",
        color.by = "gene1", bins = 10,
        sub = "Default density scale")
}
if (requireNamespace("ggplot.multistats", quietly = TRUE)) {
    scatterHex(
        example_df, x.by = "PC1", y.by = "PC2",
        color.by = "gene1", bins = 10,
        sub = "Density capped low for ignoring sparse regions",
        max.density = 2)
}

# You can restrict to only certain data points using the 'rows.use' input.
#   The input can be given rownames, indexes, or a logical vector
scatterHex(example_df, x.by = "PC1", y.by = "PC2",
    sub = "show only first 40 observations, by index",
    rows.use = 1:40)
scatterHex(example_df, x.by = "PC1", y.by = "PC2",
    sub = "show only 3 obs, by name (plotting gets a bit wonky for few points)",
    rows.use = c("obs1", "obs2", "obs25"))
scatterHex(example_df, x.by = "PC1", y.by = "PC2",
    sub = "show groups A,B,D only, by logical",
    rows.use = example_df$groups!="C")

# Many extra features are easy to add as well:
#   Each is started via an input starting with 'do.FEATURE*' or 'add.FEATURE*'
#   And when tweaks for that feature are possible, those inputs will be
#   named starting with 'FEATURE*'. For example, color.by groups can be labeled
#   with 'do.label = TRUE' and the tweaks for this feature are given with inputs
#   'labels.size', 'labels.highlight', and 'labels.repel':
if (requireNamespace("ggplot.multistats", quietly = TRUE)) {
    scatterHex(example_df, x.by = "PC1", y.by = "PC2", color.by = "groups",
        sub = "default labeling",
        do.label = TRUE)          # Turns on the labeling feature
}
if (requireNamespace("ggplot.multistats", quietly = TRUE)) {
    scatterHex(example_df, x.by = "PC1", y.by = "PC2", color.by = "groups",
        sub = "tweaked labeling",
        do.label = TRUE,          # Turns on the labeling feature
        labels.size = 8,          # Adjust the text size of labels
        labels.highlight = FALSE, # Removes white background behind labels
        labels.repel = FALSE)     # Turns off anti-overlap location adjustments
}

# Faceting can also be used to show multiple continuous variables side-by-side
#   by giving a vector of column names to 'color.by'.
#   This can also be combined with 1 'split.by' variable, with direction then
#   controlled via 'multivar.split.dir':
if (requireNamespace("ggplot.multistats", quietly = TRUE)) {
    scatterHex(example_df, x.by = "PC1", y.by = "PC2", bins = 10,
        color.by = c("gene1", "gene2"))
}
if (requireNamespace("ggplot.multistats", quietly = TRUE)) {
    scatterHex(example_df, x.by = "PC1", y.by = "PC2", bins = 10,
        color.by = c("gene1", "gene2"),
        split.by = "groups")
}
if (requireNamespace("ggplot.multistats", quietly = TRUE)) {
    scatterHex(example_df, x.by = "PC1", y.by = "PC2", bins = 10,
        color.by = c("gene1", "gene2"),
        split.by = "groups",
        multivar.split.dir = "row")
}

# Sometimes, it can be useful for external editing or troubleshooting purposes
#   to see the underlying data that was directly used for plotting.
# 'data.out = TRUE' can be provided in order to obtain not just the plot ("plot"),
#   but also the "data" and "cols_used" returned as a list.
out <- scatterHex(example_df, x.by = "PC1", y.by = "PC2",
    rows.use = 1:40,
    data.out = TRUE)
out$plot
summary(out$data)
out$cols_used
