scatterPlot: Show RNAseq data overlaid on a scatter plot

Description

Show RNAseq data overlaid on a scatter plot

Usage

scatterPlot(
  data_frame,
  x.by,
  y.by,
  color.by = NULL,
  shape.by = NULL,
  split.by = NULL,
  size = 1,
  rows.use = NULL,
  show.others = TRUE,
  x.adjustment = NULL,
  y.adjustment = NULL,
  color.adjustment = NULL,
  x.adj.fxn = NULL,
  y.adj.fxn = NULL,
  color.adj.fxn = NULL,
  split.show.all.others = TRUE,
  opacity = 1,
  color.panel = dittoColors(),
  colors = seq_along(color.panel),
  split.nrow = NULL,
  split.ncol = NULL,
  split.adjust = list(),
  multivar.split.dir = c("col", "row"),
  shape.panel = c(16, 15, 17, 23, 25, 8),
  rename.color.groups = NULL,
  rename.shape.groups = NULL,
  min.color = "#F0E442",
  max.color = "#0072B2",
  min.value = NA,
  max.value = NA,
  plot.order = c("unordered", "increasing", "decreasing", "randomize"),
  xlab = x.by,
  ylab = y.by,
  main = "make",
  sub = NULL,
  theme = theme_bw(),
  do.hover = FALSE,
  hover.data = unique(c(color.by, paste0(color.by, ".color.adj"), "color.multi",
    "color.which", x.by, paste0(x.by, ".x.adj"), y.by, paste0(y.by, ".y.adj"), shape.by,
    split.by)),
  hover.round.digits = 5,
  do.contour = FALSE,
  contour.color = "black",
  contour.linetype = 1,
  add.trajectory.by.groups = NULL,
  add.trajectory.curves = NULL,
  trajectory.group.by,
  trajectory.arrow.size = 0.15,
  add.xline = NULL,
  xline.linetype = "dashed",
  xline.color = "black",
  add.yline = NULL,
  yline.linetype = "dashed",
  yline.color = "black",
  do.letter = FALSE,
  do.ellipse = FALSE,
  do.label = FALSE,
  labels.size = 5,
  labels.highlight = TRUE,
  labels.repel = TRUE,
  labels.repel.adjust = list(),
  labels.split.by = split.by,
  legend.show = TRUE,
  legend.color.title = "make",
  legend.color.size = 5,
  legend.color.breaks = waiver(),
  legend.color.breaks.labels = waiver(),
  legend.shape.title = shape.by,
  legend.shape.size = 5,
  show.grid.lines = TRUE,
  do.raster = FALSE,
  raster.dpi = 300,
  data.out = FALSE
)

Value

a ggplot scatterplot where colored dots and/or shapes represent individual rows of the given data_frame.

Alternatively, if data.out=TRUE, a list containing four slots is output: the plot (named 'p'), a data.frame containing the underlying data for target rows (named 'Target_data'), a data.frame containing the underlying data for non-target rows (named 'Others_data'), and a list providing mappings of final column names in 'Target_data' to given plot aesthetics (named 'cols_used') because modification of newly made columns is required for many features.

Alternatively, if do.hover is set to TRUE, the plot is converted from ggplot to plotly, and additional information about each data point, determined by the hover.data input, is displayed upon hovering the cursor over the plot.

Arguments

data_frame

A data_frame where columns are features and rows are observations you might wish to visualize.

x.by, y.by

Single strings denoting the name of a column of data_frame containing numeric data to use for the x- and y-axis of the scatterplot.

color.by

Single string denoting the name of a column of data_frame to use for setting the color of plotted points. Alternatively, a string vector naming multiple such columns of data to plot at once.

shape.by

Single string denoting the name of a column of data_frame containing discrete data to use for setting the shape of plotted points.

split.by

1 or 2 strings denoting the name(s) of column(s) of data_frame containing discrete data to use for faceting / separating data points into separate plots.

When 2 columns are named, c(row,col), the first is used as rows and the second is used for columns of the resulting facet grid.

When 1 column is named, the shape of the facet layout can be controlled with split.nrow and split.ncol.

size

Number which sets the size of data points. Default = 1.

rows.use

String vector of rownames of data_frame OR an integer vector specifying the row-indices of data points which should be plotted.

Alternatively, a Logical vector, the same length as the number of rows in data_frame, where TRUE values indicate which rows to plot.
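As a minimal sketch of the three accepted forms, the base R snippet below builds a small, made-up data.frame (its row names and columns are assumptions for illustration, not part of the package) and shows name-based, index-based, and logical selectors that all pick out the same rows.

```r
# Hypothetical toy data; row names and columns are made up for illustration.
df <- data.frame(
    PC1 = c(0.5, -1.2, 2.0, 0.1),
    groups = c("A", "B", "C", "A"),
    row.names = c("obs1", "obs2", "obs3", "obs4"))

# Three ways to describe the same set of rows.
by_name    <- c("obs1", "obs2", "obs4")  # rownames
by_index   <- c(1L, 2L, 4L)              # row indices
by_logical <- df$groups != "C"           # one TRUE/FALSE per row

# Each form picks out the same rows under base R subsetting.
rows_from_name    <- rownames(df[by_name, ])
rows_from_index   <- rownames(df[by_index, ])
rows_from_logical <- rownames(df[by_logical, ])
```

Any of these three vectors could then be supplied as rows.use.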

show.others

Logical. Whether rows not targeted by rows.use should be shown in the background in light gray. Default = TRUE.

x.adjustment, y.adjustment, color.adjustment

A recognized string indicating whether numeric x.by, y.by, and color.by data should be used directly (default) or should be adjusted to be

"z-score": scaled with the scale() function to produce a relative-to-mean z-score representation
"relative.to.max": divided by the maximum value to give percent of max values between [0,1]

Ignored if the target data is not numeric as these known adjustments target numeric data only.

In order to leave the unedited data available for use in other features, the adjusted data are put in a new column and that new column is used for plotting.
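The two named adjustments can be reproduced by hand in base R. The sketch below uses a made-up numeric vector and follows the descriptions above; the function's internal implementation may differ in detail.

```r
# Hypothetical numeric data standing in for an x.by, y.by, or color.by column.
values <- c(2, 4, 6, 8)

# "z-score": center on the mean and divide by the standard deviation.
# scale() returns a one-column matrix, so as.numeric() flattens it.
z_scored <- as.numeric(scale(values))

# "relative.to.max": divide every value by the maximum.
relative_to_max <- values / max(values)
```

For non-negative data such as these, the relative.to.max values fall between 0 and 1, while the z-scores have mean 0 and standard deviation 1.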

x.adj.fxn, y.adj.fxn, color.adj.fxn

If you wish to apply a function to edit the x.by, y.by, or color.by data before use, in a way not possible with the x.adjustment, y.adjustment, or color.adjustment inputs, the corresponding input can be given a function which takes in a vector of values as input and returns a vector of values of the same length as output.

For example, function(x) {log2(x)} or as.factor.

In order to leave the unedited data available for use in other features, the adjusted data are put in a new column and that new column is used for plotting.
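A minimal sketch of a suitable adjustment function follows. The helper name log2_plus1 and the toy data are assumptions for illustration only; the plotting call is skipped when the dittoViz package is not installed.

```r
# Hypothetical helper: takes a numeric vector, returns one of the same length.
log2_plus1 <- function(x) {
    log2(x + 1)
}

counts <- c(0, 1, 3, 7)
adjusted <- log2_plus1(counts)  # same length as 'counts'

# Hypothetical toy data for a quick plot.
toy_df <- data.frame(
    PC1 = c(1.0, -0.5, 2.2, 0.3),
    PC2 = c(0.2, 1.1, -0.7, 0.9),
    gene1 = counts)

# Pass the function itself (not its result) to color.adj.fxn.
if (requireNamespace("dittoViz", quietly = TRUE)) {
    p <- dittoViz::scatterPlot(
        toy_df, x.by = "PC1", y.by = "PC2",
        color.by = "gene1",
        color.adj.fxn = log2_plus1)
}
```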

split.show.all.others

Logical which sets whether gray "others" points of facets should include all points of other facets (TRUE) versus just points left out by rows.use which would exist in the current facet (FALSE).

opacity

Number between 0 and 1. 1 = opaque. 0 = invisible. Default = 1. (In terms of typical ggplot variables, = alpha)

color.panel

String vector which sets the colors to draw from when color.by indicates discrete data. dittoColors() by default, see dittoColors for contents.

A named vector can be used if names are matched to the distinct values of the color.by data.

colors

Integer vector, the indexes / order, of colors from color.panel to actually use.

Useful for quickly swapping around colors of the default set (when not using names for color matching).
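The interplay of color.panel and colors can be pictured with plain vector indexing. This is a sketch assuming the function indexes color.panel by colors, as the description above suggests; the four-color panel and group names are made up and stand in for dittoColors() and real data.

```r
# Hypothetical four-color panel, standing in for dittoColors().
color.panel <- c("orange", "skyblue", "darkgreen", "yellow")

# Swap the first two colors and keep the rest in place.
colors <- c(2, 1, 3, 4)
panel_in_use <- color.panel[colors]

# Named alternative: names are matched to the distinct values of the data,
# so the order of the vector no longer matters.
named_panel <- c(A = "orange", B = "skyblue", C = "darkgreen", D = "yellow")
groups <- c("B", "A", "D")
matched <- unname(named_panel[groups])
```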

split.nrow, split.ncol

Integers which set the dimensions of faceting/splitting when faceting by a single feature.

split.adjust

A named list which allows extra parameters to be pushed through to the faceting function call. List elements should be valid inputs to the faceting functions, e.g. `list(scales = "free")`.

For options, when giving 1 column to split.by, see facet_wrap, OR when giving 2 columns to split.by, see facet_grid.

multivar.split.dir

"row" or "col", sets the direction of faceting used for the multiple color.by columns when:

color.by is given multiple column names
AND split.by is used to provide an additional feature to facet by

shape.panel

Vector of integers, corresponding to ggplot shapes, which sets what shapes to use in conjunction with shape.by. When nothing is supplied to shape.by, only the first value is used. Default is a set of 6, c(16,15,17,23,25,8), the first being a simple, solid, circle.

rename.color.groups

String vector which sets new names for the identities of color.by groups.

rename.shape.groups

String vector which sets new names for the identities of shape.by groups.

min.color

Color for the minimum value of numeric color.by data. Default = yellow ("#F0E442").

max.color

Color for the maximum value of numeric color.by data. Default = blue ("#0072B2").

min.value, max.value

Numbers which set the color.by data values associated with the minimum and maximum colors, respectively.

plot.order

String. If the data should be plotted based on the order of the color data, sets whether to plot in "increasing", "decreasing", or "randomize"d order. Default = "unordered".

xlab, ylab

Strings which set the labels for the axes. To remove, set to NULL.

main

String, sets the plot title. A default title is automatically generated based on color.by and shape.by when either is provided. To remove, set to NULL.

sub

String, sets the plot subtitle.

theme

A ggplot theme which will be applied before internal adjustments. Default = theme_bw(). See https://ggplot2.tidyverse.org/reference/ggtheme.html for other options and ideas.

do.hover

Logical which controls whether the ggplot output will be converted to a plotly object so that data about individual points can be displayed when you hover your cursor over them. The hover.data argument is used to determine what data to show upon hover.

hover.data

String vector which denotes what data to show for each data point, upon hover, when do.hover is set to TRUE. Defaults to all data expected to be useful. Only values present in the plotting data are actually used. These can be column names of data_frame and any column names which will be created to accommodate multivar and data adjustment functionality. You can run the function with data.out = TRUE and inspect the $Target_data output's columns to view your available options.

hover.round.digits

Integer number specifying the number of decimal digits to round displayed numeric values to, when do.hover is set to TRUE.

do.contour

Logical. Whether density-based contours should be displayed.

contour.color

String that sets the color of the do.contour contours.

contour.linetype

String or numeric which sets the type of line used for do.contour contours. Defaults to 1, which is equivalent to "solid", but see linetype for other options.

add.trajectory.by.groups

List of vectors representing trajectory paths, each from start-group to end-group, where vector contents are the group-names indicated by the trajectory.group.by column of data_frame.

add.trajectory.curves

List of matrices, each representing coordinates for a trajectory path, from start to end, where matrix columns represent x and y coordinates of the paths.

trajectory.group.by

String denoting the name of a column of data_frame to use for generating trajectories from data point groups.
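The expected shapes of the two trajectory inputs can be sketched in base R. The group names and coordinates below are made up for illustration; only the structures (a list of character vectors, and a list of two-column matrices) follow the descriptions above.

```r
# For add.trajectory.by.groups: each vector lists group names from start to
# end. "A" through "D" are hypothetical values of the trajectory.group.by
# column.
trajectory_groups <- list(
    c("A", "B", "C"),  # path 1: A -> B -> C
    c("A", "D"))       # path 2: A -> D

# For add.trajectory.curves: each matrix holds one path, one row per point,
# with x coordinates in the first column and y coordinates in the second.
trajectory_curves <- list(
    matrix(c(-2, -1,
              0,  0.5,
              2,  1), ncol = 2, byrow = TRUE),
    matrix(c(-2, -1,
             -1,  2), ncol = 2, byrow = TRUE))
```

The first list would be supplied together with trajectory.group.by, while the second needs no grouping column because it carries its own coordinates.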

trajectory.arrow.size

Number representing the size of trajectory arrows, in inches. Default = 0.15.

add.xline

Numeric value(s) where one or multiple vertical line(s) should be added.

xline.linetype

String which sets the type of line for add.xline. Defaults to "dashed", but any ggplot linetype will work.

xline.color

String that sets the color(s) of the add.xline line(s).

add.yline

Numeric value(s) where one or multiple horizontal line(s) should be added.

yline.linetype

String which sets the type of line for add.yline. Defaults to "dashed", but any ggplot linetype will work.

yline.color

String that sets the color(s) of the add.yline line(s).

do.letter

Logical which sets whether letters should be added on top of the colored dots, for extended colorblindness compatibility. NOTE: do.letter is ignored if do.hover = TRUE or shape.by is used, because lettering is incompatible with plotly and with changing the dots to different shapes.

do.ellipse

Logical. Whether color.by groups should be surrounded by median-centered ellipses.

do.label

Logical. Whether to add text labels near the center (median) of color.by groups.

labels.size

Number which sets the size of labels text when do.label = TRUE.

labels.highlight

Logical. Whether labels should have a box behind them when do.label = TRUE.

labels.repel

Logical, that sets whether the labels' placements will be adjusted with ggrepel to avoid intersections between labels and plot bounds when do.label = TRUE. TRUE by default.

labels.repel.adjust

A named list which allows extra parameters to be pushed through to ggrepel function calls. List elements should be valid inputs to geom_label_repel by default, or to geom_text_repel when labels.highlight = FALSE.

labels.split.by

String of one or two column names which controls the facet-split calculations for label placements. Defaults to split.by, so generally there is no need to adjust this unless you plan to apply faceting externally.

legend.show

Logical. Whether any legend should be displayed. Default = TRUE.

legend.color.title, legend.shape.title

Strings which set the title for the color or shape legends.

legend.color.size, legend.shape.size

Numbers representing the size of shapes in the color and shape legends (for discrete variable plotting). Default = 5. Note: enlarging the icons in the color legend can be very helpful for making colors more distinguishable by colorblind individuals.

legend.color.breaks

Numeric vector which sets the discrete values to label in the color-scale legend for color.by-data.

legend.color.breaks.labels

String vector, with same length as legend.color.breaks, which sets the labels for the tick marks of the color-scale.

show.grid.lines

Logical which sets whether grid lines should be shown within the plot space.

do.raster

Logical. When set to TRUE, rasterizes the internal plot layer, changing it from individually encoded points to a flattened set of pixels. This can be useful for editing in external programs (e.g. Illustrator) when there are many thousands of data points.

raster.dpi

Number indicating dots/pixels per inch (dpi) to use for rasterization. Default = 300.

data.out

Logical. When set to TRUE, changes the output, from the plot alone, to a list containing the plot ("p"), a data.frame containing the underlying data for target rows ("Target_data"), a data.frame containing the underlying data for non-target rows ("Others_data"), and the ultimately used mapping of columns to given aesthetic sets ("cols_used"), because modification of newly made columns is required for many features.

Many characteristics of the plot can be adjusted using discrete inputs

size and opacity can be used to adjust the size and transparency of the data points. size can be given a number, or a column name of data_frame.
Colors used can be adjusted with color.panel and/or colors for discrete data, or min.value, max.value, min.color, and max.color for continuous data.
Shapes used can be adjusted with shape.panel.
Color and shape labels can be changed using rename.color.groups and rename.shape.groups.
Titles and axis labels can be adjusted with the main, sub, xlab, ylab, legend.color.title, and legend.shape.title arguments.
Legends can also be adjusted in other ways, using variables that all start with "legend." for easy tab completion lookup.

Author

Daniel Bunis

Details

This function first makes any requested adjustments to data in the given data_frame, internally only, such as scaling the color.by-column if color.adjustment was given "z-score".

Next, if a set of rows to target was indicated with the rows.use input, then the data_frame is split into Target_data and Others_data.

Then, rows are reordered to match with the requested plot.order behavior.

Finally, a scatter plot is created from the resultant data.frames. Non-target data points are colored in gray if show.others=TRUE, and target data points are displayed on top, colored and shaped based on the color.by- and shape.by-associated data. If split.by was used, the plot will be split into a matrix of panels based on the associated groupings.
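The first three steps can be sketched in base R as follows. This is an illustration under stated assumptions, not the function's actual code: the toy data are made up, and the new column name borrows the ".color.adj" suffix seen in the default hover.data, which may not match the internal naming exactly.

```r
# Hypothetical toy data; column names mirror those used in the Examples.
df <- data.frame(
    PC1 = c(1.0, -0.5, 2.2, 0.3, -1.8),
    PC2 = c(0.2, 1.1, -0.7, 0.9, -0.4),
    gene1 = c(5, 1, 9, 3, 7),
    row.names = paste0("obs", 1:5))

# Step 1: adjust into a NEW column (as with color.adjustment = "z-score"),
# leaving the original 'gene1' column untouched.
df$gene1.color.adj <- as.numeric(scale(df$gene1))

# Step 2: split into target and non-target rows (as with rows.use = 1:4).
rows.use <- 1:4
is_target <- seq_len(nrow(df)) %in% rows.use
Target_data <- df[is_target, ]
Others_data <- df[!is_target, ]

# Step 3: reorder target rows (as with plot.order = "increasing"), so rows
# with the highest color values come last and are drawn on top.
Target_data <- Target_data[order(Target_data$gene1.color.adj), ]

rownames(Target_data)  # "obs2" "obs4" "obs1" "obs3"
rownames(Others_data)  # "obs5"
```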

Examples


example("dittoExampleData", echo = FALSE)

# The minimal inputs for scatterPlot are the 'data_frame', and 2 column names,
#   given to 'x.by' and 'y.by', indicating which data to use for the x and y
#   axes, respectively.
scatterPlot(
    example_df, x.by = "PC1", y.by = "PC2")

# 'color.by' and/or 'shape.by' can also be given column names in order to
#   represent that column's data in the color or shape of the data points.
#   'shape.by' must be pointed to discrete data, but 'color.by' can be given
#   discrete or numeric data.
scatterPlot(
    example_df, x.by = "PC1", y.by = "PC2",
    color.by = "groups",
    shape.by = "SNP",
    size = 3)
scatterPlot(
    example_df, x.by = "PC1", y.by = "PC2",
    color.by = "gene1",
    size = 3)

# Data can be "split" or faceted by a discrete variable as well.
scatterPlot(example_df, x.by = "PC1", y.by = "PC2", color.by = "gene1",
    split.by = "timepoint") # single split.by element
scatterPlot(example_df, x.by = "PC1", y.by = "PC2", color.by = "gene1",
    split.by = c("groups","SNP")) # row and col split.by elements

# Modify the look with intuitive inputs
scatterPlot(example_df, x.by = "PC1", y.by = "PC2", color.by = "groups",
    size = 5,
    opacity = 0.3,
    show.grid.lines = FALSE,
    ylab = NULL, xlab = "PC2 by PC1",
    main = "Plot Title",
    sub = "subtitle",
    legend.color.title = "Legend\nRetitle")

# You can restrict to only certain data points using the 'rows.use' input.
#   The input can be given rownames, indexes, or a logical vector
#   All "other" points will now only be shown as a gray background, or will not
#   be shown at all if you also add 'show.others = FALSE'
scatterPlot(example_df, x.by = "PC1", y.by = "PC2", color.by = "groups",
    sub = "show only first 40 observations, by index",
    rows.use = 1:40)
scatterPlot(example_df, x.by = "PC1", y.by = "PC2", color.by = "groups",
    sub = "show only 3 observations, by name",
    rows.use = c("obs1", "obs2", "obs25"))
scatterPlot(example_df, x.by = "PC1", y.by = "PC2", color.by = "groups",
    sub = "show groups A,B,D only, by logical, without others as background",
    rows.use = example_df$groups!="C",
    show.others = FALSE)

# Many extra features are easy to add as well:
#   Each is started via an input starting with 'do.FEATURE*' or 'add.FEATURE*'
#   And when tweaks for that feature are possible, those inputs will be
#   named starting with 'FEATURE*'. For example, color.by groups can be labeled
#   with 'do.label = TRUE' and the tweaks for this feature are given with inputs
#   'labels.size', 'labels.highlight', and 'labels.repel':
scatterPlot(example_df, x.by = "PC1", y.by = "PC2", color.by = "groups",
    sub = "default labeling",
    do.label = TRUE)          # Turns on the labeling feature
scatterPlot(example_df, x.by = "PC1", y.by = "PC2", color.by = "groups",
    sub = "tweaked labeling",
    do.label = TRUE,          # Turns on the labeling feature
    labels.size = 8,          # Adjust the text size of labels
    labels.highlight = FALSE, # Removes white background behind labels
    labels.repel = FALSE)     # Turns off anti-overlap location adjustments

# Faceting can also be used to show multiple continuous variables side-by-side
#   by giving a vector of column names to 'color.by'.
#   This can also be combined with 1 'split.by' variable, with direction then
#   controlled via 'multivar.split.dir':
scatterPlot(example_df, x.by = "PC1", y.by = "PC2",
    color.by = c("gene1", "gene2"))
scatterPlot(example_df, x.by = "PC1", y.by = "PC2",
    color.by = c("gene1", "gene2"),
    split.by = "groups")
scatterPlot(example_df, x.by = "PC1", y.by = "PC2",
    color.by = c("gene1", "gene2"),
    split.by = "groups",
    multivar.split.dir = "row")

# Sometimes, it can be useful for external editing or troubleshooting purposes
#   to see the underlying data that was directly used for plotting.
# 'data.out = TRUE' can be provided in order to obtain not just plot ("plot"),
#   but also the "Target_data" and "Others_data" data.frames and "cols_used"
#   returned as a list.
out <- scatterPlot(example_df, x.by = "PC1", y.by = "PC2", color.by = "groups",
    rows.use = 1:40,
    data.out = TRUE)
out$plot
summary(out$Target_data)
summary(out$Others_data)
out$cols_used
