cormap2: Draw correlation maps from large datasets.

Description

cormap2() generates pair-wise correlations from an input ExpressionSet object, a data.frame or a numerical matrix. With the default options it also produces a heatmap.

Usage

cormap2(
  x,
  cormat = NULL,
  lab = NULL,
  convert = TRUE,
  biomart = FALSE,
  cluster_correlations = TRUE,
  main = "",
  postfix = NULL,
  cex = NULL,
  na.frac = 0.1,
  cor.cluster = 1,
  cor.window = NULL,
  cor.thr = NULL,
  cor.mar = 0.5,
  cut.thr = NULL,
  cut.size = 5,
  autoadj = TRUE,
  labelheight = NULL,
  labelwidth = NULL,
  add.sig = FALSE,
  genes2highl = NULL,
  order.list = TRUE,
  doPlot = TRUE,
  updateProgress = NULL,
  verbose = FALSE
)

Arguments

x

(ExpressionSet, data.frame or numeric). A numeric data frame, matrix or an ExpressionSet object.

cormat

(numeric). A correlation matrix. If this is not NULL then x is ignored. Defaults to NULL.

lab

(character). Optional row/column labels for the heatmap. Defaults to NULL meaning the row names of the input data are used. Note that the order of the labels must match the order of the row names of the input data!

convert

(logical). Should an attempt be made to convert IDs provided as row names of the input or in lab? Defaults to TRUE. Conversion will be done using BioMart or an annotation package, depending on biomart.

biomart

(logical). Should BioMart (or an annotation package) be used to convert IDs? If TRUE, the todisp2 function in package convertid attempts to access the BioMart API to convert ENSG IDs to Gene Symbols. Defaults to FALSE, which uses the traditional AnnotationDbi Bimap interface.

cluster_correlations

(logical). Should the correlation matrix be clustered before plotting? Defaults to TRUE.

main

(character). The main title of the plot. Defaults to "".

postfix

(character or logical). A plot sub-title. Will be printed below the main title. Defaults to NULL.

cex

(numeric). Font size. Defaults to 0.5 if autoadj is FALSE. See 'Details'.

na.frac

(numeric). Fraction of missing values allowed per row of the input matrix. Defaults to 0.1 which means LESS than 10 per cent of the values in one row are allowed to be NAs.

cor.cluster

(numeric). The correlation cluster along the diagonal 'line' in the heatmap that should be zoomed into. A sliding window of size cor.window will be moved along the diagonal of the correlation matrix to find the cluster with the most correlation values meeting cor.thr. Defaults to 1.

cor.window

(numeric). The size of the sliding window (see cor.cluster). Defaults to NULL. Note that this works only for positive correlations.

cor.thr

(numeric). Correlation threshold to filter the correlation matrix for plotting. Defaults to NULL meaning no filtering. Note that a row is kept only if at least the fraction cor.mar of its values meets this threshold.

cor.mar

(numeric). Margin of the values per row of the correlation matrix the cor.thr filter needs to meet. Defaults to 0.5 meaning at least 50 per cent of the values in a row need to meet the threshold in order to keep the row.

cut.thr

(numeric). Threshold at which dendrogram branches are to be cut. Passed on to argument cutHeight in cutreeStatic. Defaults to NULL meaning no cutting.

cut.size

(numeric). Minimum number of objects on a dendrogram branch considered a cluster. Passed on to argument minSize in cutreeStatic. Defaults to 5.

autoadj

(logical). Should plot measures be adjusted automatically? Defaults to TRUE.

labelheight

(numeric or lcm(numeric)). Relative or absolute height (using lcm, see layout) of the labels. Defaults to 0.2 if autoadj is FALSE. See 'Details'.

labelwidth

(numeric or lcm(numeric)). Relative or absolute width (using lcm, see layout) of the labels. Defaults to 0.2 if autoadj is FALSE. See 'Details'.

add.sig

(logical). Should significance asterisks be drawn? If TRUE P-Values for correlation significance are calculated and encoded as asterisks. See 'Details'.

genes2highl

(character). Vector of gene symbols (or whatever labels are used) to be highlighted. If not NULL, a semi-transparent rectangle is drawn around the matching labels and the corresponding rows or columns of the heatmap.

order.list

(logical). Should the order of the correlation matrix, i.e. the 'list' of labels be reversed? Meaningful if the order of input variables should be preserved because image turns the input matrix. Defaults to TRUE.

doPlot

(logical). Draw the plot? Defaults to TRUE.

updateProgress

(function). Function for updating a progress bar in a Shiny web application. This was added here for the BioCPR application.

verbose

(logical). Should verbose output be written to the console? Defaults to FALSE.
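
The row filters described for na.frac, cor.thr and cor.mar can be illustrated in base R. This is only a sketch of the rules as worded above, not the package's internal code: the helper names keep_by_na and keep_by_cor are hypothetical, and reading "meets the threshold" as "value >= cor.thr" is an assumption.

```r
# Hypothetical helpers illustrating the documented filter rules;
# they are NOT part of coreheat.

# Keep rows in which LESS than na.frac of the values are NA.
keep_by_na <- function(x, na.frac = 0.1) {
  rowMeans(is.na(x)) < na.frac
}

# Keep rows of a correlation matrix in which at least the fraction cor.mar
# of the values meets cor.thr (assumed here to mean value >= cor.thr).
keep_by_cor <- function(cm, cor.thr, cor.mar = 0.5) {
  rowMeans(cm >= cor.thr, na.rm = TRUE) >= cor.mar
}

# Input matrix with 3 rows and 10 columns; row 1 has 20 per cent NAs.
x <- matrix(as.numeric(1:30), nrow = 3)
x[1, 1:2] <- NA
na_keep <- keep_by_na(x)             # row 1 is dropped, rows 2 and 3 are kept

# Small correlation matrix; only rows 1 and 2 are strongly correlated.
cm <- matrix(c(1.0, 0.9, 0.1,
               0.9, 1.0, 0.2,
               0.1, 0.2, 1.0), nrow = 3, byrow = TRUE)
cor_keep <- keep_by_cor(cm, cor.thr = 0.5, cor.mar = 0.5)
cm_filtered <- cm[cor_keep, cor_keep, drop = FALSE]
```

In this sketch the diagonal (always 1) counts towards the fraction; whether the package treats the diagonal the same way is not stated here.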

Value

Invisibly returns the correlation matrix, though the function is mainly called for its side-effect of producing a heatmap (if doPlot = TRUE which is the default).

Details

P-Values are calculated from the t-test value of the correlation coefficient: \(t = r \sqrt{n-2} / \sqrt{1-r^2}\), where r is the correlation coefficient and n is the number of samples with no missing values for each gene (row-wise ncol(eset) minus the number of columns that have an NA). P-Values are then calculated using pt and corrected to account for the two-tailed nature of the test, i.e., the possibility of positive as well as negative correlation. The approach to calculate correlation significance was adopted from Miles, J., & Banyard, P. (2007) on "Calculating the exact significance of a Pearson correlation in MS Excel".
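
The calculation above can be sketched in a few lines of base R. The helper name cor_pvalue is hypothetical and not part of the package, and the use of n - 2 degrees of freedom in pt is the standard choice for this test rather than something confirmed from the package source.

```r
# Two-sided P-value for a Pearson correlation coefficient r computed
# from n complete observations (hypothetical helper, not part of coreheat).
cor_pvalue <- function(r, n) {
  t <- r * sqrt(n - 2) / sqrt(1 - r^2)
  # Lower tail of -|t|, doubled to cover positive and negative correlation
  2 * pt(-abs(t), df = n - 2)
}

set.seed(42)
a <- rnorm(20)
b <- a + rnorm(20)
r <- cor(a, b)
p <- cor_pvalue(r, n = length(a))

# stats::cor.test() is based on the same t statistic, so it serves as a check
p_ref <- cor.test(a, b)$p.value
```

Comparing p with p_ref via all.equal() is a quick way to confirm that the formula and the built-in test agree.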

The asterisks encode significance as follows:

	P < 0.05: *
	P < 0.01: **
	P < 0.001: ***
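
One way to reproduce this encoding in base R is with cut(). This is a sketch only; the helper name sig_stars is hypothetical and the package may implement the mapping differently.

```r
# Map P-values to significance asterisks (hypothetical helper).
# right = FALSE makes the intervals left-closed, so P = 0.05 gets no asterisk.
sig_stars <- function(p) {
  as.character(cut(p,
                   breaks = c(-Inf, 0.001, 0.01, 0.05, Inf),
                   labels = c("***", "**", "*", ""),
                   right = FALSE))
}

stars <- sig_stars(c(0.2, 0.04, 0.005, 0.0001))
```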

The label measures (labelheight, labelwidth and cex) are adjusted automatically by default with argument autoadj=TRUE and have default values which are hard coded into the helper function heatmap.cor. The values calculated by the helper function plotAdjust can be overridden by setting any of those arguments to a valid numeric or lcm(numeric) value.

References

Miles, J., & Banyard, P. (2007). Understanding and using statistics in psychology: A practical introduction. Sage Publications Ltd. https://psycnet.apa.org/record/2007-06525-000.

Examples

# NOT RUN {
# 1. Generate a random 20x10 matrix with two distinct sets and plot it with
# default settings without ID conversion since the IDs are made up:
set.seed(1234)
mat <- matrix(c(rnorm(100, mean = 1), rnorm(100, mean = -1)), nrow = 20)
rownames(mat) <- paste0("gene-", 1:20)
colnames(mat) <- paste0(c("A", "B"), rep(1:5, 2))
cormap2(mat, convert=FALSE, main="Random matrix")

# 2. Use a real-world dataset from TCGA (see README file in inst/extdata directory).
# Package 'convertid' is used to convert Ensembl Gene IDs to HGNC Symbols
## Read data and prepare input data frame
fl <- system.file("extdata", "PrCaTCGASample.txt", package = "coreheat", mustWork = TRUE)
dat0 <- read.delim(fl, stringsAsFactors=FALSE)
dat1 <- data.frame(dat0[, grep("TCGA", names(dat0))], row.names=dat0$ensembl_gene_id)
cormap2(dat1, main="TCGA data frame + ID conversion")

# 3. Use separately supplied IDs with a matrix created from the data frame of the
# previous example and highlight genes of interest
dat2 <- as.matrix(dat0[, grep("TCGA", names(dat0))])
sym <- dat0$hgnc_symbol
cormap2(dat2, convert=FALSE, lab=sym, genes2highl=c("GNAS","NCOR1","AR", "ATM"),
        main="TCGA matrix + custom labels")

# 4. Use an ExpressionSet object and add significance asterisks
## For simplicity, the ExpressionSet is created from a matrix derived
## from the data frame in the second example
expr <- Biobase::ExpressionSet(as.matrix(dat1))
cormap2(expr, add.sig=TRUE, main="TCGA ExpressionSet object + ID conversion")

# More examples can be found in the vignette.
# }