cormap_filt: Automatically split clusters based on noise level and hierarchy

Description

cormap_filt splits (cuts) the dendrogram at a given threshold, dividing it into larger or smaller "sub-clusters". Correlation P-Values (see eset_cor) are converted into a sub-cluster-wise signal metric that represents significance and is used for filtering. Optionally, up to 3 plots are produced, the third one being a heatmap filtered by significance and tree height cutting.

Usage

cormap_filt(
  x,
  na.frac = 0.1,
  method = "ward.D",
  do.abs = TRUE,
  main = "correlation map",
  postfix = NULL,
  p.thr = 0.01,
  cex = 0.2,
  cex.clust = cex,
  cex.filt = cex,
  cut.thr = NULL,
  cor.thr = NULL,
  cor.cluster = 1,
  cor.window = NULL,
  do.plots = c("dend", "full.heat", "filt.heat"),
  genes2highl = NULL,
  order.list = TRUE,
  convert = TRUE,
  biomart = FALSE,
  add.sig = FALSE,
  verbose = FALSE
)

Arguments

x

(ExpressionSet, data.frame or numeric). A numeric data frame, matrix or an ExpressionSet object.

na.frac

(numeric). Fraction of missing values allowed per row of the input matrix. Defaults to 0.1 which means LESS than 10 per cent of the values in one row are allowed to be NAs.

method

(character). The agglomeration method used for clustering. See help for hclust. Defaults to "ward.D".

do.abs

(logical). Should the distances for clustering be calculated based on the absolute correlation values? In other words, should the sign of the correlation be ignored in favor of its strength?

main

(character). The main title of the plot. Defaults to "correlation map".

postfix

(character or logical). A plot sub-title. Will be printed below the main title. Defaults to NULL.

p.thr

(numeric). P-Value threshold for filtering sub-clusters with significant correlations. Defaults to 0.01.

cex

(numeric). Font size for the heatmap of the unfiltered correlation matrix. Defaults to 0.2.

cex.clust

(numeric). Font size for the dendrogram plot of the unfiltered correlation matrix clusters. Defaults to cex.

cex.filt

(numeric). Font size for the heatmap of the filtered correlation matrix. Defaults to cex.

cut.thr

(numeric). Threshold at which dendrogram branches are to be cut. Passed on to argument h in cut.dendrogram. Defaults to NULL meaning no cutting.

cor.thr

(numeric). Correlation threshold to filter the correlation matrix for plotting. Defaults to NULL meaning no filtering. Note that this value will be applied to margin cor.mar of the values per row.

cor.cluster

(numeric). The correlation cluster along the diagonal 'line' in the heatmap that should be zoomed into. A sliding window of size cor.window will be moved along the diagonal of the correlation matrix to find the cluster with the most correlation values meeting cor.thr. Defaults to 1.

cor.window

(numeric). The size of the sliding window (see cor.cluster). Defaults to NULL. Note that this works only for positive correlations.

do.plots

(character). The plots to be produced. A character vector containing one or more of "dend" to produce the dendrogram plot, "full.heat" to produce the heatmap of the unfiltered correlation matrix, and "filt.heat" to produce the heatmap of the filtered correlation matrix. Defaults to all three plots.

genes2highl

(character). Vector of gene symbols (or whatever labels are used) to be highlighted. If not NULL, a semi-transparent rectangle is drawn around the matching labels and the corresponding rows or columns in the heatmap.

order.list

(logical). Should the order of the correlation matrix, i.e. the 'list' of labels be reversed? Meaningful if the order of input variables should be preserved because image turns the input matrix. Defaults to TRUE.

convert

(logical). Should an attempt be made to convert IDs provided as row names of the input or in lab? Defaults to TRUE. Conversion will be done using BioMart or an annotation package, depending on biomart.

biomart

(logical). Should BioMart (or an annotation package) be used to convert IDs? If TRUE the todisp2 function in package convertid attempts to access the BioMart API to convert ENSG IDs to Gene Symbols. Defaults to FALSE which will use the traditional AnnotationDbi Bimap interface.

add.sig

(logical). Should significance asterisks be drawn? If TRUE P-Values for correlation significance are calculated and encoded as asterisks. See 'Details'.

verbose

(logical). Should verbose output be written to the console? Defaults to FALSE.
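The arguments na.frac, do.abs, method and cut.thr can be read as steps of one clustering pipeline. The following is a minimal sketch in base R on simulated data, not the function's internal code. The distance definition 1 - |r| and the use of pairwise complete observations are assumptions made for illustration, as are all object names.

```r
# Illustrative sketch only; not the package's internal implementation.
set.seed(1)

# Toy expression matrix: 20 genes (rows) x 12 samples (columns)
x <- matrix(rnorm(20 * 12), nrow = 20,
            dimnames = list(paste0("gene", 1:20), paste0("s", 1:12)))
x[1, 1:3] <- NA  # 25% missing in gene1: row is dropped
x[2, 1]   <- NA  # about 8% missing in gene2: row is kept

# na.frac: keep rows with LESS than this fraction of NAs
na.frac <- 0.1
keep <- rowMeans(is.na(x)) < na.frac
x <- x[keep, , drop = FALSE]

# Gene-by-gene Pearson correlation matrix
cm <- cor(t(x), use = "pairwise.complete.obs")

# do.abs = TRUE: ignore the sign of the correlation
# (assumption: distance is defined as 1 - |r|)
d <- as.dist(1 - abs(cm))

# method: agglomeration method handed to hclust
hc <- hclust(d, method = "ward.D")

# cut.thr: height h at which the dendrogram is cut
cut.thr <- 1
branches <- cut(as.dendrogram(hc), h = cut.thr)$lower
clusters <- lapply(branches, labels)  # labels per sub-cluster
```

A lower cut.thr yields more and smaller sub-clusters; a higher one yields fewer and larger sub-clusters.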

Value

A list. If the dendrogram is being cut, i.e., cut.thr is not NULL, a list of

	clusters: the list of cluster labels from the `lower` component of the `cut.dendrogram` output, which is a list with the branches obtained from cutting the tree
	filt: the index of the cluster labels passing the signal metrics threshold
	filt_cluster: the list of the filtered cluster labels
	h: the cut threshold
	p.thr: the P-Value threshold for filtering sub-clusters
	metric: the signal metrics for all sub-clusters
	cormat: the clustered (ordered) correlation matrix
	hclust: a list of hierarchical clustering metrics (output of `hclust`)
	pvalues: the correlation P-Value matrix

If no tree cutting is applied, a list of

	cormat: the clustered (ordered) correlation matrix
	hclust: a list of hierarchical clustering metrics (output of `hclust`)
	pvalues: the correlation P-Value matrix

Details

P-Values are calculated from the t-test value of the correlation coefficient: \(t = r \sqrt{n-2} / \sqrt{1-r^2}\), where r is the correlation coefficient and n is the number of samples with no missing values for each gene (row-wise ncol(eset) minus the number of columns that have an NA). P-Values are then calculated using pt and corrected to account for the two-tailed nature of the test, i.e., the possibility of positive as well as negative correlation. The approach to calculate correlation significance was adopted from Miles, J., & Banyard, P. (2007) on "Calculating the exact significance of a Pearson correlation in MS Excel".
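The calculation above can be sketched in a few lines of base R. The helper name cor_pvalue is hypothetical and not part of the package; the sketch uses n - 2 degrees of freedom, the standard choice for a Pearson correlation, and compares the result with stats::cor.test as a sanity check.

```r
# Hypothetical helper (not exported by the package): two-tailed P-Value
# of a Pearson correlation r computed from n complete observations.
cor_pvalue <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  # Doubling the lower tail at -|t| covers positive and negative correlation
  2 * pt(-abs(t_stat), df = n - 2)
}

set.seed(42)
a <- rnorm(15)
b <- 0.6 * a + rnorm(15)

r <- cor(a, b)
p <- cor_pvalue(r, n = 15)

# Should agree with the P-Value reported by cor.test()
all.equal(p, cor.test(a, b)$p.value)
```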

To obtain a suitable metric for isolating significant sub-clusters, P-Values are represented as \(-log10(median(pval))\), where pval is the parallel maximum of all P-Values belonging to the sub-cluster and 1e-38. The lower bound of 1e-38 avoids values of zero (0), for which the logarithm is undefined.
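A minimal sketch of this metric in base R follows. The helper name cluster_metric, the example P-Values and the comparison against -log10(p.thr) are assumptions for illustration; how the function itself applies p.thr to the metric is not spelled out here.

```r
# Hypothetical helper (not part of the package): signal metric of one
# sub-cluster from its correlation P-Values.
cluster_metric <- function(pvals, floor = 1e-38) {
  # pmax() replaces zeros (and anything below the floor) by 1e-38
  -log10(median(pmax(pvals, floor), na.rm = TRUE))
}

# Made-up P-Values for two sub-clusters
clusters_p <- list(
  tight = c(1e-8, 1e-6, 0, 1e-7),
  noisy = c(0.4, 0.05, 0.7, 0.2)
)

metric <- vapply(clusters_p, cluster_metric, numeric(1))

# Assumption: a sub-cluster passes if its median P-Value is below p.thr,
# i.e. if its metric exceeds -log10(p.thr)
p.thr <- 0.01
filt <- which(metric > -log10(p.thr))
names(filt)
```

Because the metric is on a -log10 scale, larger values indicate stronger signal, and a sub-cluster made up entirely of zero P-Values is capped at 38.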