Use biweight mid-correlatio when you want a Pearson-like measure that is robust to outliers and heavy-tailed noise. Bicor down-weights extreme observations via Tukey’s biweight while preserving location/scale invariance, making it well suited to high-throughput data (e.g., gene expression) where occasional gross errors or platform artefacts occur. Prefer Spearman/Kendall for purely ordinal structure or strongly non-linear monotone relations.
Prints a matrix with a compact header, optional truncation for large matrices, and a small summary of off-diagonal values.
Produces a ggplot2 heatmap of the biweight mid-correlation matrix. Optionally reorders variables via hierarchical clustering on \(1 - r_{\text{bicor}}\), and can show only a triangle.
biweight_mid_corr(
data,
c_const = 9,
max_p_outliers = 1,
pearson_fallback = c("hybrid", "none", "all"),
na_method = c("error", "pairwise"),
mad_consistent = FALSE,
w = NULL,
sparse_threshold = NULL,
n_threads = getOption("matrixCorr.threads", 1L)
)diag.biweight_mid_corr(x, ...)
# S3 method for biweight_mid_corr
print(
x,
digits = 4,
max_rows = NULL,
max_cols = NULL,
width = getOption("width", 80L),
na_print = "NA",
...
)
# S3 method for biweight_mid_corr
plot(
x,
title = "Biweight mid-correlation heatmap",
reorder = c("none", "hclust"),
triangle = c("full", "lower", "upper"),
low_color = "indianred1",
mid_color = "white",
high_color = "steelblue1",
value_text_size = 3,
na_fill = "grey90",
...
)
A symmetric correlation matrix with class biweight_mid_corr
(or a dgCMatrix if sparse_threshold is used), with attributes:
method = "biweight_mid_correlation", description,
and package = "matrixCorr". Downstream code should be prepared to
handle either a dense numeric matrix or a sparse dgCMatrix.
Internally, all medians/MADs, Tukey weights, optional pairwise-NA handling,
and OpenMP loops are implemented in the C++ helpers
(bicor_*_cpp()), so the R wrapper mostly validates arguments and
dispatches to the appropriate backend.
Invisibly returns x.
A ggplot object.
A numeric matrix or a data frame containing numeric columns.
Factors, logicals and common time classes are dropped in the data-frame
path. Missing values are not allowed unless na_method = "pairwise".
Positive numeric. Tukey biweight tuning constant applied to the
raw MAD; default 9 (Langfelder & Horvath’s convention).
Numeric in (0, 1]. Optional cap on the maximum
proportion of outliers on each side; if < 1, side-specific
rescaling maps those quantiles to |u|=1. Use 1 to disable.
Character scalar indicating the fallback policy. One of:
"hybrid" (default): if a column has MAD = 0, that column
uses Pearson standardisation, yielding a hybrid correlation.
"none": return NA if a column has MAD = 0 or becomes
degenerate after weighting.
"all": force ordinary Pearson for all columns.
One of "error" (default, fastest) or "pairwise".
With "pairwise", each \((j,k)\) correlation is computed on the
intersection of non-missing rows for the pair.
Logical; if TRUE, use the normal-consistent MAD
(MAD_raw * 1.4826) in the bicor weights. Default FALSE to
match Langfelder & Horvath (2012).
Optional non-negative numeric vector of length nrow(data)
giving row weights. When supplied, weighted medians/MADs are used
and Tukey weights are multiplied by w before normalisation.
Optional numeric \(\geq 0\). If supplied, sets
entries with |r| < sparse_threshold to 0 and returns a sparse
"ddiMatrix" (requires Matrix).
Integer \(\geq 1\). Number of OpenMP threads. Defaults to
getOption("matrixCorr.threads", 1L).
An object of class biweight_mid_corr.
Additional arguments passed to ggplot2::theme() or other layers.
Integer; number of decimal places used for the matrix.
Optional integer; maximum number of rows to display (default shows all).
Optional integer; maximum number of columns to display (default shows all).
Integer; target console width for wrapping header text.
Character; how to display missing values.
Plot title. Default is "Biweight mid-correlation heatmap".
Character; one of "none" (default) or "hclust".
If "hclust", variables are reordered by complete-linkage clustering
on the distance \(d = 1 - r\), after replacing NA by 0 for
clustering purposes only.
One of "full" (default), "lower", or "upper"
to display the full matrix or a single triangle.
Colours for the gradient in
scale_fill_gradient2. Defaults are "indianred1",
"white", "steelblue1".
Numeric; font size for cell labels. Set to NULL
to suppress labels (recommended for large matrices).
Fill colour for NA cells. Default "grey90".
Thiago de Paula Oliveira
For a column \(x = (x_a)_{a=1}^m\), let \(\mathrm{med}(x)\) be the median and
\(\mathrm{MAD}(x) = \mathrm{med}(|x - \mathrm{med}(x)|)\) the (raw) median
absolute deviation. If mad_consistent = TRUE, the consistent scale
\(\mathrm{MAD}^\star(x) = 1.4826\,\mathrm{MAD}(x)\) is used. With tuning constant
\(c>0\), define
$$u_a = \frac{x_a - \mathrm{med}(x)}{c\,\mathrm{MAD}^{(\star)}(x)}.$$
The Tukey biweight gives per-observation weights
$$w_a = (1 - u_a^2)^2\,\mathbf{1}\{|u_a| < 1\}.$$
Robust standardisation of a column is
$$\tilde x_a =
\frac{(x_a - \mathrm{med}(x))\,w_a}{
\sqrt{\sum_{b=1}^m \big[(x_b - \mathrm{med}(x))\,w_b\big]^2}}.$$
For two columns \(x,y\), the biweight mid-correlation is
$$\mathrm{bicor}(x,y) = \sum_{a=1}^m \tilde x_a\,\tilde y_a \in [-1,1].$$
Capping the maximum proportion of outliers (max_p_outliers).
If max_p_outliers < 1, let \(q_L = Q_x(\text{max\_p\_outliers})\) and
\(q_U = Q_x(1-\text{max\_p\_outliers})\) be the lower/upper quantiles of \(x\).
If the corresponding \(|u|\) at either quantile exceeds 1, \(u\) is rescaled
separately on the negative and positive sides so that those quantiles land at
\(|u|=1\). This guarantees that all observations between the two quantiles receive
positive weight. Note the bound applies per side, so up to \(2\,\text{max\_p\_outliers}\)
of observations can be treated as outliers overall.
Fallback when for zero MAD / degeneracy (pearson_fallback).
If a column has \(\mathrm{MAD}=0\) or the robust denominator becomes zero,
the following rules apply:
"none" when correlations involving that column are NA (diagonal
remains 1).
"hybrid" when only the affected column switches to Pearson standardisation
\(\bar x_a = (x_a - \overline{x}) / \sqrt{\sum_b (x_b - \overline{x})^2}\),
yielding the hybrid correlation
$$\mathrm{bicor}_{\mathrm{hyb}}(x,y) = \sum_a \bar x_a\,\tilde y_a,$$
with the other column still robust-standardised.
"all" when all columns use ordinary Pearson standardisation; the result
equals stats::cor(…, method="pearson") when the NA policy matches.
Handling missing values (na_method).
"error" (default): inputs must be finite; this yields a symmetric,
positive semidefinite (PSD) matrix since \(R = \tilde X^\top \tilde X\).
"pairwise": each \(R_{jk}\) is computed on the intersection of
rows where both columns are finite. Pairs with fewer than 5 overlapping
rows return NA (guarding against instability). Pairwise deletion can
break PSD, as in the Pearson case.
Row weights (w).
When w is supplied (non-negative, length \(m\)), the weighted median
\(\mathrm{med}_w(x)\) and weighted MAD
\(\mathrm{MAD}_w(x) = \mathrm{med}_w(|x - \mathrm{med}_w(x)|)\) are used to form
\(u\). The Tukey weights are then multiplied by the observation weights prior
to normalisation:
$$\tilde x_a =
\frac{(x_a - \mathrm{med}_w(x))\,w_a\,w^{(\mathrm{obs})}_a}{
\sqrt{\sum_b \big[(x_b - \mathrm{med}_w(x))\,w_b\,w^{(\mathrm{obs})}_b\big]^2}},$$
where \(w^{(\mathrm{obs})}_a \ge 0\) are the user-supplied row weights and \(w_a\)
are the Tukey biweights built from the weighted median/MAD. Weighted pairwise
behaves analogously on each column pair's overlap.
MAD choice (mad_consistent).
Setting mad_consistent = TRUE multiplies the raw MAD by 1.4826 inside
\(u\). Equivalently, it uses an effective tuning constant
\(c^\star = c \times 1.4826\). The default FALSE reproduces the convention
in Langfelder & Horvath (2012).
Optional sparsification (sparse_threshold).
If provided, entries with \(|r| < \text{sparse\_threshold}\) are set to 0 and the
result is returned as a "ddiMatrix" (diagonal is forced to 1). This is a
post-processing step that does not alter the per-pair estimates.
Computation and threads.
Columns are robust-standardised in parallel and the matrix is formed as
\(R = \tilde X^\top \tilde X\). n_threads selects the number of OpenMP
threads; by default it uses getOption("matrixCorr.threads", 1L).
Basic properties. \(\mathrm{bicor}(a x + b,\; c y + d) = \mathrm{sign}(ac)\,\mathrm{bicor}(x,y)\). With no missing data (and with per-column hybrid/robust standardisation), the output is symmetric and PSD. As with Pearson, affine equivariance does not hold for the associated biweight midcovariance.
Langfelder, P. & Horvath, S. (2012). Fast R Functions for Robust Correlations and Hierarchical Clustering. Journal of Statistical Software, 46(11), 1–17. tools:::Rd_expr_doi("10.18637/jss.v046.i11")
set.seed(1)
X <- matrix(rnorm(2000 * 40), 2000, 40)
R <- biweight_mid_corr(X, c_const = 9, max_p_outliers = 1,
pearson_fallback = "hybrid")
print(attr(R, "method"))
# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
view_corr_shiny(R)
}
Run the code above in your browser using DataLab