spearman_rho: Pairwise Spearman's rank correlation

Description

Computes pairwise Spearman's rank correlations for the numeric columns of a matrix or data frame using a high-performance 'C++' backend. Optional confidence intervals are available via a jackknife Euclidean-likelihood method.

Usage

spearman_rho(data, check_na = TRUE, ci = FALSE, conf_level = 0.95)
# S3 method for spearman_rho
print(
  x,
  digits = 4,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 3,
  show_ci = NULL,
  ...
)
# S3 method for spearman_rho
plot(
  x,
  title = "Spearman's rank correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  ci_text_size = 3,
  show_value = TRUE,
  ...
)
# S3 method for spearman_rho
summary(
  object,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 3,
  show_ci = NULL,
  ...
)
# S3 method for summary.spearman_rho
print(
  x,
  digits = NULL,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

Value

A symmetric numeric matrix where the (i, j)-th element is the Spearman correlation between the i-th and j-th numeric columns of the input. When ci = TRUE, the object also carries a ci attribute with elements est, lwr.ci, upr.ci, and conf.level. When pairwise-complete evaluation is used, pairwise sample sizes are stored in attr(x, "diagnostics")$n_complete.

The print method invisibly returns the spearman_rho object.

The plot method returns a ggplot object representing the heatmap.

Arguments

data: A numeric matrix or a data frame with at least two numeric columns. All non-numeric columns will be excluded. Each column must have at least two non-missing values.
check_na: Logical (default TRUE). If TRUE, the input is required to be free of NA/NaN/Inf. Set to FALSE only when the caller already handled missingness.
ci: Logical (default FALSE). If TRUE, attach jackknife Euclidean-likelihood confidence intervals for the off-diagonal Spearman correlations.
conf_level: Confidence level used when ci = TRUE. Default is 0.95.
x: An object of class spearman_rho (print and plot methods) or summary.spearman_rho (print method for summaries).
digits: Integer; number of decimal places to print.
n: Optional row threshold for compact preview output.
topn: Optional number of leading/trailing rows to show when truncated.
max_vars: Optional maximum number of visible columns; NULL derives this from console width.
width: Optional display width; defaults to getOption("width").
ci_digits: Integer; digits for Spearman confidence limits in the pairwise summary.
show_ci: One of "yes" or "no".
...: Additional arguments passed to ggplot2::theme() or other ggplot2 layers.
title: Plot title. Default is "Spearman's rank correlation heatmap".
low_color: Color for the minimum rho value. Default is "indianred1".
high_color: Color for the maximum rho value. Default is "steelblue1".
mid_color: Color for zero correlation. Default is "white".
value_text_size: Font size for displaying correlation values. Default is 4.
ci_text_size: Text size for confidence intervals in the heatmap. Default is 3.
show_value: Logical; if TRUE (default), overlay numeric values on the heatmap tiles.
object: An object of class spearman_rho.

Author

Thiago de Paula Oliveira

Details

For each column $j=1,\ldots,p$, let $R_{\cdot j} \in [1,n]^n$ denote the (mid-)ranks of $X_{\cdot j}$, assigning average ranks to ties (so tied entries may take half-integer values). The mean rank is $\bar R_j = (n+1)/2$ regardless of ties. Define the centred rank vectors $\tilde R_{\cdot j} = R_{\cdot j} - \bar R_j \mathbf{1}$, where $\mathbf{1}\in\mathbb{R}^n$ is the all-ones vector. The Spearman correlation between columns $i$ and $j$ is the Pearson correlation of their rank vectors: $$ \rho_S(i,j) \;=\; \frac{\sum_{k=1}^n (R_{ki}-\bar R_i)(R_{kj}-\bar R_j)} {\sqrt{\sum_{k=1}^n (R_{ki}-\bar R_i)^2}\; \sqrt{\sum_{k=1}^n (R_{kj}-\bar R_j)^2}}. $$ In matrix form, with $R=[R_{\cdot 1},\ldots,R_{\cdot p}]$, $\mu=(n+1)\mathbf{1}_p/2$ for $\mathbf{1}_p\in\mathbb{R}^p$, and $S_R=\bigl(R-\mathbf{1}\mu^\top\bigr)^\top \bigl(R-\mathbf{1}\mu^\top\bigr)/(n-1)$, the Spearman correlation matrix is $$ \widehat{\rho}_S \;=\; D^{-1/2} S_R D^{-1/2}, \qquad D \;=\; \mathrm{diag}(\mathrm{diag}(S_R)). $$ When there are no ties, the familiar rank-difference formula holds: $$ \rho_S(i,j) \;=\; 1 - \frac{6}{n(n^2-1)} \sum_{k=1}^n d_k^2, \quad d_k \;=\; R_{ki}-R_{kj}, $$ but this expression does not hold under ties; computing Pearson on mid-ranks (as above) is the standard tie-robust approach. Without ties, the sample variance of the ranks (divisor $n-1$, as in $S_R$) is $n(n+1)/12$, or equivalently $(n^2-1)/12$ with divisor $n$; with ties, the variance is smaller.
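The definitions above can be checked with a minimal base-R sketch. It uses only stats::cor() and rank(), not this package's 'C++' backend, and the simulated vectors x, y, xt and yt are illustrative assumptions.

```r
# Base-R sketch (not the package backend): Spearman's rho as the Pearson
# correlation of mid-ranks, compared with the rank-difference formula.
set.seed(1)
n <- 50
x <- rnorm(n)
y <- x + rnorm(n)

# Continuous data: no ties, so ranks are a permutation of 1..n
rx <- rank(x, ties.method = "average")
ry <- rank(y, ties.method = "average")

rho_rank <- cor(rx, ry)                     # Pearson on ranks
rho_ref  <- cor(x, y, method = "spearman")  # reference value from stats
rho_d    <- 1 - 6 * sum((rx - ry)^2) / (n * (n^2 - 1))  # valid without ties

# Rounded data: many ties, so mid-ranks are used
xt  <- round(x)
yt  <- round(y)
rxt <- rank(xt, ties.method = "average")
ryt <- rank(yt, ties.method = "average")

rho_tied   <- cor(rxt, ryt)                 # tie-robust estimate
rho_d_tied <- 1 - 6 * sum((rxt - ryt)^2) / (n * (n^2 - 1))  # not valid here

c(mean_rank = mean(rxt), var_no_ties = var(rx), var_ties = var(rxt))
```

Without ties the three estimates rho_rank, rho_ref and rho_d coincide. With ties the mean rank is still (n + 1) / 2, the rank variance shrinks, and rho_d_tied should not be used in place of rho_tied.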

$\rho_S(i,j) \in [-1,1]$ and, when every entry is computed from the same complete set of rows, $\widehat{\rho}_S$ is symmetric positive semi-definite by construction (up to floating-point error); matrices assembled from pairwise complete cases need not be positive semi-definite. The implementation symmetrises the result to remove round-off asymmetry. Spearman's correlation is invariant to strictly increasing transformations applied separately to each variable.

Computation. Each column is ranked (mid-ranks) to form $R$. The product $R^\top R$ is computed via a 'BLAS' symmetric rank update ('SYRK'), and centred using $$ (R-\mathbf{1}\mu^\top)^\top (R-\mathbf{1}\mu^\top) \;=\; R^\top R \;-\; n\,\mu\mu^\top, $$ avoiding an explicit centred copy. Division by $n-1$ yields the sample covariance of ranks; standardising by $D^{-1/2}$ gives $\widehat{\rho}_S$. Columns with zero rank variance (all values equal) are returned as NA along their row/column; the corresponding diagonal entry is also NA.
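The centring identity and the standardisation step can be mirrored in a few lines of base R. This is a sketch under the assumption of complete data with no constant columns; crossprod() stands in for the 'SYRK' call, and the variable names are illustrative rather than taken from the package source.

```r
# Base-R sketch of the computation described above (assumes complete data
# and no constant columns; not the package's 'C++' code).
set.seed(2)
n <- 30
p <- 4
X <- matrix(rnorm(n * p), n, p)

# Column-wise mid-ranks
R <- apply(X, 2, rank, ties.method = "average")

# Every column of R has mean (n + 1) / 2
mu <- rep((n + 1) / 2, p)

# Centred cross-product without forming a centred copy of R:
# (R - 1 mu^T)^T (R - 1 mu^T) = R^T R - n mu mu^T
G   <- crossprod(R)
S_R <- (G - n * tcrossprod(mu)) / (n - 1)

# Standardise: D^{-1/2} S_R D^{-1/2}
d       <- sqrt(diag(S_R))
rho_hat <- S_R / tcrossprod(d)

# Remove round-off asymmetry
rho_hat <- (rho_hat + t(rho_hat)) / 2

max(abs(rho_hat - cor(X, method = "spearman")))
```

The final line reports the largest absolute difference from stats::cor(), which should be at the level of floating-point error.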

When check_na = FALSE, each $(i,j)$ estimate is recomputed on the pairwise complete-case overlap of columns $i$ and $j$. When ci = TRUE, confidence intervals are computed in 'C++' using the jackknife Euclidean-likelihood method of de Carvalho and Marques (2012). For a pairwise estimate $U = \hat\rho_S$, delete-one jackknife pseudo-values are formed as $$ Z_i = nU - (n-1)U_{(-i)}, \qquad i = 1,\ldots,n, $$ where $U_{(-i)}$ is the Spearman correlation after removing observation $i$. The confidence limits solve $$ \frac{n(U-\theta)^2}{n^{-1}\sum_{i=1}^n (Z_i - \theta)^2} = \chi^2_{1,\;\texttt{conf\_level}}. $$
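For a single pair of variables, the pseudo-values and the resulting limits can be sketched in base R. This is an illustration, not the package's 'C++' routine. It assumes that $\chi^2_{1,\;\texttt{conf\_level}}$ denotes the conf_level quantile of the chi-squared distribution with one degree of freedom, and it solves the defining equation as a quadratic in $\theta$. Whether the package clips the limits to $[-1, 1]$ is not stated above, so no clipping is applied here.

```r
# Base-R sketch of the jackknife Euclidean-likelihood interval for one pair.
# Assumption: the critical value is qchisq(conf_level, df = 1).
set.seed(3)
n <- 60
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
conf_level <- 0.95

# Full-sample estimate and delete-one estimates
U     <- cor(x, y, method = "spearman")
U_del <- vapply(
  seq_len(n),
  function(i) cor(x[-i], y[-i], method = "spearman"),
  numeric(1)
)

# Jackknife pseudo-values Z_i = n U - (n - 1) U_(-i)
Z <- n * U - (n - 1) * U_del

# n (U - theta)^2 = q * mean((Z - theta)^2) rearranges to
# a theta^2 + b theta + c0 = 0 with the coefficients below
q    <- qchisq(conf_level, df = 1)
a    <- n - q
b    <- -2 * (n * U - q * mean(Z))
c0   <- n * U^2 - q * mean(Z^2)
disc <- b^2 - 4 * a * c0

limits <- sort((-b + c(-1, 1) * sqrt(disc)) / (2 * a))
c(est = U, lwr.ci = limits[1], upr.ci = limits[2])
```

The loop recomputes the correlation n times per pair, which is why the intervals cost noticeably more than the point estimates.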

Ranking costs $O\!\bigl(p\,n\log n\bigr)$; forming and normalising $R^\top R$ costs $O\!\bigl(n p^2\bigr)$ with $O(p^2)$ additional memory. The optional jackknife Euclidean-likelihood confidence intervals add per-pair delete-one recomputation work and are intended for inference rather than raw-matrix throughput.

References

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1), 72-101. Reprinted (2010) in International Journal of Epidemiology, 39(5), 1137-1150.

de Carvalho, M., & Marques, F. (2012). Jackknife Euclidean likelihood-based inference for Spearman's rho. North American Actuarial Journal, 16(4), 487-492.

Examples

## Monotone transformation invariance (Spearman is rank-based)
set.seed(123)
n <- 400; p <- 6; rho <- 0.6
Sigma <- rho^abs(outer(seq_len(p), seq_len(p), "-"))
L <- chol(Sigma)
X <- matrix(rnorm(n * p), n, p) %*% L
colnames(X) <- paste0("V", seq_len(p))

X_mono <- X
X_mono[, 1] <- exp(X_mono[, 1])
X_mono[, 2] <- log1p(exp(X_mono[, 2]))
X_mono[, 3] <- X_mono[, 3]^3

sp_X <- spearman_rho(X)
sp_m <- spearman_rho(X_mono)
summary(sp_X)
round(max(abs(sp_X - sp_m)), 3)
plot(sp_X)

## Confidence intervals
sp_ci <- spearman_rho(X[, 1:3], ci = TRUE)
print(sp_ci, show_ci = "yes")
summary(sp_ci)

## Ties handled via mid-ranks
tied <- cbind(
  a = rep(1:5, each = 20),
  b = rep(5:1, each = 20) + rnorm(100, sd = 0.1),
  c = as.numeric(gl(10, 10))
)
sp_tied <- spearman_rho(tied, ci = TRUE)
print(sp_tied, digits = 2, show_ci = "yes")

# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  view_corr_shiny(sp_X)
}
