matrixCorr (version 0.10.0)

spearman_rho: Pairwise Spearman's rank correlation

Description

Computes pairwise Spearman's rank correlations for the numeric columns of a matrix or data frame using a high-performance 'C++' backend. Optional confidence intervals are available via a jackknife Euclidean-likelihood method.

Usage

spearman_rho(data, check_na = TRUE, ci = FALSE, conf_level = 0.95)

# S3 method for spearman_rho
print(x, digits = 4, n = NULL, topn = NULL, max_vars = NULL, width = NULL,
  ci_digits = 3, show_ci = NULL, ...)

# S3 method for spearman_rho
plot(x, title = "Spearman's rank correlation heatmap",
  low_color = "indianred1", high_color = "steelblue1", mid_color = "white",
  value_text_size = 4, ci_text_size = 3, show_value = TRUE, ...)

# S3 method for spearman_rho
summary(object, n = NULL, topn = NULL, max_vars = NULL, width = NULL,
  ci_digits = 3, show_ci = NULL, ...)

# S3 method for summary.spearman_rho
print(x, digits = NULL, n = NULL, topn = NULL, max_vars = NULL, width = NULL,
  show_ci = NULL, ...)

Value

A symmetric numeric matrix where the (i, j)-th element is the Spearman correlation between the i-th and j-th numeric columns of the input. When ci = TRUE, the object also carries a ci attribute with elements est, lwr.ci, upr.ci, and conf.level. When pairwise-complete evaluation is used, pairwise sample sizes are stored in attr(x, "diagnostics")$n_complete.

The print method invisibly returns the spearman_rho object.

The plot method returns a ggplot object representing the heatmap.

Arguments

data

A numeric matrix or a data frame with at least two numeric columns. All non-numeric columns will be excluded. Each column must have at least two non-missing values.

check_na

Logical (default TRUE). If TRUE, the input is required to be free of NA/NaN/Inf. Set to FALSE only when the caller already handled missingness.

ci

Logical (default FALSE). If TRUE, attach jackknife Euclidean-likelihood confidence intervals for the off-diagonal Spearman correlations.

conf_level

Confidence level used when ci = TRUE. Default is 0.95.

x

An object of class spearman_rho (for the print and plot methods) or summary.spearman_rho (for the summary print method).

digits

Integer; number of decimal places to print.

n

Optional row threshold for compact preview output.

topn

Optional number of leading/trailing rows to show when truncated.

max_vars

Optional maximum number of visible columns; NULL derives this from console width.

width

Optional display width; defaults to getOption("width").

ci_digits

Integer; digits for Spearman confidence limits in the pairwise summary.

show_ci

One of "yes" or "no".

...

Additional arguments passed to ggplot2::theme() or other ggplot2 layers.

title

Plot title. Default is "Spearman's rank correlation heatmap".

low_color

Color for the minimum rho value. Default is "indianred1".

high_color

Color for the maximum rho value. Default is "steelblue1".

mid_color

Color for zero correlation. Default is "white".

value_text_size

Font size for displaying correlation values. Default is 4.

ci_text_size

Text size for confidence intervals in the heatmap. Default is 3.

show_value

Logical; if TRUE (default), overlay numeric values on the heatmap tiles.

object

An object of class spearman_rho.

Author

Thiago de Paula Oliveira

Details

For each column \(j=1,\ldots,p\), let \(R_{\cdot j} \in [1,n]^n\) denote the (mid-)ranks of \(X_{\cdot j}\), assigning average ranks to ties. The mean rank is \(\bar R_j = (n+1)/2\) regardless of ties. Define the centred rank vectors \(\tilde R_{\cdot j} = R_{\cdot j} - \bar R_j \mathbf{1}\), where \(\mathbf{1}\in\mathbb{R}^n\) is the all-ones vector. The Spearman correlation between columns \(i\) and \(j\) is the Pearson correlation of their rank vectors:

$$ \rho_S(i,j) \;=\; \frac{\sum_{k=1}^n (R_{ki}-\bar R_i)(R_{kj}-\bar R_j)} {\sqrt{\sum_{k=1}^n (R_{ki}-\bar R_i)^2}\; \sqrt{\sum_{k=1}^n (R_{kj}-\bar R_j)^2}}. $$

In matrix form, with \(R=[R_{\cdot 1},\ldots,R_{\cdot p}]\), \(\mu=(n+1)\mathbf{1}_p/2\) for \(\mathbf{1}_p\in\mathbb{R}^p\), and \(S_R=\bigl(R-\mathbf{1}\mu^\top\bigr)^\top \bigl(R-\mathbf{1}\mu^\top\bigr)/(n-1)\), the Spearman correlation matrix is

$$ \widehat{\rho}_S \;=\; D^{-1/2} S_R D^{-1/2}, \qquad D \;=\; \mathrm{diag}(\mathrm{diag}(S_R)). $$

When there are no ties, this reduces to the familiar rank-difference formula

$$ \rho_S(i,j) \;=\; 1 - \frac{6}{n(n^2-1)} \sum_{k=1}^n d_k^2, \quad d_k \;=\; R_{ki}-R_{kj}. $$

This expression does not hold under ties; computing Pearson on mid-ranks (as above) is the standard tie-robust approach. Without ties, \(\mathrm{Var}(R_{\cdot j})=(n^2-1)/12\) (using divisor \(n\)); with ties, the variance is smaller.
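The definitions above can be checked with a minimal base-R sketch. It uses only rank() and cor() from base R, not the package's 'C++' backend, and all variable names are illustrative. It compares Pearson-on-mid-ranks with the rank-difference formula, first without ties and then with ties introduced by rounding.

```r
set.seed(1)
n <- 30
x <- rnorm(n)            # continuous draws, so no ties
y <- x + rnorm(n)

# Spearman as the Pearson correlation of (mid-)ranks
rx <- rank(x, ties.method = "average")
ry <- rank(y, ties.method = "average")
rho_rank <- cor(rx, ry)

# Rank-difference formula; valid here because there are no ties
rho_diff <- 1 - 6 * sum((rx - ry)^2) / (n * (n^2 - 1))

# Rank variance with divisor n equals (n^2 - 1) / 12 when untied
var_untied <- mean((rx - mean(rx))^2)

# Rounding x introduces ties: mid-ranks are no longer integers,
# the rank variance shrinks, and the shortcut formula is no longer exact
xt  <- round(x)
rxt <- rank(xt, ties.method = "average")
rho_rank_tied <- cor(rxt, ry)
rho_diff_tied <- 1 - 6 * sum((rxt - ry)^2) / (n * (n^2 - 1))
var_tied <- mean((rxt - mean(rxt))^2)

c(rho_rank = rho_rank, rho_diff = rho_diff)
c(rho_rank_tied = rho_rank_tied, rho_diff_tied = rho_diff_tied)
```

In the untied case the two estimates agree to floating-point precision; in the tied case only the mid-rank version matches cor(xt, y, method = "spearman").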

\(\rho_S(i,j) \in [-1,1]\) and \(\widehat{\rho}_S\) is symmetric positive semi-definite by construction (up to floating-point error). The implementation symmetrises the result to remove round-off asymmetry. Spearman's correlation is invariant to strictly monotone transformations applied separately to each variable.

Computation. Each column is ranked (mid-ranks) to form \(R\). The product \(R^\top R\) is computed via a 'BLAS' symmetric rank update ('SYRK'), and centred using

$$ (R-\mathbf{1}\mu^\top)^\top (R-\mathbf{1}\mu^\top) \;=\; R^\top R \;-\; n\,\mu\mu^\top, $$

avoiding an explicit centred copy. Division by \(n-1\) yields the sample covariance of ranks; standardising by \(D^{-1/2}\) gives \(\widehat{\rho}_S\). Columns with zero rank variance (all values equal) are returned as NA along their row/column; the corresponding diagonal entry is also NA.
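The centring identity and standardisation step can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's implementation: crossprod() stands in for the 'SYRK' call, the data are simulated without ties or constant columns, and the names R, mu, S_R, and rho_hat simply mirror the symbols above.

```r
set.seed(2)
n <- 50; p <- 4
X <- matrix(rnorm(n * p), n, p)

# Column-wise mid-ranks form R (n x p)
R <- apply(X, 2, rank, ties.method = "average")

# Centred cross-product via R'R - n * mu mu', with no centred copy of R
mu  <- rep((n + 1) / 2, p)
S_R <- (crossprod(R) - n * tcrossprod(mu)) / (n - 1)

# Standardise: D^{-1/2} S_R D^{-1/2}, done elementwise as S_R[i, j] * d[i] * d[j]
d <- 1 / sqrt(diag(S_R))
rho_hat <- S_R * tcrossprod(d)

# Symmetrise to remove any round-off asymmetry
rho_hat <- (rho_hat + t(rho_hat)) / 2

# Largest absolute difference from base R's Spearman matrix
max(abs(rho_hat - cor(X, method = "spearman")))
```

The identity relies on every column of \(R\) summing to \(n(n+1)/2\), which holds for mid-ranks with or without ties.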

When check_na = FALSE, each \((i,j)\) estimate is recomputed on the pairwise complete-case overlap of columns \(i\) and \(j\).

When ci = TRUE, confidence intervals are computed in 'C++' using the jackknife Euclidean-likelihood method of de Carvalho and Marques (2012). For a pairwise estimate \(U = \hat\rho_S\), delete-one jackknife pseudo-values are formed as

$$ Z_i = nU - (n-1)U_{(-i)}, \qquad i = 1,\ldots,n, $$

where \(U_{(-i)}\) is the Spearman correlation after removing observation \(i\). The confidence limits solve

$$ \frac{n(U-\theta)^2}{n^{-1}\sum_{i=1}^n (Z_i - \theta)^2} = \chi^2_{1,\;\texttt{conf\_level}}. $$
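For a single pair of variables, the interval equation above can be solved directly in base R. The sketch below is a literal reading of the displayed formula, not the package's 'C++' code: it assumes \(\chi^2_{1,\;\texttt{conf\_level}}\) denotes the conf_level quantile of a chi-squared distribution with one degree of freedom, it does not truncate the limits to \([-1,1]\), and the names crit, a, b, and c0 are illustrative. Multiplying out the equation gives a quadratic in \(\theta\), whose two roots are the limits.

```r
set.seed(3)
n <- 60
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
conf_level <- 0.95

# Point estimate and delete-one estimates U_(-i)
U <- cor(x, y, method = "spearman")
U_del <- vapply(
  seq_len(n),
  function(i) cor(x[-i], y[-i], method = "spearman"),
  numeric(1)
)

# Jackknife pseudo-values Z_i = n U - (n - 1) U_(-i)
Z <- n * U - (n - 1) * U_del

# Solve n (U - theta)^2 = crit * mean((Z - theta)^2) for theta:
# (n - crit) theta^2 - 2 (n U - crit mean(Z)) theta + (n U^2 - crit mean(Z^2)) = 0
crit <- qchisq(conf_level, df = 1)
a  <- n - crit
b  <- -2 * (n * U - crit * mean(Z))
c0 <- n * U^2 - crit * mean(Z^2)
disc <- sqrt(b^2 - 4 * a * c0)
limits <- sort((-b + c(-1, 1) * disc) / (2 * a))

c(est = U, lwr = limits[1], upr = limits[2])
```

Because the quadratic is negative at \(\theta = U\) and opens upward whenever \(n\) exceeds the chi-squared quantile, the two roots bracket the point estimate.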

Ranking costs \(O\!\bigl(p\,n\log n\bigr)\); forming and normalising \(R^\top R\) costs \(O\!\bigl(n p^2\bigr)\) with \(O(p^2)\) additional memory. The optional jackknife Euclidean-likelihood confidence intervals add per-pair delete-one recomputation work and are intended for inference rather than raw-matrix throughput.

References

Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15(1), 72-101. Reprinted in International Journal of Epidemiology (2010), 39(5), 1137-1150.

de Carvalho, M., & Marques, F. (2012). Jackknife Euclidean likelihood-based inference for Spearman's rho. North American Actuarial Journal, 16(4), 487-492.

See Also

print.spearman_rho, plot.spearman_rho

Examples

library(matrixCorr)

## Monotone transformation invariance (Spearman is rank-based)
set.seed(123)
n <- 400; p <- 6; rho <- 0.6
Sigma <- rho^abs(outer(seq_len(p), seq_len(p), "-"))
L <- chol(Sigma)
X <- matrix(rnorm(n * p), n, p) %*% L
colnames(X) <- paste0("V", seq_len(p))

X_mono <- X
X_mono[, 1] <- exp(X_mono[, 1])
X_mono[, 2] <- log1p(exp(X_mono[, 2]))
X_mono[, 3] <- X_mono[, 3]^3

sp_X <- spearman_rho(X)
sp_m <- spearman_rho(X_mono)
summary(sp_X)
round(max(abs(sp_X - sp_m)), 3)
plot(sp_X)

## Confidence intervals
sp_ci <- spearman_rho(X[, 1:3], ci = TRUE)
print(sp_ci, show_ci = "yes")
summary(sp_ci)

## Ties handled via mid-ranks
tied <- cbind(
  a = rep(1:5, each = 20),
  b = rep(5:1, each = 20) + rnorm(100, sd = 0.1),
  c = as.numeric(gl(10, 10))
)
sp_tied <- spearman_rho(tied, ci = TRUE)
print(sp_tied, digits = 2, show_ci = "yes")

# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  view_corr_shiny(sp_X)
}