pearson_corr: Pairwise Pearson correlation

Description

Computes pairwise Pearson correlations for the numeric columns of a matrix or data frame using a high-performance 'C++' backend. Optional Fisher-z confidence intervals are available.

Usage

pearson_corr(data, check_na = TRUE, ci = FALSE, conf_level = 0.95)
# S3 method for pearson_corr
print(
  x,
  digits = 4,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 3,
  show_ci = NULL,
  ...
)
# S3 method for pearson_corr
plot(
  x,
  title = "Pearson correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  ci_text_size = 3,
  show_value = TRUE,
  ...
)
# S3 method for pearson_corr
summary(
  object,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 3,
  show_ci = NULL,
  ...
)
# S3 method for summary.pearson_corr
print(
  x,
  digits = NULL,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

Value

A symmetric numeric matrix where the (i, j)-th element is the Pearson correlation between the i-th and j-th numeric columns of the input. When ci = TRUE, the object also carries a ci attribute with elements est, lwr.ci, upr.ci, and conf.level. When pairwise-complete evaluation is used, pairwise sample sizes are stored in attr(x, "diagnostics")$n_complete.

The print method invisibly returns the pearson_corr object.

The plot method returns a ggplot object representing the heatmap.

Arguments

data: A numeric matrix or a data frame with at least two numeric columns. All non-numeric columns will be excluded. Each column must have at least two non-missing values.
check_na: Logical (default TRUE). If TRUE, inputs must be free of NA/NaN/Inf. Set to FALSE only when the caller already handled missingness.
ci: Logical (default FALSE). If TRUE, attach pairwise Fisher-$z$ confidence intervals for the off-diagonal Pearson correlations.
conf_level: Confidence level used when ci = TRUE. Default is 0.95.
x: An object of class pearson_corr (for the print and plot methods) or summary.pearson_corr (for the summary print method).
digits: Integer; number of decimal places to print.
n: Optional row threshold for compact preview output.
topn: Optional number of leading/trailing rows to show when truncated.
max_vars: Optional maximum number of visible columns; NULL derives this from console width.
width: Optional display width; defaults to getOption("width").
ci_digits: Integer; digits for Pearson confidence limits in the pairwise summary.
show_ci: One of "yes" or "no".
...: Additional arguments passed to ggplot2::theme() or other ggplot2 layers.
title: Plot title. Default is "Pearson correlation heatmap".
low_color: Color for the minimum correlation. Default is "indianred1".
high_color: Color for the maximum correlation. Default is "steelblue1".
mid_color: Color for zero correlation. Default is "white".
value_text_size: Font size for displaying correlation values. Default is 4.
ci_text_size: Text size for confidence intervals in the heatmap.
show_value: Logical; if TRUE (default), overlay numeric values on the heatmap tiles.
object: An object of class pearson_corr.

Author

Thiago de Paula Oliveira

Details

Let $X \in \mathbb{R}^{n \times p}$ be a numeric matrix with rows as observations and columns as variables, and let $\mathbf{1} \in \mathbb{R}^n$ denote the all-ones vector. Define the vector of column means $\mu = (1/n)\,X^\top \mathbf{1} \in \mathbb{R}^p$ and the centred cross-product matrix $$ S \;=\; (X - \mathbf{1}\mu^\top)^\top (X - \mathbf{1}\mu^\top) \;=\; X^\top \!\Big(I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top\Big) X \;=\; X^\top X \;-\; n\,\mu\,\mu^\top. $$ The (unbiased) sample covariance is $$ \widehat{\Sigma} \;=\; \tfrac{1}{n-1}\,S, $$ and the sample standard deviations are $s_i = \sqrt{\widehat{\Sigma}_{ii}}$. The Pearson correlation matrix is obtained by standardising $\widehat{\Sigma}$: $$ R \;=\; D^{-1/2}\,\widehat{\Sigma}\,D^{-1/2}, \qquad D \;=\; \mathrm{diag}(\widehat{\Sigma}_{11},\ldots,\widehat{\Sigma}_{pp}), $$ or equivalently, entrywise, $R_{ij} = \widehat{\Sigma}_{ij}/(s_i s_j)$ for $i \neq j$ and $R_{ii} = 1$. With $1/(n-1)$ scaling, $\widehat{\Sigma}$ is unbiased for the covariance; the induced correlations are nonetheless biased in finite samples.
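The identities above can be checked numerically in base R. This is a minimal sketch rather than the package's C++ code: the names S_centred, S_proj, S_short and R_manual are illustrative only, and stats::cor() serves as the reference.

```r
set.seed(1)
n <- 50; p <- 4
X <- matrix(rnorm(n * p), n, p)

mu   <- colMeans(X)        # column means (length p)
Xc   <- sweep(X, 2, mu)    # centred data: subtract each column mean
ones <- rep(1, n)

# Three algebraically equivalent forms of the centred cross-product matrix S
S_centred <- crossprod(Xc)
S_proj    <- t(X) %*% (diag(n) - tcrossprod(ones) / n) %*% X
S_short   <- crossprod(X) - n * tcrossprod(mu)

# Unbiased covariance, then standardise with D^{-1/2} on both sides
Sigma_hat  <- S_short / (n - 1)
D_inv_sqrt <- diag(1 / sqrt(diag(Sigma_hat)))
R_manual   <- D_inv_sqrt %*% Sigma_hat %*% D_inv_sqrt

all.equal(S_centred, S_proj)
all.equal(S_centred, S_short)
all.equal(R_manual, cor(X), check.attributes = FALSE)
```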

The implementation forms $X^\top X$ via a BLAS symmetric rank-$k$ update (SYRK) on the upper triangle, then applies the rank-1 correction $-\,n\,\mu\,\mu^\top$ to obtain $S$ without explicitly materialising the centred matrix $X - \mathbf{1}\mu^\top$. After scaling by $1/(n-1)$, triangular normalisation by $D^{-1/2}$ yields $R$, which is then symmetrised to remove round-off asymmetry. Tiny negative values on the covariance diagonal due to floating-point rounding are truncated to zero before taking square roots.
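The same sequence of steps can be sketched in plain R. The helper pearson_sketch() below is hypothetical and not part of the package: crossprod() stands in for the SYRK call, the full matrix is used instead of the upper triangle only, and the input is assumed to be complete with no constant columns.

```r
# Hypothetical helper mirroring the steps described above in plain R.
pearson_sketch <- function(X) {
  n  <- nrow(X)
  mu <- colMeans(X)
  S  <- crossprod(X) - n * tcrossprod(mu)  # X'X followed by the rank-1 correction
  Sigma_hat <- S / (n - 1)                 # scale to the unbiased covariance
  v <- pmax(diag(Sigma_hat), 0)            # truncate tiny negative variances to zero
  s <- sqrt(v)
  R <- Sigma_hat / tcrossprod(s)           # entrywise Sigma_ij / (s_i * s_j)
  R <- (R + t(R)) / 2                      # symmetrise to remove round-off asymmetry
  diag(R) <- 1
  R
}

set.seed(42)
X_demo <- matrix(rnorm(200 * 5), 200, 5)
all.equal(pearson_sketch(X_demo), cor(X_demo))
```

One consequence of this design is worth keeping in mind: subtracting $n\,\mu\,\mu^\top$ from $X^\top X$ avoids an $n \times p$ temporary, but it can lose precision when column means are very large relative to the standard deviations, which is one reason truncating the diagonal before the square root is useful.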

If a variable has zero variance ($s_i = 0$), the corresponding row and column of $R$ are set to NA. When check_na = FALSE, each $(i,j)$ correlation is recomputed on the pairwise complete-case overlap of columns $i$ and $j$.
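Both rules can be illustrated with base R. The sketch below is not the package's code path: stats::cor() with use = "pairwise.complete.obs" is used as a stand-in for the pairwise complete-case evaluation, and the names obs, n_pairs, R_pair and R_zero are illustrative.

```r
set.seed(2)
X <- matrix(rnorm(40 * 3), 40, 3, dimnames = list(NULL, c("a", "b", "c")))
X[1:5, "a"] <- NA
X[4:9, "b"] <- NA

# Pairwise complete-case correlations: each (i, j) uses only the rows
# where both columns are observed
R_pair <- cor(X, use = "pairwise.complete.obs")

# Pairwise sample sizes n_ij from the observation indicator matrix
obs     <- !is.na(X)
n_pairs <- crossprod(obs * 1)

# Manual check for the (a, b) pair on its complete-case overlap
keep <- obs[, "a"] & obs[, "b"]
r_ab <- cor(X[keep, "a"], X[keep, "b"])
all.equal(r_ab, R_pair["a", "b"])

# Zero-variance rule: blank out the row and column of a constant variable
Xz <- cbind(c = X[, "c"], const = 1)
s  <- apply(Xz, 2, sd)
R_zero <- suppressWarnings(cor(Xz))
R_zero[s == 0, ] <- NA
R_zero[, s == 0] <- NA
R_zero
```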

When ci = TRUE, Fisher-$z$ confidence intervals are computed from the observed pairwise Pearson correlation $r_{ij}$ and the pairwise complete-case sample size $n_{ij}$: $$ z_{ij} = \operatorname{atanh}(r_{ij}), \qquad \operatorname{SE}(z_{ij}) = \frac{1}{\sqrt{n_{ij} - 3}}. $$ With $z_{1-\alpha/2} = \Phi^{-1}(1 - \alpha/2)$, the confidence limits are $$ \tanh\!\bigl(z_{ij} - z_{1-\alpha/2}\operatorname{SE}(z_{ij})\bigr) \;\;\text{and}\;\; \tanh\!\bigl(z_{ij} + z_{1-\alpha/2}\operatorname{SE}(z_{ij})\bigr). $$ Confidence intervals are reported only when $n_{ij} > 3$.
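The interval formula can be written out directly in R. The function fisher_ci() below is a hypothetical helper for illustration, not a function exported by the package; it is compared against stats::cor.test(), which uses the same Fisher-$z$ construction for Pearson correlations.

```r
# Hypothetical helper: Fisher-z interval for one correlation r based on n pairs.
fisher_ci <- function(r, n, conf_level = 0.95) {
  if (n <= 3) return(c(lwr = NA_real_, upr = NA_real_))  # undefined for n <= 3
  z  <- atanh(r)
  se <- 1 / sqrt(n - 3)
  q  <- qnorm(1 - (1 - conf_level) / 2)
  c(lwr = tanh(z - q * se), upr = tanh(z + q * se))
}

set.seed(3)
x <- rnorm(60)
y <- 0.6 * x + rnorm(60)
r <- cor(x, y)

ci_manual <- fisher_ci(r, n = 60)
ci_ref    <- cor.test(x, y, conf.level = 0.95)$conf.int
all.equal(unname(ci_manual), as.numeric(ci_ref))
```

Because the interval is built on the $z$ scale and mapped back with $\tanh$, it is generally asymmetric around $r$ and always lies within $(-1, 1)$.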

Computational complexity. The dominant cost is $O(n p^2)$ flops with $O(p^2)$ memory.

References

Pearson, K. (1895). "Note on regression and inheritance in the case of two parents". Proceedings of the Royal Society of London, 58, 240–242.

Examples

## MVN with AR(1) correlation
set.seed(123)
p <- 6; n <- 300; rho <- 0.5
# true correlation
Sigma <- rho^abs(outer(seq_len(p), seq_len(p), "-"))
L <- chol(Sigma)
# MVN(n, 0, Sigma)
X <- matrix(rnorm(n * p), n, p) %*% L
colnames(X) <- paste0("V", seq_len(p))

pr <- pearson_corr(X)
print(pr, digits = 2)
summary(pr)
plot(pr)

## Compare the sample estimate to the truth
Rhat <- cor(X)
# estimated
round(Rhat[1:4, 1:4], 2)
# true
round(Sigma[1:4, 1:4], 2)
off <- upper.tri(Sigma, diag = FALSE)
# MAE on off-diagonals
mean(abs(Rhat[off] - Sigma[off]))

## Larger n reduces sampling error
n2 <- 2000
X2 <- matrix(rnorm(n2 * p), n2, p) %*% L
Rhat2 <- cor(X2)
off <- upper.tri(Sigma, diag = FALSE)
## mean absolute error (MAE) of the off-diagonal correlations
mean(abs(Rhat2[off] - Sigma[off]))

# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  view_corr_shiny(pr)
}
