Let \(X \in \mathbb{R}^{n \times p}\) be a
numeric matrix with rows as observations and columns as variables, and let
\(\mathbf{1} \in \mathbb{R}^n\) denote the all-ones vector. Define the column
means \(\mu = (1/n)\,X^\top \mathbf{1} \in \mathbb{R}^p\) and the centred
cross-product matrix
$$ S \;=\; (X - \mathbf{1}\mu^\top)^\top (X - \mathbf{1}\mu^\top)
\;=\; X^\top \!\Big(I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top\Big) X
\;=\; X^\top X \;-\; n\,\mu\,\mu^\top. $$
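The equivalence of the first and last forms can be checked numerically. The following is an illustrative pure-Python sketch on a tiny matrix (the names `matmul_t`, `S_direct`, and `S_corrected` are ours, not part of any implementation); a real implementation would use NumPy or BLAS.

```python
def matmul_t(A, B):
    """Return A^T B for matrices stored as lists of rows."""
    n, p, q = len(A), len(A[0]), len(B[0])
    return [[sum(A[k][i] * B[k][j] for k in range(n)) for j in range(q)]
            for i in range(p)]

X = [[1.0, 2.0],
     [3.0, 5.0],
     [5.0, 4.0]]
n, p = len(X), len(X[0])
mu = [sum(row[j] for row in X) / n for j in range(p)]   # column means

# Direct form: centre the columns, then take the cross-product.
Xc = [[X[i][j] - mu[j] for j in range(p)] for i in range(n)]
S_direct = matmul_t(Xc, Xc)

# Corrected form: X^T X minus the rank-1 term n * mu * mu^T.
XtX = matmul_t(X, X)
S_corrected = [[XtX[i][j] - n * mu[i] * mu[j] for j in range(p)]
               for i in range(p)]
```

Both routes produce the same \(S\) up to floating-point round-off, which is what lets an implementation avoid materialising the centred matrix.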
The (unbiased) sample covariance is
$$ \widehat{\Sigma} \;=\; \tfrac{1}{n-1}\,S, $$
and the sample standard deviations are \(s_i = \sqrt{\widehat{\Sigma}_{ii}}\).
The Pearson correlation matrix is obtained by standardising \(\widehat{\Sigma}\):
$$ R \;=\; D^{-1/2}\,\widehat{\Sigma}\,D^{-1/2}, \qquad
D \;=\; \mathrm{diag}(\widehat{\Sigma}_{11},\ldots,\widehat{\Sigma}_{pp}), $$
equivalently, entrywise \(R_{ij} = \widehat{\Sigma}_{ij}/(s_i s_j)\) for
\(i \neq j\) and \(R_{ii} = 1\). Although the \(1/(n-1)\) scaling makes
\(\widehat{\Sigma}\) unbiased for the covariance, the induced correlations
remain biased in finite samples, since each \(R_{ij}\) is a nonlinear
function of the covariance entries.
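The entrywise definitions above can be sketched directly in pure Python on a small dataset (the variable names are illustrative only):

```python
import math

X = [[1.0, 2.0],
     [3.0, 5.0],
     [5.0, 4.0]]
n, p = len(X), len(X[0])
mu = [sum(row[j] for row in X) / n for j in range(p)]

# Unbiased sample covariance: Sigma_ij = S_ij / (n - 1).
Sigma = [[sum((X[k][i] - mu[i]) * (X[k][j] - mu[j]) for k in range(n)) / (n - 1)
          for j in range(p)] for i in range(p)]

# Standard deviations s_i and Pearson correlations R_ij = Sigma_ij / (s_i s_j).
s = [math.sqrt(Sigma[i][i]) for i in range(p)]
R = [[1.0 if i == j else Sigma[i][j] / (s[i] * s[j]) for j in range(p)]
     for i in range(p)]
```

Here \(R_{12} = \widehat{\Sigma}_{12}/(s_1 s_2)\), and the diagonal is set to exactly 1 rather than recomputed, matching \(R_{ii} = 1\).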
The implementation forms \(X^\top X\) via a BLAS
symmetric rank-\(k\) update (SYRK) on the upper triangle, then applies the
rank-1 correction \(-\,n\,\mu\,\mu^\top\) (with \(\mu\) treated as a column
vector) to obtain \(S\) without explicitly materialising the centred
matrix. After scaling by
\(1/(n-1)\), triangular normalisation by \(D^{-1/2}\) yields \(R\),
which is then symmetrised to remove round-off asymmetry. Tiny negative values
on the covariance diagonal due to floating-point rounding are truncated to
zero before taking square roots.
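The compute path described above can be sketched as follows. This is a minimal pure-Python illustration of the same sequence of steps (upper-triangular accumulation standing in for SYRK, rank-1 correction, diagonal clamping, normalisation, symmetrisation); the function name `correlation_upper` is ours, and the sketch omits the zero-variance NA handling described below.

```python
import math

def correlation_upper(X):
    n, p = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / n for j in range(p)]

    # Upper triangle of X^T X, as a SYRK-style update would produce.
    C = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i, p):
            C[i][j] = sum(X[k][i] * X[k][j] for k in range(n))

    # Rank-1 correction and scaling: C <- (X^T X - n mu mu^T) / (n - 1).
    for i in range(p):
        for j in range(i, p):
            C[i][j] = (C[i][j] - n * mu[i] * mu[j]) / (n - 1)

    # Clamp tiny negative diagonal round-off to zero before square roots.
    d = [math.sqrt(max(C[i][i], 0.0)) for i in range(p)]

    # Triangular normalisation by D^{-1/2}, unit diagonal.
    for i in range(p):
        for j in range(i, p):
            C[i][j] = 1.0 if i == j else C[i][j] / (d[i] * d[j])

    # Symmetrise: copy the upper triangle into the lower.
    for i in range(p):
        for j in range(i + 1, p):
            C[j][i] = C[i][j]
    return C
```

Because only the upper triangle is computed and then mirrored, round-off cannot introduce asymmetry between \(R_{ij}\) and \(R_{ji}\).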
If a variable has zero variance (\(s_i = 0\)), the corresponding row and
column of \(R\) are set to NA. Missing values are not permitted in
\(X\), and each column must contain at least two distinct values.
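As a hypothetical illustration of the zero-variance rule, the sketch below marks the affected row and column with NaN (standing in for NA) when a column is constant; the layout of the loop is ours, not the actual implementation.

```python
import math

X = [[1.0, 7.0],
     [3.0, 7.0],
     [5.0, 7.0]]   # second column is constant, so s_2 = 0
n, p = len(X), len(X[0])
mu = [sum(row[j] for row in X) / n for j in range(p)]
var = [sum((X[k][i] - mu[i]) ** 2 for k in range(n)) / (n - 1) for i in range(p)]
s = [math.sqrt(max(v, 0.0)) for v in var]

# Start from all-NaN; fill only entries where both variances are positive.
R = [[float('nan')] * p for _ in range(p)]
for i in range(p):
    for j in range(p):
        if s[i] > 0.0 and s[j] > 0.0:
            cov = sum((X[k][i] - mu[i]) * (X[k][j] - mu[j])
                      for k in range(n)) / (n - 1)
            R[i][j] = 1.0 if i == j else cov / (s[i] * s[j])
```

The entire second row and column of `R`, including the diagonal entry, end up NaN, while the remaining diagonal entry stays 1.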
Computational complexity. The dominant cost is the \(O(n p^2)\) flops of
the \(X^\top X\) update, with \(O(p^2)\) additional memory for the
\(p \times p\) output.