
matrixCorr (version 0.8.3)

pearson_corr: Pairwise Pearson correlation

Description

Computes all pairwise Pearson correlation coefficients for the numeric columns of a matrix or data frame using a high-performance 'C++' backend.

The Pearson formula is implemented directly in 'C++', which keeps the computation fast and scalable, especially for large datasets.

The print() method prints a summary of the Pearson correlation matrix, including description and method metadata.

The plot() method generates a ggplot2-based heatmap of the Pearson correlation matrix.

Usage

pearson_corr(data, check_na = TRUE)

# S3 method for pearson_corr
print(x, digits = 4, max_rows = NULL, max_cols = NULL, ...)

# S3 method for pearson_corr
plot(
  x,
  title = "Pearson correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  ...
)

Value

pearson_corr() returns a symmetric numeric matrix where the (i, j)-th element is the Pearson correlation between the i-th and j-th numeric columns of the input.

The print() method invisibly returns the pearson_corr object.

The plot() method returns a ggplot object representing the heatmap.

Arguments

data

A numeric matrix or a data frame with at least two numeric columns. All non-numeric columns will be excluded. Each column must have at least two non-missing values.

check_na

Logical (default TRUE). If TRUE, inputs must be free of NA/NaN/Inf. Set to FALSE only when the caller already handled missingness.

x

An object of class pearson_corr.

digits

Integer; number of decimal places to print in the correlation matrix. Default is 4.

max_rows

Optional integer; maximum number of rows to display. If NULL, all rows are shown.

max_cols

Optional integer; maximum number of columns to display. If NULL, all columns are shown.

...

Additional arguments passed to ggplot2::theme() or other ggplot2 layers.

title

Plot title. Default is "Pearson correlation heatmap".

low_color

Color for the minimum correlation. Default is "indianred1".

high_color

Color for the maximum correlation. Default is "steelblue1".

mid_color

Color for zero correlation. Default is "white".

value_text_size

Font size for displaying correlation values. Default is 4.
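
As a quick illustration of the plot arguments above, the following sketch customises the heatmap title, colours, and label size; the simulated data and the chosen colour values are illustrative only.

set.seed(42)
X <- matrix(rnorm(200 * 4), 200, 4,
            dimnames = list(NULL, paste0("V", 1:4)))
pr <- pearson_corr(X)
# all arguments below are documented above; the values are arbitrary choices
plot(pr,
     title = "Custom Pearson heatmap",
     low_color = "tomato",
     high_color = "navy",
     value_text_size = 3)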

Author

Thiago de Paula Oliveira

Details

Let \(X \in \mathbb{R}^{n \times p}\) be a numeric matrix with rows as observations and columns as variables, and let \(\mathbf{1} \in \mathbb{R}^n\) denote the all-ones vector. Define the vector of column means \(\mu = \tfrac{1}{n}\,X^\top \mathbf{1} \in \mathbb{R}^p\) and the centred cross-product matrix
$$ S \;=\; (X - \mathbf{1}\mu^\top)^\top (X - \mathbf{1}\mu^\top) \;=\; X^\top \!\Big(I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top\Big) X \;=\; X^\top X \;-\; n\,\mu\,\mu^\top. $$
The (unbiased) sample covariance is
$$ \widehat{\Sigma} \;=\; \tfrac{1}{n-1}\,S, $$
and the sample standard deviations are \(s_i = \sqrt{\widehat{\Sigma}_{ii}}\). The Pearson correlation matrix is obtained by standardising \(\widehat{\Sigma}\):
$$ R \;=\; D^{-1/2}\,\widehat{\Sigma}\,D^{-1/2}, \qquad D \;=\; \mathrm{diag}(\widehat{\Sigma}_{11},\ldots,\widehat{\Sigma}_{pp}), $$
equivalently, entrywise \(R_{ij} = \widehat{\Sigma}_{ij}/(s_i s_j)\) for \(i \neq j\) and \(R_{ii} = 1\). With \(1/(n-1)\) scaling, \(\widehat{\Sigma}\) is unbiased for the covariance; the induced correlations are biased in finite samples.
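
A quick numerical check of the formula above in base R (using stats::cov() and stats::cor(), not the package backend):

set.seed(1)
X <- matrix(rnorm(50 * 3), 50, 3)
Sigma_hat <- cov(X)                            # (1/(n-1)) * S
D_inv_half <- diag(1 / sqrt(diag(Sigma_hat)))  # D^{-1/2}
R <- D_inv_half %*% Sigma_hat %*% D_inv_half
all.equal(unname(R), unname(cor(X)))           # TRUE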

The implementation forms \(X^\top X\) via a BLAS symmetric rank-\(k\) update (SYRK) on the upper triangle, then applies the rank-1 correction \(-\,n\,\mu\,\mu^\top\) to obtain \(S\) without explicitly materialising \(X - \mathbf{1}\mu^\top\). After scaling by \(1/(n-1)\), triangular normalisation by \(D^{-1/2}\) yields \(R\), which is then symmetrised to remove round-off asymmetry. Tiny negative values on the covariance diagonal due to floating-point rounding are truncated to zero before taking square roots.

If a variable has zero variance (\(s_i = 0\)), the corresponding row and column of \(R\) are set to NA. No missing values are permitted in \(X\); columns must have at least two distinct, non-missing values.
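
The steps in the two paragraphs above can be mirrored in a few lines of base R. This is an illustrative sketch only, not the package's 'C++' code; crossprod() delegates to a general BLAS matrix product rather than a SYRK update on one triangle.

pearson_sketch <- function(X) {
  n     <- nrow(X)
  mu    <- colMeans(X)
  S     <- crossprod(X) - n * tcrossprod(mu)  # X'X with rank-1 mean correction
  Sigma <- S / (n - 1)
  d     <- pmax(diag(Sigma), 0)               # clip tiny negative round-off
  R     <- Sigma / tcrossprod(sqrt(d))        # D^{-1/2} Sigma D^{-1/2}
  R     <- (R + t(R)) / 2                     # symmetrise
  zero_var <- d == 0
  R[zero_var, ] <- NA                         # zero-variance columns -> NA
  R[, zero_var] <- NA
  diag(R) <- ifelse(zero_var, NA_real_, 1)
  R
}

For columns with positive variance the sketch matches stats::cor(X) up to floating-point error.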

Computational complexity. The dominant cost is \(O(n p^2)\) flops with \(O(p^2)\) memory.

References

Pearson, K. (1895). "Notes on regression and inheritance in the case of two parents". Proceedings of the Royal Society of London, 58, 240–242.

See Also

print.pearson_corr, plot.pearson_corr

Examples

## MVN with AR(1) correlation
set.seed(123)
p <- 6; n <- 300; rho <- 0.5
# true correlation
Sigma <- rho^abs(outer(seq_len(p), seq_len(p), "-"))
L <- chol(Sigma)
# MVN(n, 0, Sigma)
X <- matrix(rnorm(n * p), n, p) %*% L
colnames(X) <- paste0("V", seq_len(p))

pr <- pearson_corr(X)
print(pr, digits = 2)
plot(pr)

## Compare the sample estimate to the truth
Rhat <- cor(X)
# estimated
round(Rhat[1:4, 1:4], 2)
# true
round(Sigma[1:4, 1:4], 2)
off <- upper.tri(Sigma, diag = FALSE)
# MAE on off-diagonals
mean(abs(Rhat[off] - Sigma[off]))

## Larger n reduces sampling error
n2 <- 2000
X2 <- matrix(rnorm(n2 * p), n2, p) %*% L
Rhat2 <- cor(X2)
off <- upper.tri(Sigma, diag = FALSE)
## mean absolute error (MAE) of the off-diagonal correlations
mean(abs(Rhat2[off] - Sigma[off]))

# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  view_corr_shiny(pr)
}
