
matrixCorr (version 0.10.0)

wincor: Pairwise Winsorized correlation

Description

Computes all pairwise Winsorized correlation coefficients for the numeric columns of a matrix or data frame using a high-performance 'C++' backend.

This function Winsorizes each margin at proportion tr and then computes ordinary Pearson correlation on the Winsorized values. It is a simple robust alternative to Pearson correlation when the main concern is unusually large or small observations in the marginal distributions.

Usage

wincor(
  data,
  tr = 0.2,
  na_method = c("error", "pairwise"),
  n_threads = getOption("matrixCorr.threads", 1L)
)

# S3 method for wincor
print(
  x,
  digits = 4,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

# S3 method for wincor
plot(
  x,
  title = "Winsorized correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  show_value = TRUE,
  ...
)

# S3 method for wincor
summary(
  object,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

Value

A symmetric correlation matrix with class wincor and attributes method = "winsorized_correlation", description, and package = "matrixCorr".

Arguments

data

A numeric matrix or a data frame with at least two numeric columns. All non-numeric columns will be excluded.

tr

Winsorization proportion in [0, 0.5). For a sample of size \(n\), let \(g = \lfloor tr \cdot n \rfloor\); the \(g\) smallest observations are set to the \((g+1)\)-st order statistic and the \(g\) largest observations are set to the \((n-g)\)-th order statistic. Default 0.2.

na_method

One of "error" (default) or "pairwise". With "pairwise", each correlation is computed on the rows where both columns are non-missing.

n_threads

Integer \(\geq 1\). Number of OpenMP threads. Defaults to getOption("matrixCorr.threads", 1L).

x

An object of class wincor.

digits

Integer; number of digits to print.

n

Optional row threshold for compact preview output.

topn

Optional number of leading/trailing rows to show when truncated.

max_vars

Optional maximum number of visible columns; NULL derives this from console width.

width

Optional display width; defaults to getOption("width").

show_ci

One of "yes" or "no".

...

Additional arguments passed to the underlying print or plot helper.

title

Character; plot title.

low_color, high_color, mid_color

Colors used in the heatmap.

value_text_size

Numeric text size for overlaid cell values.

show_value

Logical; if TRUE (default), overlay numeric values on the heatmap tiles.

object

An object of class wincor.

Author

Thiago de Paula Oliveira

Details

Let \(X \in \mathbb{R}^{n \times p}\) be a numeric matrix with rows as observations and columns as variables. For a column \(x = (x_i)_{i=1}^n\), write the order statistics as \(x_{(1)} \le \cdots \le x_{(n)}\) and let \(g = \lfloor tr \cdot n \rfloor\). The Winsorized values can be written as

$$ x_i^{(w)} \;=\; \max\!\bigl\{x_{(g+1)},\, \min(x_i, x_{(n-g)})\bigr\}. $$

For two columns \(x\) and \(y\), the Winsorized correlation is the ordinary Pearson correlation computed from \(x^{(w)}\) and \(y^{(w)}\):

$$ r_w(x,y) \;=\; \frac{\sum_{i=1}^n (x_i^{(w)}-\bar x^{(w)})(y_i^{(w)}-\bar y^{(w)})} {\sqrt{\sum_{i=1}^n (x_i^{(w)}-\bar x^{(w)})^2}\; \sqrt{\sum_{i=1}^n (y_i^{(w)}-\bar y^{(w)})^2}}. $$
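The two formulas above can be sketched in base R. This is only an illustration of the definitions, not the package's 'C++' implementation; the helper names winsorize() and wincor_pair() are hypothetical and are not part of matrixCorr.

```r
# Hypothetical helper: Winsorize a complete numeric vector at proportion tr.
winsorize <- function(x, tr = 0.2) {
  stopifnot(is.numeric(x), !anyNA(x), tr >= 0, tr < 0.5)
  n  <- length(x)
  g  <- floor(tr * n)          # number of observations clamped in each tail
  xs <- sort(x)
  lo <- xs[g + 1]              # (g+1)-st order statistic
  hi <- xs[n - g]              # (n-g)-th order statistic
  pmax(lo, pmin(x, hi))        # max{lo, min(x_i, hi)}
}

# Hypothetical helper: Pearson correlation of the Winsorized values.
wincor_pair <- function(x, y, tr = 0.2) {
  cor(winsorize(x, tr), winsorize(y, tr))
}

# Ten observations whose last pair is a gross outlier in both margins.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
y <- c(2, 1, 4, 3, 6, 5, 8, 7, 10, -50)

# With n = 10 and tr = 0.2, g = 2: values below 3 become 3, values above 8 become 8.
xw <- winsorize(x, tr = 0.2)

r_pearson <- cor(x, y)                    # dominated by the outlying pair
r_wins    <- wincor_pair(x, y, tr = 0.2)  # outlier pulled back to the bulk
```

Here the single extreme pair makes the ordinary Pearson correlation negative, while the Winsorized coefficient stays positive because both coordinates of that pair are clamped before the correlation is computed.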

In matrix form, let \(X^{(w)}\) contain the Winsorized columns and define the centred, unit-norm columns

$$ z_{\cdot j} = \frac{x_{\cdot j}^{(w)} - \bar x_j^{(w)} \mathbf{1}} {\sqrt{\sum_{i=1}^n (x_{ij}^{(w)}-\bar x_j^{(w)})^2}}, \qquad j=1,\ldots,p. $$

If \(Z = [z_{\cdot 1}, \ldots, z_{\cdot p}]\), then the Winsorized correlation matrix is

$$ R_w \;=\; Z^\top Z. $$
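The matrix form can be checked numerically with a short base R sketch. The helper winsorize() is hypothetical (it is not exported by matrixCorr), and the code only mirrors the algebra above for complete data.

```r
# Hypothetical helper: Winsorize a complete numeric vector at proportion tr.
winsorize <- function(x, tr = 0.2) {
  n  <- length(x)
  g  <- floor(tr * n)
  xs <- sort(x)
  pmax(xs[g + 1], pmin(x, xs[n - g]))
}

set.seed(1)
X  <- matrix(rnorm(50 * 3), ncol = 3)

Xw <- apply(X, 2, winsorize, tr = 0.2)        # Winsorize each column separately
Xc <- sweep(Xw, 2, colMeans(Xw))              # centre each column
Z  <- sweep(Xc, 2, sqrt(colSums(Xc^2)), "/")  # scale each column to unit norm

R_w <- crossprod(Z)                           # Z^T Z

# R_w should agree with Pearson correlation of the Winsorized columns.
max_abs_diff <- max(abs(R_w - cor(Xw)))
```

Because every column of Z has unit norm, the diagonal of R_w is 1, and because R_w is a Gram matrix it is positive semidefinite by construction.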

Winsorization acts on each margin separately, so it guards against marginal outliers and heavy tails but does not target unusual points in the joint cloud. This implementation Winsorizes each column in 'C++', centres and normalises it, and forms the complete-data matrix from cross-products. With na_method = "pairwise", each pair is recomputed on its overlap of non-missing rows. As with Pearson correlation, the complete-data path yields a symmetric positive semidefinite matrix, whereas pairwise deletion can break positive semidefiniteness.
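A small constructed example shows how pairwise deletion can break positive semidefiniteness. The sketch below is one reading of the pairwise path (restrict each pair to its overlap, then Winsorize and correlate); the helpers winsorize() and pairwise_wincor() are hypothetical and may differ from what the package does internally.

```r
# Hypothetical helper: Winsorize a complete numeric vector at proportion tr.
winsorize <- function(x, tr = 0.2) {
  n  <- length(x)
  g  <- floor(tr * n)
  xs <- sort(x)
  pmax(xs[g + 1], pmin(x, xs[n - g]))
}

# Hypothetical helper: each pair uses only rows where both columns are observed.
pairwise_wincor <- function(X, tr = 0.2) {
  p <- ncol(X)
  R <- diag(p)
  for (j in seq_len(p - 1)) {
    for (k in (j + 1):p) {
      ok <- complete.cases(X[, c(j, k)])
      R[j, k] <- R[k, j] <- cor(winsorize(X[ok, j], tr), winsorize(X[ok, k], tr))
    }
  }
  R
}

# Three variables observed on three disjoint blocks of rows.
# Each overlap has n = 3, so g = floor(0.2 * 3) = 0 and no value is clamped.
X <- cbind(
  x = c(1, 2, 3, NA, NA, NA, 1, 2, 3),
  y = c(1, 2, 3, 1, 2, 3, NA, NA, NA),
  z = c(NA, NA, NA, 1, 2, 3, 3, 2, 1)
)

R <- pairwise_wincor(X, tr = 0.2)

# x and y agree perfectly, y and z agree perfectly, yet x and z are perfectly
# opposed; no single joint distribution has these correlations.
min_eig <- min(eigen(R, symmetric = TRUE)$values)  # negative, so R is not PSD
```

The resulting matrix is symmetric with a unit diagonal, but its smallest eigenvalue is negative, which is why downstream methods that require a valid correlation matrix should prefer the complete-data path or repair the pairwise result first.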

Computational complexity. In the complete-data path, Winsorizing the columns requires sorting within each column, which with a comparison sort costs \(O(n \log n)\) per column and \(O(p\, n \log n)\) overall, and forming the cross-product matrix costs \(O(n p^2)\) with \(O(p^2)\) output storage.

References

Wilcox, R. R. (1993). Some results on a Winsorized correlation coefficient. British Journal of Mathematical and Statistical Psychology, 46(2), 339-349. doi:10.1111/j.2044-8317.1993.tb01020.x

Wilcox, R. R. (2012). Introduction to Robust Estimation and Hypothesis Testing (3rd ed.). Academic Press.

See Also

pbcor(), skipped_corr(), bicor()

Examples

set.seed(11)
X <- matrix(rnorm(180 * 4), ncol = 4)
# Shift six randomly chosen entries downward to create marginal outliers
idx <- sample(length(X), 6)
X[idx] <- X[idx] - 12

R <- wincor(X, tr = 0.2)
print(R, digits = 2)
summary(R)
plot(R)

# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  view_corr_shiny(R)
}
