skipped_corr: Pairwise skipped correlation

Description

Computes all pairwise skipped correlation coefficients for the numeric columns of a matrix or data frame using a high-performance 'C++' backend.

Skipped correlation detects bivariate outliers using a projection rule and then computes Pearson or Spearman correlation on the retained observations. It is designed for situations where marginally robust methods can still be distorted by unusual points in the joint data cloud.

Usage

skipped_corr(
  data,
  method = c("pearson", "spearman"),
  na_method = c("error", "pairwise"),
  stand = TRUE,
  outlier_rule = c("idealf", "mad"),
  cutoff = sqrt(stats::qchisq(0.975, df = 2)),
  n_threads = getOption("matrixCorr.threads", 1L),
  return_masks = FALSE,
  ci = FALSE,
  p_value = FALSE,
  conf_level = 0.95,
  n_boot = 2000L,
  p_adjust = c("none", "hochberg", "ecp"),
  fwe_level = 0.05,
  n_mc = 1000L,
  seed = NULL
)
skipped_corr_masks(x, var1 = NULL, var2 = NULL)
# S3 method for skipped_corr
print(
  x,
  digits = 4,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 4,
  show_ci = NULL,
  show_p = c("auto", "yes", "no"),
  ...
)
# S3 method for skipped_corr
plot(
  x,
  title = "Skipped correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  ci_text_size = 3,
  show_value = TRUE,
  ...
)
# S3 method for skipped_corr
summary(
  object,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)
# S3 method for summary.skipped_corr
print(
  x,
  digits = NULL,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

Value

A symmetric correlation matrix with class skipped_corr and attributes method = "skipped_correlation", description, and package = "matrixCorr". When return_masks = TRUE, the matrix also carries a skipped_masks attribute containing compact pairwise skipped-row indices. The diagnostics attribute stores per-pair complete-case counts and skipped-row counts/proportions. When ci = TRUE or p_value = TRUE, bootstrap inference matrices are attached via attributes.

Arguments

data: A numeric matrix or a data frame with at least two numeric columns. All non-numeric columns will be excluded.
method: Correlation computed after removing projected outliers. One of "pearson" (default) or "spearman".
na_method: One of "error" (default) or "pairwise". With "error", the function requires all retained numeric columns to be free of missing or non-finite values and aborts otherwise. This is the recommended setting when you want a single common sample size across all pairs, reproducible skipped-row diagnostics on the same rows, or bootstrap inference via ci = TRUE / p_value = TRUE. With "pairwise", each variable pair is computed on its own overlap of finite rows. This is more permissive for incomplete data, but different pairs can be based on different effective samples and different skipped-row sets, so the resulting matrix is less directly comparable across entries.
stand: Logical; if TRUE (default), each variable in the pair is centred by its median and divided by a robust scale estimate before the projection outlier search. The scale estimate is the MAD when positive, with fallback to $\mathrm{IQR}/1.34898$ and then the usual sample standard deviation if needed. This standardisation affects only outlier detection, not the final correlation computed on the retained observations.
outlier_rule: One of "idealf" (default) or "mad". The default uses the ideal-fourths interquartile width of projected distances; "mad" uses the median absolute deviation of projected distances.
cutoff: Positive numeric constant multiplying the projected spread in the outlier rule $\mathrm{med}(d_{i\cdot}) + cutoff \times s(d_{i\cdot})$. Larger values flag fewer observations as outliers; smaller values flag more. Default sqrt(qchisq(0.975, df = 2)).
n_threads: Integer $\geq 1$. Number of OpenMP threads. Defaults to getOption("matrixCorr.threads", 1L).
return_masks: Logical; if TRUE, attach compact pairwise skipped-row indices as an attribute. Default FALSE.
ci: Logical; if TRUE, attach percentile-bootstrap confidence intervals for each skipped correlation using the Wilcox (2015) B2 resampling scheme. Default FALSE.
p_value: Logical; if TRUE, attach bootstrap p-values for testing whether each skipped correlation is zero. Default FALSE.
conf_level: Confidence level used when ci = TRUE. Default 0.95.
n_boot: Integer $\geq 2$. Number of bootstrap resamples used when ci = TRUE or p_value = TRUE. Default 2000.
p_adjust: One of "none" (default), "hochberg", or "ecp". Optional familywise-error procedure applied to the matrix of bootstrap p-values. "hochberg" corresponds to method H in Wilcox, Rousselet, and Pernet (2018); "ecp" corresponds to their simulated critical-p-value method ECP.
fwe_level: Familywise-error level used when p_adjust = "hochberg" or "ecp". Default 0.05.
n_mc: Integer $\geq 10$. Number of null Monte Carlo data sets used when p_adjust = "ecp" to estimate the critical p-value. Default 1000.
seed: Optional positive integer used to seed the bootstrap resampling when ci = TRUE or p_value = TRUE. If NULL, a fresh internal seed is generated.
x: An object of class summary.skipped_corr.
var1, var2: Optional column names or 1-based column indices used by skipped_corr_masks() to extract the skipped-row indices for one pair.
digits: Integer; number of digits to print.
n: Optional row threshold for compact preview output.
topn: Optional number of leading/trailing rows to show when truncated.
max_vars: Optional maximum number of visible columns; NULL derives this from console width.
width: Optional display width; defaults to getOption("width").
ci_digits: Integer; digits for skipped-correlation confidence limits.
show_ci: One of "yes" or "no".
show_p: One of "auto", "yes", "no". For print(), "auto" keeps the compact matrix-only display; use "yes" to also print pairwise p-values.
...: Additional arguments passed to the underlying print or plot helper.
title: Character; plot title.
low_color, high_color, mid_color: Colors used in the heatmap.
value_text_size: Numeric text size for overlaid cell values.
ci_text_size: Text size for confidence intervals in the heatmap.
show_value: Logical; if TRUE (default), overlay numeric values on the heatmap tiles.
object: An object of class skipped_corr.

Author

Thiago de Paula Oliveira

Details

Let $X \in \mathbb{R}^{n \times p}$ be a numeric matrix with rows as observations and columns as variables. For a given pair of columns $(x, y)$, write the observed bivariate points as $u_i = (x_i, y_i)^\top$, $i=1,\ldots,n$. If stand = TRUE, each margin is first centred by its median and divided by a robust scale estimate before outlier detection; otherwise the original pair is used. The robust scale is the MAD when positive, with fallback to $\mathrm{IQR}/1.34898$ and then the usual sample standard deviation if needed. Let $\tilde u_i$ denote the resulting points and let $c$ be the componentwise median center of the detection cloud.

For each observation $i$, define the direction vector $b_i = \tilde u_i - c$. When $\|b_i\| > 0$, all observations are projected onto the line through $c$ in direction $b_i$. The projected distances are $$ d_{ij} \;=\; \frac{|(\tilde u_j - c)^\top b_i|}{\|b_i\|}, \qquad j=1,\ldots,n. $$ For each direction $i$, observation $j$ is flagged as an outlier if

$$ d_{ij} \;>\; \mathrm{med}(d_{i\cdot}) + g\, s(d_{i\cdot}), \qquad g = \code{cutoff}, $$ where $s(\cdot)$ is either the ideal-fourths interquartile width (outlier_rule = "idealf") or the median absolute deviation (outlier_rule = "mad"). An observation is removed if it is flagged for at least one projection direction. The skipped correlation is then the ordinary Pearson or Spearman correlation computed from the retained observations: $$ r_{\mathrm{skip}}(x,y) \;=\; \mathrm{cor}\!\left(x_{\mathcal{K}}, y_{\mathcal{K}}\right), $$ where $\mathcal{K}$ is the index set of observations not flagged as outliers.

Unlike marginally robust methods such as pbcor(), wincor(), or bicor(), skipped correlation is explicitly pairwise because outlier detection depends on the joint geometry of each variable pair. As a result, the reported matrix need not be positive semidefinite, even with complete data.

Computational notes. In the complete-data path, each column pair requires a full bivariate projection search, so the dominant cost is higher than for marginal robust methods. The implementation evaluates pairs in 'C++'; where available, pairs are processed with 'OpenMP' parallelism. With na_method = "pairwise", each pair is recomputed on its overlap of non-missing rows.

Bootstrap inference. When ci = TRUE or p_value = TRUE, the implementation uses the percentile-bootstrap strategy studied by Wilcox (2015). Each bootstrap replicate resamples whole observation pairs with replacement, reruns the skipped-correlation outlier detection on the resampled data, and recomputes the skipped correlation on the retained observations. This corresponds to Wilcox's B2 method and avoids the statistically unsatisfactory shortcut of removing outliers only once before bootstrapping. Bootstrap inference currently requires complete data (na_method = "error"). When p_adjust = "hochberg", the bootstrap p-values are processed with Hochberg's step-up procedure (method H in Wilcox, Rousselet, and Pernet, 2018). When p_adjust = "ecp", the package follows their ECP method and simulates n_mc null data sets from a $p$-variate normal distribution with identity covariance, recomputes the pairwise bootstrap p-values for each null data set, stores the minimum p-value from each run, and estimates the fwe_level quantile of that null distribution using the Harrell-Davis estimator. Hypotheses are then rejected when their observed bootstrap p-values are less than or equal to the estimated critical p-value. The calibrated H1 procedure from Wilcox, Rousselet, and Pernet (2018) is not currently implemented.

References

Wilcox, R. R. (2004). Inferences based on a skipped correlation coefficient. Journal of Applied Statistics, 31(2), 131-143. tools:::Rd_expr_doi("10.1080/0266476032000148821")

Wilcox, R. R. (2015). Inferences about the skipped correlation coefficient: Dealing with heteroscedasticity and non-normality. Journal of Modern Applied Statistical Methods, 14(1), 172-188. tools:::Rd_expr_doi("10.22237/jmasm/1430453580")

Wilcox, R. R., Rousselet, G. A., & Pernet, C. R. (2018). Improved methods for making inferences about multiple skipped correlations. Journal of Statistical Computation and Simulation, 88(16), 3116-3131. tools:::Rd_expr_doi("10.1080/00949655.2018.1501051")

Examples

Run this code

set.seed(12)
X <- matrix(rnorm(160 * 4), ncol = 4)
X[1, 1] <- 9
X[1, 2] <- -8

R <- skipped_corr(X, method = "pearson")
print(R, digits = 2)
summary(R)
plot(R)

Rm <- skipped_corr(X, method = "pearson", return_masks = TRUE)
skipped_corr_masks(Rm, 1, 2)

# Example 1:
Xm <- as.matrix(datasets::mtcars[, c("mpg", "disp", "hp", "wt")])
Rm2 <- skipped_corr(Xm, method = "spearman")
print(Rm2, digits = 2)

# Example 2:
Ri <- skipped_corr(Xm, method = "pearson", ci = TRUE, n_boot = 40, seed = 1)
Ri$ci

# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  view_corr_shiny(R)
}

Run the code above in your browser using DataLab