kendall_tau: Pairwise (or Two-Vector) Kendall's Tau Rank Correlation

Description

Computes pairwise Kendall's tau correlations for numeric data using a high-performance 'C++' backend. Optional confidence intervals are available for matrix and data-frame input.

Usage

kendall_tau(
  data,
  y = NULL,
  check_na = TRUE,
  ci = FALSE,
  conf_level = 0.95,
  ci_method = c("fieller", "if_el", "brown_benedetti")
)
# S3 method for kendall_matrix
print(
  x,
  digits = 4,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 3,
  show_ci = NULL,
  ...
)
# S3 method for kendall_matrix
plot(
  x,
  title = "Kendall's Tau correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  ci_text_size = 3,
  show_value = TRUE,
  ...
)
# S3 method for kendall_matrix
summary(
  object,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 3,
  show_ci = NULL,
  ...
)
# S3 method for summary.kendall_matrix
print(
  x,
  digits = NULL,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

Value

If y is NULL and data is a matrix/data frame: a symmetric numeric matrix where entry (i, j) is the Kendall's tau correlation between the i-th and j-th numeric columns. When ci = TRUE, the object also carries a ci attribute with elements est, lwr.ci, upr.ci, conf.level, and ci.method. Pairwise complete-case sample sizes are stored in attr(x, "diagnostics")$n_complete.
If y is provided (two-vector mode): a single numeric scalar, the Kendall's tau correlation between data and y.

For the print method: invisibly returns the kendall_matrix object.

For the plot method: a ggplot object representing the heatmap.

Arguments

data: For matrix/data frame mode, a numeric matrix or a data frame with at least two numeric columns. All non-numeric columns are excluded. For two-vector mode, a numeric vector x.
y: Optional numeric vector y of the same length as data when data is a vector. If supplied, the function computes the Kendall correlation between data and y using a low-overhead scalar path and returns a single number.
check_na: Logical (default TRUE). If TRUE, inputs must be free of missing/undefined values. Use FALSE only when missingness has already been handled upstream.
ci: Logical (default FALSE). If TRUE, attach pairwise confidence intervals for the off-diagonal Kendall correlations in matrix/data-frame mode.
conf_level: Confidence level used when ci = TRUE. Default is 0.95.
ci_method: Confidence-interval engine used when ci = TRUE. Supported Kendall methods are "fieller" (default), "brown_benedetti", and "if_el".
x: An object of class kendall_matrix (print and plot methods) or summary.kendall_matrix (print method for summaries).
digits: Integer; number of decimal places to print.
n: Optional row threshold for compact preview output.
topn: Optional number of leading/trailing rows to show when truncated.
max_vars: Optional maximum number of visible columns; NULL derives this from console width.
width: Optional display width; defaults to getOption("width").
ci_digits: Integer; digits for Kendall confidence limits in the pairwise summary.
show_ci: One of "yes" or "no".
...: Additional arguments passed to ggplot2::theme() or other ggplot2 layers.
title: Plot title. Default is "Kendall's Tau correlation heatmap".
low_color: Color for the minimum tau value. Default is "indianred1".
high_color: Color for the maximum tau value. Default is "steelblue1".
mid_color: Color for zero correlation. Default is "white".
value_text_size: Font size for displaying correlation values. Default is 4.
ci_text_size: Text size for confidence intervals in the heatmap.
show_value: Logical; if TRUE (default), overlay numeric values on the heatmap tiles.
object: An object of class kendall_matrix.

Author

Thiago de Paula Oliveira

Details

Kendall's tau is a rank-based measure of association between two variables. For a dataset with $n$ observations on variables $X$ and $Y$, let $n_0 = n(n - 1)/2$ be the number of unordered pairs, $C$ the number of concordant pairs, and $D$ the number of discordant pairs. Let $T_x = \sum_g t_g (t_g - 1)/2$ and $T_y = \sum_h u_h (u_h - 1)/2$ be the numbers of tied pairs within $X$ and within $Y$, respectively, where $t_g$ and $u_h$ are tie-group sizes in $X$ and $Y$.

The tie-robust Kendall's tau-b is: $$ \tau_b = \frac{C - D}{\sqrt{(n_0 - T_x)\,(n_0 - T_y)}}. $$ When there are no ties ($T_x = T_y = 0$), this reduces to tau-a: $$ \tau_a = \frac{C - D}{n(n-1)/2}. $$

The function automatically handles ties. In degenerate cases where a variable is constant ($n_0 = T_x$ or $n_0 = T_y$), the tau-b denominator is zero and the correlation is undefined (returned as NA off the diagonal).
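The definitions above can be checked with a naive $O(n^2)$ computation in base R. This is only an illustrative sketch: tau_b_naive is a hypothetical helper, not part of the package and not how the 'C++' backend computes the estimate. It counts $C - D$, $T_x$ and $T_y$ directly from pairwise signs and returns NA when the tau-b denominator is zero.

```r
# Hypothetical helper for illustration; not part of the package.
tau_b_naive <- function(x, y) {
  n  <- length(x)
  dx <- sign(outer(x, x, "-"))   # sign of x_i - x_j for every ordered pair
  dy <- sign(outer(y, y, "-"))   # sign of y_i - y_j for every ordered pair
  up <- upper.tri(dx)            # keep each unordered pair exactly once
  s  <- sum(dx[up] * dy[up])     # C - D (pairs tied in x or y contribute 0)
  n0 <- n * (n - 1) / 2          # number of unordered pairs
  tx <- sum(dx[up] == 0)         # T_x: pairs tied in x
  ty <- sum(dy[up] == 0)         # T_y: pairs tied in y
  denom <- sqrt((n0 - tx) * (n0 - ty))
  if (denom == 0) return(NA_real_)  # constant variable: tau-b undefined
  s / denom
}

x <- c(1, 2, 2, 3, 4, 4, 5)
y <- c(2, 1, 3, 3, 5, 4, 4)
tau_b_naive(x, y)
cor(x, y, method = "kendall")    # base R computes the same tie-corrected statistic
tau_b_naive(rep(1, 5), 1:5)      # degenerate case
```

The agreement with stats::cor() is a convenient sanity check; the quadratic cost makes this form unsuitable for large $n$.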

When check_na = FALSE, each $(i,j)$ estimate is recomputed on the pairwise complete-case overlap of columns $i$ and $j$. Confidence intervals use the observed pairwise-complete Kendall estimate and the same pairwise complete-case overlap.
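As a rough illustration of what a pairwise complete-case overlap means for a single pair of columns, the base R sketch below keeps only the observations where both values are present and computes Kendall's tau on that subset. The toy vectors are made up for this example, and the code uses stats::cor() rather than the package internals.

```r
# Toy data (invented for illustration) with missing values in different rows
x <- c(1.2, NA, 3.1, 4.0, 2.2, 5.5)
y <- c(0.7, 1.9, NA, 2.0, 3.8, 4.1)

ok         <- complete.cases(x, y)   # TRUE where both x and y are observed
n_complete <- sum(ok)                # pairwise complete-case sample size
tau        <- cor(x[ok], y[ok], method = "kendall")

n_complete
tau
```

With several columns, the same filtering is applied separately for each $(i,j)$ pair, so different entries of the matrix can rest on different sample sizes.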

With ci_method = "fieller", the interval is built on the Fisher-style transformed scale $z = \operatorname{atanh}(\hat\tau)$ using Fieller's asymptotic standard error $$ \operatorname{SE}(z) = \sqrt{\frac{0.437}{n - 4}}, $$ where $n$ is the pairwise complete-case sample size. The interval is then mapped back with tanh() and clipped to $[-1, 1]$ for numerical safety. This is the default Kendall CI and is intended to be the fast, production-oriented choice.
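The Fieller construction described above is short enough to sketch directly in base R. The helper name fieller_ci is hypothetical and the code is a minimal sketch of the stated formula, assuming $n > 4$; it is not the package's internal implementation.

```r
# Hypothetical helper illustrating the formula above; assumes n > 4.
fieller_ci <- function(tau, n, conf_level = 0.95) {
  stopifnot(n > 4, conf_level > 0, conf_level < 1)
  z    <- atanh(tau)                        # Fisher-style transform
  se   <- sqrt(0.437 / (n - 4))             # Fieller's asymptotic SE
  crit <- qnorm(1 - (1 - conf_level) / 2)   # two-sided normal quantile
  lims <- tanh(z + c(-1, 1) * crit * se)    # map back to the tau scale
  pmin(pmax(lims, -1), 1)                   # clip to [-1, 1]
}

fieller_ci(tau = 0.3, n = 50)
```

Because the interval is symmetric on the transformed scale, it is generally asymmetric around $\hat\tau$ after back-transformation, more so as $|\hat\tau|$ approaches 1.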

With ci_method = "brown_benedetti", the interval is computed on the Kendall tau scale using the Brown-Benedetti large-sample variance for Kendall's tau-b. This path is tie-aware, remains on the original Kendall scale, and is intended as a conventional asymptotic alternative when a direct tau-scale interval is preferred.

With ci_method = "if_el", the interval is computed in 'C++' using an influence-function empirical-likelihood construction built from the linearised Kendall estimating equation. The lower and upper limits are found by solving the empirical-likelihood ratio equation against the $\chi^2_1$-cutoff implied by conf_level. This method is slower than "fieller" and is intended for specialised inference.

Performance:

In the two-vector mode (y supplied), the C++ backend uses a raw-double path with minimal overhead.
In the matrix/data-frame mode, the no-missing estimate-only path uses the Knight (1966) $O(n \log n)$ algorithm. Pairwise-complete inference paths recompute each pair on its complete-case overlap; the "brown_benedetti" interval adds tie-aware large-sample variance calculations and "if_el" adds extra per-pair likelihood solving.
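To give a feel for where the $O(n \log n)$ cost comes from, the sketch below sorts by one variable and counts inversions in the other with a merge sort; with no ties, the inversion count equals $D$ and $C = n_0 - D$. This is a simplified illustration that assumes there are no ties in either variable; it omits the tie corrections a full implementation needs and is written in plain R rather than 'C++'. The function names are hypothetical.

```r
# Merge sort that also counts inversions (pairs i < j with v[i] > v[j]).
count_inversions <- function(v) {
  n <- length(v)
  if (n < 2) return(list(sorted = v, inv = 0))
  mid   <- n %/% 2
  left  <- count_inversions(v[seq_len(mid)])
  right <- count_inversions(v[(mid + 1):n])
  a <- left$sorted
  b <- right$sorted
  out <- numeric(n)
  i <- 1
  j <- 1
  inv <- left$inv + right$inv
  for (k in seq_len(n)) {
    if (j > length(b) || (i <= length(a) && a[i] <= b[j])) {
      out[k] <- a[i]
      i <- i + 1
    } else {
      # b[j] is smaller than every remaining element of a:
      # each of those elements forms one inversion with b[j]
      out[k] <- b[j]
      j <- j + 1
      inv <- inv + (length(a) - i + 1)
    }
  }
  list(sorted = out, inv = inv)
}

# Kendall's tau assuming NO ties in x or y (tau-a and tau-b coincide).
tau_no_ties <- function(x, y) {
  n  <- length(x)
  n0 <- n * (n - 1) / 2
  d  <- count_inversions(y[order(x)])$inv   # discordant pairs
  (n0 - 2 * d) / n0                         # (C - D) / n0 with C = n0 - D
}

set.seed(1)
x <- rnorm(200)
y <- x + rnorm(200)
tau_no_ties(x, y)
cor(x, y, method = "kendall")
```

The sort and the merge-based count each take $O(n \log n)$ comparisons, in contrast to the $O(n^2)$ pair enumeration.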

References

Kendall, M. G. (1938). A New Measure of Rank Correlation. Biometrika, 30(1/2), 81-93.

Knight, W. R. (1966). A Computer Method for Calculating Kendall's Tau with Ungrouped Data. Journal of the American Statistical Association, 61(314), 436-439.

Fieller, E. C., Hartley, H. O., & Pearson, E. S. (1957). Tests for rank correlation coefficients. I. Biometrika, 44(3/4), 470-481.

Brown, M. B., & Benedetti, J. K. (1977). Sampling behavior of tests for correlation in two-way contingency tables. Journal of the American Statistical Association, 72(358), 309-315.

Huang, Z., & Qin, G. (2023). Influence function-based confidence intervals for the Kendall rank correlation coefficient. Computational Statistics, 38(2), 1041-1055.

Croux, C., & Dehon, C. (2010). Influence functions of the Spearman and Kendall correlation measures. Statistical Methods & Applications, 19, 497-515.

Examples

# Basic usage with a matrix
set.seed(123)  # make the simulated data reproducible
mat <- cbind(a = rnorm(100), b = rnorm(100), c = rnorm(100))
kt <- kendall_tau(mat)
print(kt)
summary(kt)
plot(kt)

# Confidence intervals
kt_ci <- kendall_tau(mat[, 1:3], ci = TRUE)
print(kt_ci, show_ci = "yes")
summary(kt_ci)

# Two-vector mode (scalar path)
x <- rnorm(1000); y <- 0.5 * x + rnorm(1000)
kendall_tau(x, y)

# Including ties
tied_df <- data.frame(
  v1 = rep(1:5, each = 20),
  v2 = rep(5:1, each = 20),
  v3 = rnorm(100)
)
kt_tied <- kendall_tau(tied_df, ci = TRUE, ci_method = "fieller")
print(kt_tied, show_ci = "yes")

# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  view_corr_shiny(kt)
}
