matrixCorr (version 0.8.3)

kendall_tau: Pairwise (or Two-Vector) Kendall's Tau Rank Correlation

Description

Computes Kendall's tau rank correlation either for all pairs of numeric columns in a matrix/data frame, or for two numeric vectors directly (scalar path).

This function uses a scalable algorithm implemented in 'C++' to compute Kendall's tau-b (tie-robust). When there are no ties, tau-b reduces to tau-a. The implementation follows the Knight (1966) \(O(n \log n)\) scheme: a single sort on one variable, in-block sorting of the paired variable within tie groups, and a global merge-sort-based inversion count with closed-form tie corrections.

The print method prints a summary of the Kendall's tau correlation matrix, including description and method metadata.

The plot method generates a ggplot2-based heatmap of the Kendall's tau correlation matrix.

Usage

kendall_tau(data, y = NULL, check_na = TRUE)

# S3 method for kendall_matrix
print(x, digits = 4, max_rows = NULL, max_cols = NULL, ...)

# S3 method for kendall_matrix
plot(
  x,
  title = "Kendall's Tau correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  ...
)

Value

  • If y is NULL and data is a matrix/data frame: a symmetric numeric matrix where entry (i, j) is the Kendall's tau correlation between the i-th and j-th numeric columns.

  • If y is provided (two-vector mode): a single numeric scalar, the Kendall's tau correlation between data and y.

The print method invisibly returns the kendall_matrix object.

The plot method returns a ggplot object representing the heatmap.

Arguments

data

In matrix/data-frame mode, a numeric matrix or a data frame with at least two numeric columns; non-numeric columns are excluded. In two-vector mode, a numeric vector x.

y

Optional numeric vector y of the same length as data when data is a vector. If supplied, the function computes the Kendall correlation between data and y using a low-overhead scalar path and returns a single number.

check_na

Logical (default TRUE). If TRUE, inputs must be free of missing/undefined values. Use FALSE only when you have already filtered or imputed them.

x

An object of class kendall_matrix.

digits

Integer; number of decimal places to print.

max_rows

Optional integer; maximum number of rows to display. If NULL, all rows are shown.

max_cols

Optional integer; maximum number of columns to display. If NULL, all columns are shown.

...

Additional arguments passed to ggplot2::theme() or other ggplot2 layers.

title

Plot title. Default is "Kendall's Tau correlation heatmap".

low_color

Color for the minimum tau value. Default is "indianred1".

high_color

Color for the maximum tau value. Default is "steelblue1".

mid_color

Color for zero correlation. Default is "white".

value_text_size

Font size for displaying correlation values. Default is 4.
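The check_na argument above asks for inputs without missing or undefined values. One possible way to prepare such inputs is sketched below. The toy data frame and the names df, keep and clean are illustrative, and treating infinite values as "undefined" is an assumption, not something this page states.

```r
# Toy data with missing and non-finite entries (illustrative only).
df <- data.frame(
  a = c(1.2, NA, 3.1, 4.8, 5.0, 2.2),
  b = c(0.5, 1.1, NaN, 2.9, 3.3, Inf),
  c = c(9, 8, 7, 6, 5, 4)
)

# Keep rows in which every column is finite (drops NA, NaN, Inf, -Inf).
keep  <- Reduce(`&`, lapply(df, is.finite))
clean <- df[keep, , drop = FALSE]

# Once the rows are filtered, skipping the check is a reasonable choice.
# The call is guarded so the sketch also runs without the package installed.
if (requireNamespace("matrixCorr", quietly = TRUE)) {
  matrixCorr::kendall_tau(clean, check_na = FALSE)
}
```

Filtering whole rows keeps the columns aligned, which matters because every pairwise correlation is computed on the same observations.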

Author

Thiago de Paula Oliveira

Details

Kendall's tau is a rank-based measure of association between two variables. For a dataset with \(n\) observations on variables \(X\) and \(Y\), let \(n_0 = n(n - 1)/2\) be the number of unordered pairs, \(C\) the number of concordant pairs, and \(D\) the number of discordant pairs. Let \(T_x = \sum_g t_g (t_g - 1)/2\) and \(T_y = \sum_h u_h (u_h - 1)/2\) be the numbers of tied pairs within \(X\) and within \(Y\), respectively, where \(t_g\) and \(u_h\) are tie-group sizes in \(X\) and \(Y\).

The tie-robust Kendall's tau-b is: $$ \tau_b = \frac{C - D}{\sqrt{(n_0 - T_x)\,(n_0 - T_y)}}. $$ When there are no ties (\(T_x = T_y = 0\)), this reduces to tau-a: $$ \tau_a = \frac{C - D}{n(n-1)/2}. $$

The function automatically handles ties. In degenerate cases where a variable is constant (\(n_0 = T_x\) or \(n_0 = T_y\)), the tau-b denominator is zero and the correlation is undefined (returned as NA).
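The definitions above can be checked with a naive \(O(n^2)\) computation written directly from the formula. This is only a sketch for small inputs: tau_b_naive is a hypothetical helper, not part of matrixCorr, and it is not how the package computes the statistic. Base R's cor(method = "kendall") also yields tau-b and serves as a reference.

```r
# Naive O(n^2) tau-b, written directly from the definitions above.
# `tau_b_naive` is a hypothetical helper, not part of matrixCorr.
tau_b_naive <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) >= 2)
  n  <- length(x)
  n0 <- n * (n - 1) / 2

  # Sign of every pairwise difference; keep each unordered pair once.
  sx <- sign(outer(x, x, "-"))
  sy <- sign(outer(y, y, "-"))
  prod_signs <- (sx * sy)[upper.tri(sx)]

  C <- sum(prod_signs > 0)  # concordant pairs
  D <- sum(prod_signs < 0)  # discordant pairs

  # Tied pairs within one variable: sum over tie groups of t(t - 1)/2.
  tied_pairs <- function(v) {
    t <- as.numeric(table(v))
    sum(t * (t - 1) / 2)
  }
  Tx <- tied_pairs(x)
  Ty <- tied_pairs(y)

  denom <- sqrt((n0 - Tx) * (n0 - Ty))
  if (denom == 0) return(NA_real_)  # constant variable: tau-b undefined
  (C - D) / denom
}

x <- c(1, 2, 2, 3, 4, 4, 5)
y <- c(2, 1, 3, 3, 5, 4, 4)
tau_b_naive(x, y)
cor(x, y, method = "kendall")  # base R reference value (tau-b)
tau_b_naive(rep(1, 5), 1:5)    # NA: one variable is constant
```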

Performance:

  • In the two-vector mode (y supplied), the C++ backend uses a raw-double path (no intermediate 2\(\times\)2 matrix, no discretisation).

  • In the matrix/data-frame mode, columns are discretised once and all pairwise correlations are computed via the Knight \(O(n \log n)\) procedure; where available, pairs are evaluated in parallel.
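The idea behind the Knight procedure can be illustrated in plain R for the simplest case, where neither variable has ties. The sketch sorts by one variable and counts inversions in the other with a merge sort; each inversion is one discordant pair. It omits the tie corrections and is not the package's C++ code; count_inversions and tau_a_knight are hypothetical names.

```r
# Merge sort that also counts inversions (pairs i < j with v[i] > v[j]).
count_inversions <- function(v) {
  n <- length(v)
  if (n < 2) return(list(sorted = v, inv = 0))
  mid   <- n %/% 2
  left  <- count_inversions(v[seq_len(mid)])
  right <- count_inversions(v[(mid + 1):n])
  a <- left$sorted
  b <- right$sorted
  out <- numeric(n)
  i <- 1; j <- 1; k <- 1
  inv <- left$inv + right$inv
  while (i <= length(a) && j <= length(b)) {
    if (a[i] <= b[j]) {
      out[k] <- a[i]; i <- i + 1
    } else {
      # b[j] jumps ahead of every element still waiting in `a`.
      out[k] <- b[j]; j <- j + 1
      inv <- inv + (length(a) - i + 1)
    }
    k <- k + 1
  }
  # Copy whichever half still has elements left.
  if (i <= length(a)) out[k:n] <- a[i:length(a)]
  if (j <= length(b)) out[k:n] <- b[j:length(b)]
  list(sorted = out, inv = inv)
}

# Tau-a for tie-free data: sort by x, then count inversions in y.
tau_a_knight <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) >= 2,
            anyDuplicated(x) == 0, anyDuplicated(y) == 0)
  n  <- length(x)
  n0 <- n * (n - 1) / 2
  D  <- count_inversions(y[order(x)])$inv  # discordant pairs
  (n0 - 2 * D) / n0                        # C - D = n0 - 2D without ties
}

set.seed(1)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)
tau_a_knight(x, y)
cor(x, y, method = "kendall")  # base R reference value
```

The sort and the merge each cost \(O(n \log n)\), which is where the overall complexity quoted above comes from.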

References

Kendall, M. G. (1938). A New Measure of Rank Correlation. Biometrika, 30(1/2), 81–93.

Knight, W. R. (1966). A Computer Method for Calculating Kendall’s Tau with Ungrouped Data. Journal of the American Statistical Association, 61(314), 436–439.

See Also

print.kendall_matrix, plot.kendall_matrix

Examples

# Basic usage with a matrix
mat <- cbind(a = rnorm(100), b = rnorm(100), c = rnorm(100))
kt <- kendall_tau(mat)
print(kt)
plot(kt)

# Two-vector mode (scalar path)
x <- rnorm(1000); y <- 0.5 * x + rnorm(1000)
kendall_tau(x, y)

# With a large data frame
df <- data.frame(x = rnorm(1e4), y = rnorm(1e4), z = rnorm(1e4))
kendall_tau(df)

# Including ties
tied_df <- data.frame(
  v1 = rep(1:5, each = 20),
  v2 = rep(5:1, each = 20),
  v3 = rnorm(100)
)
kt <- kendall_tau(tied_df)
print(kt)
plot(kt)

# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  view_corr_shiny(kt)
}