Computes Kendall's tau rank correlation either for all pairs of numeric columns in a matrix/data frame, or for two numeric vectors directly (scalar path).
This function uses a scalable algorithm implemented in 'C++' to compute Kendall's tau-b (tie-robust). When there are no ties, tau-b reduces to tau-a. The implementation follows the Knight (1966) \(O(n \log n)\) scheme: a single sort on one variable, in-block sorting of the paired variable within tie groups, and a global merge-sort-based inversion count with closed-form tie corrections.
Prints a summary of the Kendall's tau correlation matrix, including description and method metadata.
Generates a ggplot2-based heatmap of the Kendall's tau correlation matrix.
kendall_tau(data, y = NULL, check_na = TRUE)
# S3 method for kendall_matrix
print(x, digits = 4, max_rows = NULL, max_cols = NULL, ...)
# S3 method for kendall_matrix
plot(
x,
title = "Kendall's Tau correlation heatmap",
low_color = "indianred1",
high_color = "steelblue1",
mid_color = "white",
value_text_size = 4,
...
)
If y is NULL and data is a matrix/data frame: a
symmetric numeric matrix where entry (i, j) is the Kendall's tau
correlation between the i-th and j-th numeric columns.
If y is provided (two-vector mode): a single numeric scalar,
the Kendall's tau correlation between data and y.
Invisibly returns the kendall_matrix object.
A ggplot object representing the heatmap.
For matrix/data frame mode, a numeric matrix or a data frame with at
least two numeric columns. All non-numeric columns are excluded.
For two-vector mode, a numeric vector.
Optional numeric vector y of the same length as data
when data is a vector. If supplied, the function computes the
Kendall correlation between data and y using a
low-overhead scalar path and returns a single number.
Logical (default TRUE). If TRUE, inputs must
be free of missing/undefined values. Use FALSE only when you have
already filtered or imputed them.
An object of class kendall_matrix.
Integer; number of decimal places to print.
Optional integer; maximum number of rows to display.
If NULL, all rows are shown.
Optional integer; maximum number of columns to display.
If NULL, all columns are shown.
Additional arguments passed to ggplot2::theme() or other
ggplot2 layers.
Plot title. Default is "Kendall's Tau correlation heatmap".
Color for the minimum tau value. Default is
"indianred1".
Color for the maximum tau value. Default is
"steelblue1".
Color for zero correlation. Default is "white".
Font size for displaying correlation values. Default
is 4.
Thiago de Paula Oliveira
Kendall's tau is a rank-based measure of association between two variables. For a dataset with \(n\) observations on variables \(X\) and \(Y\), let \(n_0 = n(n - 1)/2\) be the number of unordered pairs, \(C\) the number of concordant pairs, and \(D\) the number of discordant pairs. Let \(T_x = \sum_g t_g (t_g - 1)/2\) and \(T_y = \sum_h u_h (u_h - 1)/2\) be the numbers of tied pairs within \(X\) and within \(Y\), respectively, where \(t_g\) and \(u_h\) are tie-group sizes in \(X\) and \(Y\).
The tie-robust Kendall's tau-b is: $$ \tau_b = \frac{C - D}{\sqrt{(n_0 - T_x)\,(n_0 - T_y)}}. $$ When there are no ties (\(T_x = T_y = 0\)), this reduces to tau-a: $$ \tau_a = \frac{C - D}{n(n-1)/2}. $$
The function automatically handles ties. In degenerate cases where a
variable is constant (\(n_0 = T_x\) or \(n_0 = T_y\)), the tau-b
denominator is zero and the correlation is undefined (returned as NA).
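The definitions above can be checked on a small example with a direct \(O(n^2)\) pair count. This is a minimal base-R sketch for illustration only, not the package's 'C++' implementation; the helper name tau_b_bruteforce is hypothetical.

```r
# Hypothetical helper: tau-b by enumerating all n(n - 1)/2 unordered pairs.
tau_b_bruteforce <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) >= 2L)
  n   <- length(x)
  idx <- utils::combn(n, 2L)                 # all pairs (i, j) with i < j
  sx  <- sign(x[idx[1, ]] - x[idx[2, ]])     # 0 marks a pair tied in X
  sy  <- sign(y[idx[1, ]] - y[idx[2, ]])     # 0 marks a pair tied in Y
  conc <- sum(sx * sy > 0)                   # C: concordant pairs
  disc <- sum(sx * sy < 0)                   # D: discordant pairs
  n0   <- n * (n - 1) / 2
  t_x  <- sum(sx == 0)                       # T_x: pairs tied in X
  t_y  <- sum(sy == 0)                       # T_y: pairs tied in Y
  denom <- sqrt((n0 - t_x) * (n0 - t_y))
  if (denom == 0) return(NA_real_)           # constant variable: undefined
  (conc - disc) / denom
}

x <- c(1, 2, 2, 3)
y <- c(1, 3, 2, 2)
# Here C = 3, D = 1, n0 = 6, T_x = T_y = 1, so tau-b = 2 / sqrt(5 * 5) = 0.4.
tau_b_bruteforce(x, y)
# Base R's cor() also reports tau-b when ties are present.
all.equal(tau_b_bruteforce(x, y), cor(x, y, method = "kendall"))
```

The brute-force count is only practical for small \(n\), but it is a convenient reference when checking results on tied data.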
Performance:
In the two-vector mode (y supplied), the C++ backend uses a
raw-double path (no intermediate 2\(\times\)2 matrix, no discretisation).
In the matrix/data-frame mode, columns are discretised once and all pairwise correlations are computed via the Knight \(O(n \log n)\) procedure; where available, pairs are evaluated in parallel.
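The Knight-style procedure can be sketched in base R as follows. This is an illustrative sketch under simplifying assumptions (recursive merge sort, ties detected with table(), no handling of constant inputs or missing values), not the package's 'C++' backend; the names count_inversions, tie_pairs and knight_tau_b are hypothetical.

```r
# Count strict inversions (i < j with v[i] > v[j]) with a merge sort.
count_inversions <- function(v) {
  n <- length(v)
  if (n < 2L) return(list(sorted = v, inv = 0))
  mid   <- n %/% 2L
  left  <- count_inversions(v[seq_len(mid)])
  right <- count_inversions(v[(mid + 1L):n])
  a <- left$sorted
  b <- right$sorted
  out <- numeric(n)
  i <- 1L; j <- 1L; k <- 1L
  inv <- left$inv + right$inv
  while (i <= length(a) && j <= length(b)) {
    if (a[i] <= b[j]) {                      # equal values are not inversions
      out[k] <- a[i]; i <- i + 1L
    } else {
      out[k] <- b[j]; j <- j + 1L
      inv <- inv + (length(a) - i + 1)       # b[j] jumps over the rest of a
    }
    k <- k + 1L
  }
  if (i <= length(a)) out[k:n] <- a[i:length(a)]
  if (j <= length(b)) out[k:n] <- b[j:length(b)]
  list(sorted = out, inv = inv)
}

# Number of tied pairs: sum over tie groups of t * (t - 1) / 2.
tie_pairs <- function(v) {
  t <- table(v)
  sum(t * (t - 1) / 2)
}

knight_tau_b <- function(x, y) {
  n    <- length(x)
  ys   <- y[order(x, y)]                     # sort by x, then y within x-ties
  n0   <- n * (n - 1) / 2
  t_x  <- tie_pairs(x)
  t_y  <- tie_pairs(y)
  t_xy <- tie_pairs(paste(x, y))             # pairs tied in both variables
  disc <- count_inversions(ys)$inv           # D: discordant pairs
  # C + D = n0 - t_x - t_y + t_xy, hence C - D = (C + D) - 2 * D.
  (n0 - t_x - t_y + t_xy - 2 * disc) / sqrt((n0 - t_x) * (n0 - t_y))
}

x <- c(1, 2, 2, 3, 4, 4, 5)
y <- c(2, 1, 3, 3, 5, 4, 4)
all.equal(knight_tau_b(x, y), cor(x, y, method = "kendall"))
```

Sorting by x and then by y within x-ties ensures that pairs tied in x never register as inversions, so the inversion count equals the number of discordant pairs and the tie corrections can be applied in closed form.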
Kendall, M. G. (1938). A New Measure of Rank Correlation. Biometrika, 30(1/2), 81–93.
Knight, W. R. (1966). A Computer Method for Calculating Kendall’s Tau with Ungrouped Data. Journal of the American Statistical Association, 61(314), 436–439.
print.kendall_matrix, plot.kendall_matrix
# Basic usage with a matrix
mat <- cbind(a = rnorm(100), b = rnorm(100), c = rnorm(100))
kt <- kendall_tau(mat)
print(kt)
plot(kt)
# Two-vector mode (scalar path)
x <- rnorm(1000); y <- 0.5 * x + rnorm(1000)
kendall_tau(x, y)
# With a large data frame
df <- data.frame(x = rnorm(1e4), y = rnorm(1e4), z = rnorm(1e4))
kendall_tau(df)
# Including ties
tied_df <- data.frame(
v1 = rep(1:5, each = 20),
v2 = rep(5:1, each = 20),
v3 = rnorm(100)
)
kt <- kendall_tau(tied_df)
print(kt)
plot(kt)