mediandiff: Compute the median difference between pairs of values

Description

Compute the median of all differences between distinct pairs in vectors or matrices.

Usage

mediandiff(X, Y = NULL, kernel = c("Laplacian", "Gaussian"), fast = FALSE)

Value

A scalar, the median of all pairwise differences.

Arguments

X: Numeric vector or matrix of length n.
Y: Numeric vector or matrix of length m, or NULL.
kernel: String, either "Laplacian" or "Gaussian".
fast: Boolean; if TRUE will run $O(N \log N)$ algorithm, where N = n + m, but if FALSE (default) will run naive $O(N^2 \log N)$ algorithm.

Details

The median difference is defined as follows:

$Z$ is the combined $X$ and $Y$ values into a single vector or matrix. Number of columns is the dimension, and these need to be equal for $X$ and $Y$. Then, if the Laplacian kernel is used,

$$ m = \textnormal{median} \{ || x_i - x_j ||_1; \,\, i>j, \,\, i=1, 2,\dots, n+m,\,\,\textnormal{ and } j=1, 2,\dots, i-1 \}, $$

where $ || z_i - z_j ||_1$ is the 1-norm, and so if the data are d-dimensional then

$$ || z_i - z_j ||_1 = \sum_{k=1}^{d} |z_{i,k} - z_{j,k}|. $$

If the Gaussian kernel is specified, then the square of the two-norm is used.

The median heuristic is defined as beta = 1/m.

Naive method will compute all distinct pairs, of which there are $N(N+1)/2$ differences. These are then sorted using a $O(N \log N)$ algorithm, so overall $O(N^2 \log N)$.

The fast method is $O(N \log N)$ is from Croux and Rousseeuw (1992), which is based on Johnson and Mizoguchi (1978).

References

Croux, C. and Rousseeuw, P. J. (1992), "Time-Efficient Algorithms for Two Highly Robust Estimators of Scale" In Computational Statistics: Volume 1: Proceedings of the 10th Symposium on Computational Statistics (pp. 411-428).

Johnson, D.B., and Mizoguchi, T. (1978), "Selecting the Kth Element in X + Y and X_1 + X_2 + ... + X_m", SIAM Journal of Computing, 7, 147-153.

Examples

Run this code


X <- c(7.1, 1.2, 4.3, 0.4)
Y <- c(5.5, 2.6, 8.7)
#using fast method, Laplacian kernel, loglinear in number of observations
md <- mediandiff(X, Y, fast=TRUE)

#using fast method, Gaussian kernel, loglinear in number of observations
md <- mediandiff(X, Y, fast=TRUE, kernel="Gaussian")

#using naive method (default), with Laplacian kernel
md <- mediandiff(X, Y)

Run the code above in your browser using DataLab