First, the function checks that the two matrices have the same number
of columns (the dimension).
Suppose matrix \(X\) has \(n\) rows and \(d\) columns,
and matrix \(Y\) has \(m\) rows; the function checks that \(Y\)
also has \(d\) columns (if not, it throws an error).
It then flattens the matrices to vectors (or, if \(d=1\), they are
already vectors).
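The check-and-flatten step can be sketched in C++ as below. This is only an illustration, not the package's actual code: the function names `check_same_columns` and `flatten`, the nested-vector matrix representation, and the row-by-row flattening order are all assumptions.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical matrix representation: M[i][k] is row i, column k.
using Matrix = std::vector<std::vector<double>>;

// Number of columns of M (0 for an empty matrix).
std::size_t num_columns(const Matrix& M) {
    return M.empty() ? 0 : M[0].size();
}

// Throws if X and Y do not have the same number of columns.
void check_same_columns(const Matrix& X, const Matrix& Y) {
    if (num_columns(X) != num_columns(Y)) {
        throw std::invalid_argument("X and Y must have the same number of columns");
    }
}

// Flattens M row by row into a single vector (ordering is an assumption).
std::vector<double> flatten(const Matrix& M) {
    std::vector<double> flat;
    flat.reserve(M.size() * num_columns(M));
    for (const auto& row : M) {
        flat.insert(flat.end(), row.begin(), row.end());
    }
    return flat;
}
```

With \(d=1\) each row holds a single value, so flattening simply returns the column as a vector.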
It then calls the C++ method. If the first sample has \(n\)
\(d\)-dimensional observations and the second sample has
\(m\) \(d\)-dimensional observations, then the algorithm
computes the statistic in \(O((n+m)^2)\) time.
The median difference is defined as follows:
$$ \tilde{m} = \textnormal{median} \{ || x_i - x_j ||_1; \,\, i>j, \,\,
i=1, 2,\dots, n+m,\,\,\textnormal{ and } j=1, 2,\dots, i-1 \}, $$
where \(x_1, \dots, x_{n+m}\) are the pooled observations from both
samples and \( || x_i - x_j ||_1\) is the 1-norm, so if the data
are \(d\)-dimensional then
$$ || x_i - x_j ||_1 = \sum_{k=1}^{d} |x_{i,k} - x_{j,k}|, $$
and finally the median heuristic is \(\beta = 1/\tilde{m}\).
This can be computed in \(O( (n+m)^2 )\) time.
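A direct quadratic-time computation of the median heuristic might look like the following C++ sketch. The function names, the nested-vector representation of the pooled sample, and the convention of averaging the two middle values when the number of pairs is even are assumptions for illustration; the package's own implementation may differ.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// 1-norm distance between two points of equal dimension.
double l1_distance(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        sum += std::fabs(a[k] - b[k]);
    }
    return sum;
}

// Returns beta = 1 / median of all pairwise 1-norm distances (i > j)
// over the pooled sample. Visits every pair once: O(N^2) distances.
// Note: if the median is zero (many identical points), beta is infinite.
double median_heuristic_beta(const std::vector<std::vector<double>>& pooled) {
    const std::size_t N = pooled.size();
    if (N < 2) {
        throw std::invalid_argument("need at least two observations");
    }
    std::vector<double> dists;
    dists.reserve(N * (N - 1) / 2);
    for (std::size_t i = 1; i < N; ++i) {
        for (std::size_t j = 0; j < i; ++j) {
            dists.push_back(l1_distance(pooled[i], pooled[j]));
        }
    }
    std::sort(dists.begin(), dists.end());
    const std::size_t h = dists.size() / 2;
    // Assumed convention: average the two middle values for an even count.
    const double med = (dists.size() % 2 == 1) ? dists[h]
                                               : 0.5 * (dists[h - 1] + dists[h]);
    return 1.0 / med;
}
```

For the three points \((0,0)\), \((1,0)\) and \((0,2)\), the pairwise distances are 1, 2 and 3, so the median is 2 and the sketch returns 0.5.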
The Laplacian kernel \(k\) is defined as
$$ k(x,y) = \exp( -\beta || x - y ||_1 ). $$
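As a small illustration, the Laplacian kernel can be evaluated in C++ as in the sketch below. The function name `laplacian_kernel` and the use of `std::vector<double>` for the points are assumptions, not the package's interface.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Laplacian kernel: exp(-beta * ||x - y||_1), for x and y of equal dimension.
double laplacian_kernel(const std::vector<double>& x,
                        const std::vector<double>& y,
                        double beta) {
    double l1 = 0.0;
    for (std::size_t k = 0; k < x.size(); ++k) {
        l1 += std::fabs(x[k] - y[k]);  // accumulate the 1-norm
    }
    return std::exp(-beta * l1);
}
```

The kernel equals 1 when the two points coincide and decays towards 0 as their 1-norm distance grows; a larger \(\beta\) gives a faster decay.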
The random seed is set for std::mt19937 and std::shuffle in C++.
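A minimal sketch of seeding std::mt19937 and passing it to std::shuffle is shown below. The helper name `shuffled_copy` and the choice of shuffling a vector of integer indices are assumptions for illustration only.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Returns a shuffled copy of `values`, using a std::mt19937 generator
// initialised with the given seed.
std::vector<int> shuffled_copy(std::vector<int> values, unsigned int seed) {
    std::mt19937 gen(seed);                            // seeded generator
    std::shuffle(values.begin(), values.end(), gen);   // permute in place
    return values;
}
```

Calling the helper twice with the same seed in the same build yields the same permutation. The exact permutation produced by std::shuffle is not fixed by the C++ standard, so it may differ between standard library implementations even for the same seed.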