Wasserstein: Wasserstein Distance based Test

Description

Performs a permutation two-sample test based on the Wasserstein distance. This function is a wrapper around wasserstein_permut from the Ecume package.

Usage

Wasserstein(X1, X2, n.perm = 0, fast = (nrow(X1) + nrow(X2)) > 1000, 
            S = max(1000, (nrow(X1) + nrow(X2))/2), seed = 42, ...)

Value

An object of class htest with the following components:

statistic: Observed value of the test statistic
p.value: Permutation p value (only computed if n.perm > 0)
alternative: The alternative hypothesis
method: Description of the test
data.name: The dataset names
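
As an illustration of how components of an htest object are usually accessed, the following sketch uses t.test from base R as a stand-in, since it also returns an object of class htest. The values shown are those of a t test, not of the Wasserstein test; only the access pattern carries over.

```r
# Stand-in example: t.test() also returns an object of class "htest"
set.seed(1)
res <- t.test(rnorm(50), rnorm(50, mean = 0.5))

class(res)        # "htest"
res$statistic     # observed value of the test statistic
res$p.value       # p value
res$alternative   # alternative hypothesis
res$method        # description of the test
res$data.name     # names of the data
```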

Arguments

X1: First dataset as matrix or data.frame
X2: Second dataset as matrix or data.frame
n.perm: Number of permutations for permutation test (default: 0, no test is performed).
fast: Should the subwasserstein approximate function be used? (default: TRUE if the pooled sample size is more than 1000)
S: Number of samples to use for approximation if fast = TRUE. See subwasserstein
seed: Random seed (default: 42)
...: Other parameters passed to wasserstein or wasserstein1d, e.g. the power $p\ge 1$
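
The defaults of fast and S depend on the pooled sample size. The following sketch only re-evaluates the default expressions from the Usage section for two sample-size settings; the helper name default_settings is hypothetical and not part of the package.

```r
# Hypothetical helper: evaluates the default expressions for `fast` and `S`
# exactly as they appear in the Usage section
default_settings <- function(n1, n2) {
  list(
    fast = (n1 + n2) > 1000,
    S    = max(1000, (n1 + n2) / 2)
  )
}

small <- default_settings(100, 100)    # pooled size 200
large <- default_settings(1500, 1500)  # pooled size 3000

small$fast  # FALSE: exact computation is used
small$S     # 1000
large$fast  # TRUE: approximation is used
large$S     # 1500
```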

Applicability

Target variable?	Numeric?	Categorical?	K-sample?
No	Yes	No	No

Details

A permutation test for the $p$-Wasserstein distance is performed. By default, the 1-Wasserstein distance is calculated using Euclidean distances. The $p$-Wasserstein distance between two probability measures $\mu$ and $\nu$ on a Euclidean space $M$ is defined as $$W_p(\mu, \nu) = \left(\inf_{\gamma\in\Gamma(\mu,\nu)}\int_{M\times M} ||x - y||^p \text{d} \gamma(x, y)\right)^{\frac{1}{p}},$$ where $\Gamma(\mu,\nu)$ is the set of probability measures on $M\times M$ such that $\mu$ and $\nu$ are the marginal distributions.
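
To make the definition concrete, here is a minimal sketch for the special case of two univariate samples of equal size. In that case the optimal coupling pairs the sorted values, so the empirical $p$-Wasserstein distance reduces to a mean over differences of order statistics. The function name wasserstein_1d_equal is hypothetical; this is not the code used by Ecume or transport.

```r
# Empirical p-Wasserstein distance for two univariate samples of equal size.
# With equal sizes, the optimal transport plan matches the i-th smallest
# value of x to the i-th smallest value of y.
wasserstein_1d_equal <- function(x, y, p = 1) {
  stopifnot(length(x) == length(y), p >= 1)
  mean(abs(sort(x) - sort(y))^p)^(1 / p)
}

set.seed(42)
x <- rnorm(200)
y <- rnorm(200, mean = 0.5)

d_same  <- wasserstein_1d_equal(x, x)         # identical samples: 0
d_shift <- wasserstein_1d_equal(x, x + 2)     # pure shift by 2: 2 (up to rounding)
d_xy    <- wasserstein_1d_equal(x, y, p = 2)  # 2-Wasserstein distance of x and y
```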

Because the Wasserstein distance is a metric on probability distributions, it is zero if and only if the two distributions coincide. Therefore, low values of the statistic indicate similarity of the datasets, and the test rejects for high values.
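
The idea of rejecting for high values can be sketched as a generic permutation test: the group labels are reshuffled repeatedly, and the p value is the share of permuted statistics at least as large as the observed one. The names perm_test_sketch and w1 are hypothetical, the statistic is a simple univariate 1-Wasserstein distance for equal sample sizes, and the "+1" correction in the p value is one common convention; wasserstein_permut may differ in these details.

```r
# Univariate 1-Wasserstein distance for equal-sized samples (illustration only)
w1 <- function(a, b) mean(abs(sort(a) - sort(b)))

# Generic permutation test that rejects for large values of `stat`
perm_test_sketch <- function(x, y, stat, n.perm = 200, seed = 42) {
  set.seed(seed)
  observed <- stat(x, y)
  pooled <- c(x, y)
  n1 <- length(x)
  perm_stats <- replicate(n.perm, {
    idx <- sample(length(pooled), n1)   # random reassignment to group 1
    stat(pooled[idx], pooled[-idx])
  })
  # Share of permuted statistics at least as large as the observed one
  p.value <- (1 + sum(perm_stats >= observed)) / (n.perm + 1)
  list(statistic = observed, p.value = p.value)
}

set.seed(1)
x <- rnorm(100)
y <- rnorm(100, mean = 1)
res <- perm_test_sketch(x, y, stat = w1)
res$statistic
res$p.value
```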

This implementation is a wrapper around the function wasserstein_permut that adapts its input and output to match the other functions provided in this package. For more details, see wasserstein_permut.

References

Rachev, S. T. (1991). Probability metrics and the stability of stochastic models. John Wiley & Sons, Chichester.

Roux de Bezieux, H. (2021). Ecume: Equality of 2 (or $k$) Continuous Univariate and Multivariate Distributions. R package version 0.9.1, https://CRAN.R-project.org/package=Ecume

Schuhmacher, D., Bähre, B., Gottschlich, C., Hartmann, V., Heinemann, F., Schmitzer, B. and Schrieber, J. (2019). transport: Computation of Optimal Transport Plans and Wasserstein Distances. R package version 0.15-0. https://cran.r-project.org/package=transport

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform Wasserstein distance based test 
if(requireNamespace("Ecume", quietly = TRUE)) {
  Wasserstein(X1, X2, n.perm = 100)
}
