Performs a permutation two-sample test based on the Wasserstein distance. The test is computed by the wasserstein_permut function from the Ecume package.
Wasserstein(X1, X2, n.perm = 0, fast = (nrow(X1) + nrow(X2)) > 1000,
S = max(1000, (nrow(X1) + nrow(X2))/2), seed = 42, ...)
An object of class htest with the following components:
statistic: Observed value of the test statistic
p.value: Asymptotic p value
alternative: The alternative hypothesis
method: Description of the test
data.name: The dataset names
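Objects of class htest are ordinary lists, so their components can be read with `$`. The sketch below illustrates this with stats::t.test, used here only as a stand-in because it returns the same class and needs no extra packages; the component names shown are those of the standard htest class and are assumed to apply to the object returned by Wasserstein as well.

```r
# Stand-in example: t.test also returns an object of class "htest"
set.seed(1)
res <- t.test(rnorm(20), rnorm(20, mean = 0.5))

class(res)        # "htest"
res$statistic     # observed value of the test statistic
res$p.value       # p value
res$alternative   # the alternative hypothesis
res$method        # description of the test
res$data.name     # the dataset names
```

Printing the object directly gives the usual formatted test summary.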
X1: First dataset as matrix or data.frame
X2: Second dataset as matrix or data.frame
n.perm: Number of permutations for permutation test (default: 0, no test is performed).
fast: Should the subwasserstein approximate function be used? (default: TRUE if the pooled sample size is more than 1000)
S: Number of samples to use for approximation if fast = TRUE. See subwasserstein
seed: Random seed (default: 42)
...: Other parameters passed to wasserstein or wasserstein1d, e.g. the power \(p\ge 1\)
Target variable? | Numeric? | Categorical? | K-sample? |
No | Yes | No | No |
A permutation test for the \(p\)-Wasserstein distance is performed. By default, the 1-Wasserstein distance is calculated using Euclidean distances. The \(p\)-Wasserstein distance between two probability measures \(\mu\) and \(\nu\) on a Euclidean space \(M\) is defined as $$W_p(\mu, \nu) = \left(\inf_{\gamma\in\Gamma(\mu,\nu)}\int_{M\times M} ||x - y||^p \text{d} \gamma(x, y)\right)^{\frac{1}{p}},$$ where \(\Gamma(\mu,\nu)\) is the set of probability measures on \(M\times M\) such that \(\mu\) and \(\nu\) are the marginal distributions.
As the Wasserstein distance of two distributions is a metric, it is zero if and only if the distributions coincide. Therefore, low values of the statistic indicate similarity of the datasets and the test rejects for high values.
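The idea can be sketched in base R for the special case of two univariate samples of equal size, where the empirical \(p\)-Wasserstein distance reduces to comparing sorted values. The helper names wasserstein_1d and perm_test_1d are hypothetical and exist only for this illustration, and the p value convention with the added 1 is an assumption; none of this is the code used by wasserstein_permut.

```r
# Empirical p-Wasserstein distance between two univariate samples of equal size.
# For equal sample sizes the optimal coupling pairs the sorted values.
wasserstein_1d <- function(x, y, p = 1) {
  stopifnot(length(x) == length(y), p >= 1)
  mean(abs(sort(x) - sort(y))^p)^(1 / p)
}

# Permutation test: reshuffle the pooled sample and recompute the distance.
perm_test_1d <- function(x, y, n.perm = 200, p = 1) {
  observed <- wasserstein_1d(x, y, p)
  pooled <- c(x, y)
  n <- length(x)
  permuted <- replicate(n.perm, {
    idx <- sample(length(pooled), n)
    wasserstein_1d(pooled[idx], pooled[-idx], p)
  })
  # Large distances speak against equal distributions: count the upper tail
  p.value <- (sum(permuted >= observed) + 1) / (n.perm + 1)
  list(statistic = observed, p.value = p.value)
}

set.seed(42)
x <- rnorm(100)
y <- rnorm(100, mean = 1)
res <- perm_test_1d(x, y)
res$statistic
res$p.value
```

Identical samples give a distance of zero, and a clear location shift such as the one above yields a distance far in the upper tail of the permutation distribution.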
This implementation is a wrapper around the function wasserstein_permut that modifies the in- and output of that function to match the other functions provided in this package. For more details, see the documentation of wasserstein_permut.
Rachev, S. T. (1991). Probability metrics and the stability of stochastic models. John Wiley & Sons, Chichester.
Roux de Bezieux, H. (2021). Ecume: Equality of 2 (or \(k\)) Continuous Univariate and Multivariate Distributions. R package version 0.9.1, https://CRAN.R-project.org/package=Ecume
Schuhmacher, D., Bähre, B., Gottschlich, C., Hartmann, V., Heinemann, F., Schmitzer, B. and Schrieber, J. (2019). transport: Computation of Optimal Transport Plans and Wasserstein Distances. R package version 0.15-0. https://cran.r-project.org/package=transport
Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform Wasserstein distance based test
if (requireNamespace("Ecume", quietly = TRUE)) {
  Wasserstein(X1, X2, n.perm = 100)
}