DataSimilarity (version 0.1.1)

Wasserstein: Wasserstein Distance based Test

Description

Performs a permutation two-sample test based on the Wasserstein distance. The function wraps wasserstein_permut from the Ecume package.

Usage

Wasserstein(X1, X2, n.perm = 0, fast = (nrow(X1) + nrow(X2)) > 1000, 
            S = max(1000, (nrow(X1) + nrow(X2))/2), seed = 42, ...)

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

Permutation p value

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

n.perm

Number of permutations for permutation test (default: 0, no test is performed).

fast

Should the subwasserstein approximate function be used? (default: TRUE if the pooled sample size is more than 1000)

S

Number of samples to use for the approximation if fast = TRUE. See subwasserstein.

seed

Random seed (default: 42)

...

Other parameters passed to wasserstein or wasserstein1d, e.g. the power \(p\ge 1\)
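The defaults for fast and S in the Usage section depend only on the pooled sample size. The following sketch evaluates those two default expressions for given row counts; the helper name default_approx_settings is hypothetical and not part of the package.

```r
# Hypothetical helper: evaluates the default expressions for `fast` and `S`
# shown in the Usage section, given the row counts of X1 and X2.
default_approx_settings <- function(n1, n2) {
  pooled <- n1 + n2
  list(
    fast = pooled > 1000,          # approximation only for large pooled samples
    S    = max(1000, pooled / 2)   # number of samples used by the approximation
  )
}

default_approx_settings(100, 100)    # fast = FALSE, S = 1000
default_approx_settings(2000, 1000)  # fast = TRUE,  S = 1500
```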

Applicability

Target variable?   Numeric?   Categorical?   K-sample?
No                 Yes        No             No

Details

A permutation test for the \(p\)-Wasserstein distance is performed. By default, the 1-Wasserstein distance is calculated using Euclidean distances. The \(p\)-Wasserstein distance between two probability measures \(\mu\) and \(\nu\) on a Euclidean space \(M\) is defined as

$$W_p(\mu, \nu) = \left(\inf_{\gamma\in\Gamma(\mu,\nu)}\int_{M\times M} \|x - y\|^p \,\mathrm{d}\gamma(x, y)\right)^{\frac{1}{p}},$$

where \(\Gamma(\mu,\nu)\) is the set of probability measures on \(M\times M\) whose marginal distributions are \(\mu\) and \(\nu\).
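To make the definition concrete, consider the special case of two univariate samples of equal size. There the optimal coupling pairs the order statistics, so the empirical distance reduces to a mean over sorted differences. The sketch below uses only base R; the function name wasserstein_1d_sketch is hypothetical, and this is an illustration of the definition rather than the code used by Ecume or transport.

```r
# Hypothetical helper: empirical p-Wasserstein distance between two
# univariate samples of equal size. In one dimension the optimal
# coupling matches the i-th smallest x with the i-th smallest y.
wasserstein_1d_sketch <- function(x, y, p = 1) {
  stopifnot(length(x) == length(y), p >= 1)
  mean(abs(sort(x) - sort(y))^p)^(1 / p)
}

set.seed(1)
x <- rnorm(200)
y <- rnorm(200, mean = 0.5)

wasserstein_1d_sketch(x, x)        # identical samples: distance 0
wasserstein_1d_sketch(x, x + 0.5)  # pure shift by 0.5: distance 0.5 (up to floating point)
wasserstein_1d_sketch(x, y, p = 2) # 2-Wasserstein distance between the two samples
```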

Because the Wasserstein distance is a metric on distributions, it is zero if and only if the two distributions coincide. Therefore, low values of the statistic indicate similarity of the datasets, and the test rejects for high values.
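The rejection rule can be illustrated with a small permutation test in base R. This is a minimal sketch for univariate samples of equal size: the helpers w1 and perm_test_sketch are hypothetical, and the p value convention used here (adding one to numerator and denominator) is an assumption that may differ from what wasserstein_permut does internally.

```r
# Hypothetical helper: 1-D empirical 1-Wasserstein distance for
# equal-size samples (mean absolute difference of order statistics).
w1 <- function(x, y) mean(abs(sort(x) - sort(y)))

# Hypothetical helper: permutation test that rejects for large distances.
perm_test_sketch <- function(x, y, n.perm = 200) {
  observed <- w1(x, y)
  pooled <- c(x, y)
  n <- length(x)
  permuted <- replicate(n.perm, {
    idx <- sample(length(pooled), n)   # random relabelling of the pooled sample
    w1(pooled[idx], pooled[-idx])
  })
  # Share of permuted statistics at least as large as the observed one;
  # the +1 counts the observed labelling itself.
  p.value <- (sum(permuted >= observed) + 1) / (n.perm + 1)
  list(statistic = observed, p.value = p.value)
}

set.seed(42)
x <- rnorm(100)
y <- rnorm(100, mean = 1)
res <- perm_test_sketch(x, y)
res$statistic  # observed distance between the two samples
res$p.value    # small values speak against equal distributions
```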

This implementation is a wrapper around wasserstein_permut that adapts the input and output of that function to match the other functions provided in this package. For more details, see wasserstein_permut.

References

Rachev, S. T. (1991). Probability metrics and the stability of stochastic models. John Wiley & Sons, Chichester.

Roux de Bezieux, H. (2021). Ecume: Equality of 2 (or \(k\)) Continuous Univariate and Multivariate Distributions. R package version 0.9.1, https://CRAN.R-project.org/package=Ecume

Schuhmacher, D., Bähre, B., Gottschlich, C., Hartmann, V., Heinemann, F., Schmitzer, B. and Schrieber, J. (2019). transport: Computation of Optimal Transport Plans and Wasserstein Distances. R package version 0.15-0. https://cran.r-project.org/package=transport

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform Wasserstein distance based test
if (requireNamespace("Ecume", quietly = TRUE)) {
  Wasserstein(X1, X2, n.perm = 100)
}
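Since the return value is described above as an object of class htest, its components can be read with the usual list accessors. The following sketch repeats the example with a fixed seed for the simulated data and prints the statistic and p value. It assumes that both DataSimilarity and Ecume are installed and does nothing otherwise.

```r
# Simulate two 10-dimensional samples; the second is shifted by 0.5.
set.seed(42)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

# Only run the test if both packages are available.
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("Ecume", quietly = TRUE)) {
  res <- DataSimilarity::Wasserstein(X1, X2, n.perm = 100)
  print(res$statistic)  # observed value of the test statistic
  print(res$p.value)    # p value based on 100 permutations
}
```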
