DataSimilarity (version 0.1.1)

LHZ: Li et al. (2022) empirical characteristic distance

Description

The function implements the Li et al. (2022) empirical characteristic distance between two datasets.

Usage

LHZ(X1, X2, n.perm = 0, seed = 42)

Value

An object of class htest with the following components:

method

Description of the test

statistic

Observed value of the test statistic

p.value

Permutation p value (only if n.perm > 0)

data.name

The dataset names

alternative

The alternative hypothesis

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

n.perm

Number of permutations for permutation test (default: 0, no permutation test performed)

seed

Random seed (default: 42)

Applicability

Target variable? No
Numeric? Yes
Categorical? No
K-sample? No

Details

The test statistic
$$T_{n, m} = \frac{1}{n^2} \sum_{j, q = 1}^n \left( \left\Vert \frac{1}{n} \sum_{k=1}^n e^{i\langle X_k, X_j-X_q \rangle} - \frac{1}{m} \sum_{l=1}^m e^{i\langle Y_l, X_j-X_q\rangle} \right\Vert^2 \right) + \frac{1}{m^2} \sum_{j, q = 1}^m \left( \left\Vert \frac{1}{n} \sum_{k=1}^n e^{i\langle X_k, Y_j-Y_q \rangle} - \frac{1}{m} \sum_{l=1}^m e^{i\langle Y_l, Y_j-Y_q\rangle} \right\Vert^2 \right)$$
is calculated according to Li et al. (2022). The datasets are denoted by \(X\) and \(Y\) with respective sample sizes \(n\) and \(m\). The \(i\)-th row of dataset \(X\) is denoted by \(X_i\), and likewise \(Y_l\) is the \(l\)-th row of \(Y\). In the exponents, \(i\) is the imaginary unit. Furthermore, \(\Vert \cdot \Vert\) indicates the Euclidean norm (for the complex-valued differences above, the modulus) and \(\langle X_i, X_j \rangle\) indicates the inner product between \(X_i\) and \(X_j\).
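The formula above can be transcribed almost literally into base R. The following is a minimal sketch, not the package's implementation: the function name lhz_sketch and its helpers are hypothetical, and agreement with the value returned by LHZ has not been checked here. It builds all pairwise row differences, evaluates both empirical characteristic functions at each difference, and averages the squared modulus of their gap.

```r
# Hypothetical helper, written directly from the formula; not part of DataSimilarity.
lhz_sketch <- function(X, Y) {
  X <- as.matrix(X)
  Y <- as.matrix(Y)

  # All pairwise row differences A_j - A_q, one difference per row
  pair_diffs <- function(A) {
    idx <- expand.grid(j = seq_len(nrow(A)), q = seq_len(nrow(A)))
    A[idx$j, , drop = FALSE] - A[idx$q, , drop = FALSE]
  }

  # Average over all evaluation points t (rows of D) of
  # |ecf_X(t) - ecf_Y(t)|^2, where ecf is the empirical characteristic function
  term <- function(D) {
    ecf_X <- colMeans(exp(1i * (X %*% t(D))))
    ecf_Y <- colMeans(exp(1i * (Y %*% t(D))))
    mean(Mod(ecf_X - ecf_Y)^2)
  }

  # First sum uses differences of X rows, second sum differences of Y rows
  term(pair_diffs(X)) + term(pair_diffs(Y))
}

# Small illustrative datasets (memory grows with n^2 and m^2 evaluation points)
set.seed(1)
X <- matrix(rnorm(60), ncol = 3)
Y <- matrix(rnorm(60, mean = 0.5), ncol = 3)
lhz_sketch(X, Y)
```

The sketch stores an n x n^2 complex matrix, so it is only practical for small samples; a loop over evaluation points would trade speed for memory.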

Low values of the test statistic indicate similarity. Therefore, the permutation test rejects for large values of the test statistic.
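The rejection rule for large values can be illustrated with a generic permutation scheme. This is a sketch under stated assumptions, not the code used by LHZ: the names perm_pvalue and mean_shift are hypothetical, the statistic is a simple stand-in rather than the characteristic distance, and the "+1" p-value convention is one common choice that the package may or may not use.

```r
# Hypothetical generic permutation test; `stat` is any two-sample statistic
# for which large values indicate dissimilarity.
perm_pvalue <- function(X, Y, stat, n.perm = 200, seed = 42) {
  set.seed(seed)
  X <- as.matrix(X)
  Y <- as.matrix(Y)
  pooled <- rbind(X, Y)
  n <- nrow(X)
  observed <- stat(X, Y)

  # Recompute the statistic after randomly reassigning rows to the two groups
  permuted <- replicate(n.perm, {
    idx <- sample(nrow(pooled))
    stat(pooled[idx[seq_len(n)], , drop = FALSE],
         pooled[idx[-seq_len(n)], , drop = FALSE])
  })

  # Share of permuted statistics at least as large as the observed one
  # (assumed convention: add 1 to numerator and denominator)
  (1 + sum(permuted >= observed)) / (n.perm + 1)
}

# Stand-in statistic (NOT the LHZ statistic): squared distance of column means
mean_shift <- function(A, B) sum((colMeans(A) - colMeans(B))^2)

set.seed(1)
X <- matrix(rnorm(200), ncol = 4)
Y <- matrix(rnorm(200, mean = 1), ncol = 4)
p <- perm_pvalue(X, Y, mean_shift)
p
```

With the package itself, the corresponding call would be LHZ(X1, X2, n.perm = ...) with a positive number of permutations, as described under Arguments.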

References

Li, X., Hu, W. and Zhang, B. (2022). Measuring and testing homogeneity of distributions by characteristic distance. Statistical Papers 64 (2), 529-556. doi:10.1007/s00362-022-01327-7

Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

See Also

LHZStatistic

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Calculate LHZ statistic
LHZ(X1, X2)