
The function implements the Pan et al. (2018) multivariate two- or k-sample Ball Divergence test using the bd.test implementation from the Ball package.
BallDivergence(X1, X2, ..., n.perm = 0, seed = 42, num.threads = 0,
               kbd.type = "sum", weight = c("constant", "variance"),
               args.bd.test = NULL)
An object of class htest
with the following components:
Observed value of the test statistic
Permutation p value (only if n.perm > 0 and for two datasets)
Number of permutations for permutation test
Number of observations for each dataset
Description of the test
The dataset names
The alternative hypothesis
X1: First dataset as matrix or data.frame
X2: Second dataset as matrix or data.frame
...: Optionally, further datasets as matrices or data.frames (illustrated in the sketch after this list)
n.perm: Number of permutations for the permutation test (default: 0, no permutation test performed). Note that for more than two samples, no test is performed.
seed: Random seed (default: 42)
num.threads: Number of threads (default: 0, all available cores are used)
kbd.type: Character specifying which k-sample test statistic will be used. Must be one of "sum" (default), "maxsum", or "max".
weight: Character specifying the weight form of the Ball Divergence test statistic. Must be one of "constant" (default) or "variance".
args.bd.test: Further arguments passed to bd.test as a named list.
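A brief illustration of the k-sample interface described above (a minimal sketch, assuming the Ball package is installed; it uses only the arguments shown in the Usage section):

# Three samples from different normal distributions
X1 <- matrix(rnorm(500), ncol = 5)
X2 <- matrix(rnorm(500, mean = 0.5), ncol = 5)
X3 <- matrix(rnorm(500, sd = 2), ncol = 5)
if (requireNamespace("Ball", quietly = TRUE)) {
  # Only the k-sample statistic is returned; no test is performed for more than two samples
  BallDivergence(X1, X2, X3, kbd.type = "maxsum")
}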
Target variable? | Numeric? | Categorical? | K-sample? |
No | Yes | No | Yes |
For n.perm = 0, the asymptotic test is performed. For n.perm > 0, a permutation test is performed.
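For example (a minimal sketch, assuming the Ball package is installed):

if (requireNamespace("Ball", quietly = TRUE)) {
  Y1 <- matrix(rnorm(500), ncol = 5)
  Y2 <- matrix(rnorm(500), ncol = 5)
  BallDivergence(Y1, Y2, n.perm = 0)    # asymptotic test
  BallDivergence(Y1, Y2, n.perm = 100)  # permutation test with 100 permutations
}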
The Ball Divergence is defined as the square of the measure difference over a given closed ball collection. The empirical test performed here is based on the difference between averages of metric ranks. It is robust to outliers and heavy-tailed data and suitable for imbalanced sample sizes.
The Ball Divergence of two distributions is zero if and only if the distributions coincide. Therefore, low values of the test statistic indicate similarity and the test rejects for large values of the test statistic.
For the k-sample case, there are three options for aggregating the pairwise Ball Divergences. First, one can sum up all pairwise Ball Divergences (kbd.type = "sum"). Next, one can find the sample with the largest difference to the others, i.e. take the maximum of the sums of all Ball Divergences of each sample with all other samples (kbd.type = "maxsum"). Last, one can sum up the largest K - 1 pairwise Ball Divergences (kbd.type = "max").
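The following base R sketch only illustrates the three aggregation schemes on a hypothetical matrix D of pairwise Ball Divergences between K samples; it does not reproduce the package's internal computations.

# Hypothetical symmetric matrix of pairwise Ball Divergences between K = 3 samples
D <- matrix(c(0.0, 0.2, 0.5,
              0.2, 0.0, 0.3,
              0.5, 0.3, 0.0), nrow = 3, byrow = TRUE)
K <- nrow(D)
sum(D[upper.tri(D)])                           # "sum": sum of all pairwise divergences
max(rowSums(D))                                # "maxsum": largest summed divergence of one sample to all others
sum(sort(D[upper.tri(D)], decreasing = TRUE)[seq_len(K - 1)])  # "max": sum of the K - 1 largest pairwise divergences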
This implementation is a wrapper function around the function bd.test that modifies the in- and output of that function to match the other functions provided in this package. For more details see bd.test and bd.
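For comparison, a minimal sketch of calling the underlying function directly (assuming the Ball package is installed; the output follows Ball's own naming conventions rather than those of this wrapper):

if (requireNamespace("Ball", quietly = TRUE)) {
  Z1 <- matrix(rnorm(500), ncol = 5)
  Z2 <- matrix(rnorm(500, mean = 0.5), ncol = 5)
  # Direct call with default settings; see ?Ball::bd.test for its argument names
  Ball::bd.test(Z1, Z2)
}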
Pan, W., Tian, T. Y., Wang, X., Zhang, H. (2018). Ball Divergence: Nonparametric two sample test. Annals of Statistics, 46(3), 1109-1137. doi:10.1214/17-AOS1579
Zhu, J., Pan, W., Zheng, W., Wang, X. (2021). Ball: An R Package for Detecting Distribution Difference and Association in Metric Spaces. Journal of Statistical Software, 97(6). doi:10.18637/jss.v097.i06
Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statistical Surveys, 18, 163-298. doi:10.1214/24-SS149
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Calculate Ball Divergence and perform test
if (requireNamespace("Ball", quietly = TRUE)) {
  BallDivergence(X1, X2, n.perm = 100)
}