The function implements the Biau and Gyorfi (2005) two-sample homogeneity test. This test uses the \(L_1\)-distance between two empirical distribution functions restricted to a finite partition.
BG(X1, X2, partition = rectPartition, exponent = 0.8, eps = 0.01, seed = 42, ...)
An object of class htest with the following components:
statistic: Observed value of the (asymptotic) test statistic
p.value: p value
method: Description of the test
data.name: The dataset names
alternative: The alternative hypothesis
X1: First dataset as matrix or data.frame
X2: Second dataset as matrix or data.frame of the same sample size as X1
partition: Function that creates a finite partition for the subspace spanned by the two datasets (default: rectPartition, see Details)
exponent: Exponent used in the partition function, should be between 0 and 1 (default: 0.8)
eps: Small threshold to guarantee edge points are included (default: 0.01)
seed: Random seed (default: 42)
...: Further arguments to be passed to the partition function
Target variable? | Numeric? | Categorical? | K-sample? |
No | Yes | No | No |
The Biau and Gyorfi (2005) two-sample homogeneity test is defined for two datasets of the same sample size.
By default, a rectangular partition (rectPartition) is calculated under the assumption of approximately equal cell probabilities. Use the exponent argument to choose the number of elements of the partition \(m_n\) according to the convergence criteria in Biau and Gyorfi (2005). By default, \(m_n = n^{0.8}\). For each of the \(p\) variables of the datasets, \(m_n^{1/p} + 1\) cutpoints are created along the range of both datasets to define the partition, with at least three cutpoints per variable (min, max, and one point splitting the data into two bins).
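The cutpoint rule described above can be sketched roughly as follows. This is an illustration only, not the package's rectPartition: the helper name sketch_cutpoints is hypothetical, and both the equal spacing of the cutpoints and the rounding of \(m_n^{1/p}\) with ceiling() are assumptions.

```r
# Hypothetical helper, NOT the package's rectPartition():
# builds cutpoints per variable over the pooled range of both datasets.
sketch_cutpoints <- function(X1, X2, exponent = 0.8, eps = 0.01) {
  X1 <- as.matrix(X1)
  X2 <- as.matrix(X2)
  n <- nrow(X1)
  p <- ncol(X1)
  m_n <- n^exponent                          # target number of cells
  # Assumption: round m_n^(1/p) up; always keep min, max and one split point
  n_cut <- max(3, ceiling(m_n^(1 / p)) + 1)
  pooled <- rbind(X1, X2)
  lapply(seq_len(p), function(j) {
    r <- range(pooled[, j])
    # Widen the range by eps so that edge points fall inside the outer cells
    # (assumption: equally spaced cutpoints)
    seq(r[1] - eps, r[2] + eps, length.out = n_cut)
  })
}

set.seed(42)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
cuts <- sketch_cutpoints(X1, X2)
lengths(cuts)  # number of cutpoints per variable
```

With \(n = 100\) and \(p = 10\), \(m_n^{1/p}\) is below 2, so the minimum of three cutpoints per variable applies in this sketch.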
The test statistic is the \(L_1\)-distance between the vectors of the proportions of points falling into each cell of the partition for each dataset. An asymptotic test is performed using a standardized version of the \(L_1\) distance that is approximately standard normally distributed (Corollary to Theorem 2 in Biau and Gyorfi (2005)). Low values of the test statistic indicate similarity. Therefore, the test rejects for large values of the test statistic.
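As a rough illustration of the statistic, the sketch below computes the \(L_1\)-distance between cell proportions and a standardized version of it. The helper name sketch_bg_stat is hypothetical. The centering term \(\frac{2}{\sqrt{\pi}}\sqrt{m_n/n}\) and the variance \(\sigma^2 = 2(1 - 2/\pi)\) follow from a normal approximation under equal cell probabilities and are assumed here to match the corollary in Biau and Gyorfi (2005). Whether the package uses the same constants, or counts \(m_n\) as the number of grid cells, is not confirmed by this page.

```r
# Hypothetical sketch, NOT the package's BG() implementation.
# cuts: list with one sorted vector of cutpoints per variable.
sketch_bg_stat <- function(X1, X2, cuts) {
  X1 <- as.matrix(X1)
  X2 <- as.matrix(X2)
  n <- nrow(X1)
  stopifnot(nrow(X2) == n, ncol(X1) == length(cuts))
  # Label each observation by the grid cell it falls into
  cell_id <- function(X) {
    bins <- lapply(seq_len(ncol(X)), function(j) findInterval(X[, j], cuts[[j]]))
    do.call(paste, c(bins, sep = "-"))
  }
  id1 <- cell_id(X1)
  id2 <- cell_id(X2)
  cells <- union(id1, id2)  # empty cells contribute zero to the distance
  prop1 <- as.numeric(table(factor(id1, levels = cells))) / n
  prop2 <- as.numeric(table(factor(id2, levels = cells))) / n
  T_n <- sum(abs(prop1 - prop2))
  m_n <- prod(lengths(cuts) - 1)  # assumption: m_n = number of grid cells
  # Assumed centering and scaling constants (see lead-in)
  z <- sqrt(n) * (T_n - 2 / sqrt(pi) * sqrt(m_n / n)) / sqrt(2 * (1 - 2 / pi))
  list(T_n = T_n, z = z, p.value = pnorm(z, lower.tail = FALSE))
}

set.seed(42)
n <- 200
X1 <- matrix(rnorm(n), ncol = 1)
X2 <- matrix(rnorm(n, mean = 0.5), ncol = 1)
r <- range(X1, X2)
cuts <- list(seq(r[1] - 0.01, r[2] + 0.01, length.out = ceiling(n^0.8) + 1))
res <- sketch_bg_stat(X1, X2, cuts)
res
```

The one-sided p value reflects that only large values of the statistic count as evidence against homogeneity.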
Biau, G. and Gyorfi, L. (2005). On the asymptotic properties of a nonparametric \(L_1\)-test statistic of homogeneity, IEEE Transactions on Information Theory, 51(11), 3965-3973. doi:10.1109/TIT.2005.856979
Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149
rectPartition
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform BG test
BG(X1, X2)