DataSimilarity (version 0.1.1)

BG: Biau and Gyorfi (2005) two-sample homogeneity test

Description

The function implements the Biau and Gyorfi (2005) two-sample homogeneity test. This test uses the \(L_1\)-distance between two empirical distribution functions restricted to a finite partition.

Usage

BG(X1, X2, partition = rectPartition, exponent = 0.8, eps = 0.01, seed = 42, ...)

Value

An object of class htest with the following components:

statistic

Observed value of the (asymptotic) test statistic

p.value

p value

method

Description of the test

data.name

The dataset names

alternative

The alternative hypothesis

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame of the same sample size as X1

partition

Function that creates a finite partition for the subspace spanned by the two datasets (default: rectPartition, see Details)

exponent

Exponent used in the partition function, should be between 0 and 1 (default: 0.8)

eps

Small threshold to guarantee edge points are included (default: 0.01)

seed

Random seed (default: 42)

...

Further arguments to be passed to the partition function

Applicability

Target variable? | Numeric? | Categorical? | K-sample?
No | Yes | No | No

Details

The Biau and Gyorfi (2005) two-sample homogeneity test is defined for two datasets of the same sample size.

By default, a rectangular partition (rectPartition) is calculated under the assumption of approximately equal cell probabilities. Use the exponent argument to choose the number of elements of the partition \(m_n\) according to the convergence criteria in Biau and Gyorfi (2005). The default is \(m_n = n^{0.8}\). For each of the \(p\) variables of the datasets, \(m_n^{1/p} + 1\) cutpoints are created along the range of both datasets to define the partition, with at least three cutpoints per variable (min, max, and one point splitting the data into two bins).
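The cutpoint rule described above can be sketched in a few lines of base R. This is only an illustration, not the code of rectPartition: the helper name sketch_cutpoints is hypothetical, and both the rounding of \(m_n^{1/p}\) (here: rounded up) and the equal spacing of the cutpoints over the pooled range are assumptions made for the sketch.

```r
# Hypothetical helper for illustration only; NOT the DataSimilarity implementation.
sketch_cutpoints <- function(X1, X2, exponent = 0.8) {
  X1 <- as.matrix(X1)
  X2 <- as.matrix(X2)
  n <- nrow(X1)        # common sample size of both datasets
  p <- ncol(X1)        # number of variables
  m_n <- n^exponent    # target number of partition elements
  # Assumption: round m_n^(1/p) up, then enforce at least three cutpoints
  n_cut <- max(3, ceiling(m_n^(1 / p)) + 1)
  pooled <- rbind(X1, X2)
  # Assumption: equally spaced cutpoints from the pooled min to the pooled max
  lapply(seq_len(p), function(j) {
    seq(min(pooled[, j]), max(pooled[, j]), length.out = n_cut)
  })
}

set.seed(1)
X1 <- matrix(rnorm(200), ncol = 2)
X2 <- matrix(rnorm(200, mean = 0.5), ncol = 2)
cuts <- sketch_cutpoints(X1, X2)
lengths(cuts)  # number of cutpoints per variable
```

With \(n = 100\) and \(p = 2\), \(m_n = 100^{0.8} \approx 39.8\) and \(m_n^{1/2} \approx 6.3\), so under the rounding assumption the sketch produces 8 cutpoints (7 bins) per variable. With \(p = 10\) the minimum of three cutpoints per variable applies.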

The test statistic is the \(L_1\)-distance between the vectors of the proportions of points falling into each cell of the partition for each dataset. An asymptotic test is performed using a standardized version of the \(L_1\) distance that is approximately standard normally distributed (Corollary to Theorem 2 in Biau and Gyorfi (2005)). Low values of the test statistic indicate similarity. Therefore, the test rejects for large values of the test statistic.
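The unstandardized statistic, \(\sum_j |\hat{\mu}_1(A_j) - \hat{\mu}_2(A_j)|\) over the cells \(A_j\) of the partition, can be sketched as follows. The helper sketch_l1 and the three-cutpoint grid are hypothetical and chosen only for illustration; the standardization from Biau and Gyorfi (2005) that BG applies before computing the p value is deliberately left out.

```r
# Hypothetical helper for illustration only; NOT the DataSimilarity implementation.
# cuts: list with one increasing vector of cutpoints per variable.
sketch_l1 <- function(X1, X2, cuts) {
  cell_id <- function(X) {
    X <- as.matrix(X)
    # Bin index of every observation along every variable
    idx <- vapply(seq_along(cuts), function(j) {
      findInterval(X[, j], cuts[[j]], rightmost.closed = TRUE, all.inside = TRUE)
    }, integer(nrow(X)))
    idx <- matrix(idx, nrow = nrow(X))
    # One label per observation identifying its rectangular cell
    apply(idx, 1, paste, collapse = "-")
  }
  id1 <- cell_id(X1)
  id2 <- cell_id(X2)
  cells <- union(id1, id2)  # cells empty in both samples contribute zero
  prop1 <- as.numeric(table(factor(id1, levels = cells))) / length(id1)
  prop2 <- as.numeric(table(factor(id2, levels = cells))) / length(id2)
  sum(abs(prop1 - prop2))
}

set.seed(1)
X1 <- matrix(rnorm(200), ncol = 2)
X2 <- matrix(rnorm(200, mean = 0.5), ncol = 2)
pooled <- rbind(X1, X2)
# Assumption: three equally spaced cutpoints per variable over the pooled range
cuts <- lapply(1:2, function(j) seq(min(pooled[, j]), max(pooled[, j]), length.out = 3))
sketch_l1(X1, X2, cuts)
```

The value lies between 0 (identical cell proportions) and 2 (no cell contains points from both datasets).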

References

Biau, G. and Gyorfi, L. (2005). On the asymptotic properties of a nonparametric \(L_1\)-test statistic of homogeneity, IEEE Transactions on Information Theory, 51(11), 3965-3973. doi:10.1109/TIT.2005.856979

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

See Also

rectPartition

Examples

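# Load the package and fix the random number generator for reproducible data
library(DataSimilarity)
set.seed(1234)
# Draw some data: two samples with 100 observations of 10 variables each
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform BG test
BG(X1, X2)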
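Since the return value is an htest object, its components can be accessed by name. The following is a small usage sketch that relies only on the arguments and components documented above; the requireNamespace guard and the chosen seed for set.seed are additions for illustration, and the exponent value 0.5 is an arbitrary example.

```r
set.seed(1234)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

res <- NULL
# Only run the test if the DataSimilarity package is installed
if (requireNamespace("DataSimilarity", quietly = TRUE)) {
  res <- DataSimilarity::BG(X1, X2)
  print(res$statistic)  # observed value of the (asymptotic) test statistic
  print(res$p.value)    # p value of the test

  # A different exponent changes the target number of cells m_n = n^exponent
  res_alt <- DataSimilarity::BG(X1, X2, exponent = 0.5)
  print(res_alt$p.value)
}
```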