DataSimilarity (version 0.1.1)

BMG: Biswas et al. (2014) two-sample run test

Description

The function implements the Biswas, Mukhopadhyay and Ghosh (2014) distribution-free two-sample run test. The test uses a heuristic, implemented in the HamiltonPath function, to approximate the shortest Hamilton path through the pooled sample of the two datasets. By default, the asymptotic version of the test is calculated.

Usage

BMG(X1, X2, seed = 42, asymptotic = TRUE)

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic (note: this is not the asymptotic test statistic)

p.value

(asymptotic) p value

method

Description of the test

data.name

The dataset names

alternative

The alternative hypothesis

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

seed

Random seed (default: 42)

asymptotic

Should the asymptotic version of the test be performed (default: TRUE)

Applicability

Target variable? No
Numeric? Yes
Categorical? No
K-sample? No

Details

The test counts the number of edges in the shortest Hamilton path on the pooled sample that connect points from different samples, i.e. $$T_{m,n} = 1 + \sum_{i = 1}^{N-1} U_i, $$ where \(N = m + n\) is the pooled sample size and \(U_i\) is an indicator with \(U_i = 1\) if the \(i\)th edge connects points from different samples and \(U_i = 0\) otherwise.
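The counting step can be illustrated with a minimal base R sketch. The helpers greedy_path and count_runs are hypothetical names introduced here for illustration; the greedy nearest-neighbour ordering is a simple stand-in heuristic and is not claimed to be the algorithm used by HamiltonPath.

```r
# Greedy nearest-neighbour path through all rows of X.
# Stand-in heuristic for illustration only (assumption: NOT the method
# used by HamiltonPath in the package).
greedy_path <- function(X) {
  D <- as.matrix(dist(X))          # pairwise Euclidean distances
  N <- nrow(D)
  path <- integer(N)
  path[1] <- 1L                    # start at the first pooled observation
  visited <- logical(N)
  visited[1] <- TRUE
  for (i in seq_len(N - 1)) {
    d <- D[path[i], ]
    d[visited] <- Inf              # never revisit a point
    path[i + 1] <- which.min(d)
    visited[path[i + 1]] <- TRUE
  }
  path
}

# T = 1 + number of path edges joining points from different samples
count_runs <- function(labels_in_path_order) {
  1 + sum(head(labels_in_path_order, -1) != tail(labels_in_path_order, -1))
}

set.seed(1)
X1 <- matrix(rnorm(40), ncol = 2)
X2 <- matrix(rnorm(40, mean = 3), ncol = 2)
pooled <- rbind(X1, X2)
labels <- rep(1:2, c(nrow(X1), nrow(X2)))

ord <- greedy_path(pooled)
T_obs <- count_runs(labels[ord])
T_obs
```

Well-separated samples tend to give a path with few label changes and hence a small value of the statistic.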

For a combined sample size N less than or equal to 1030, the exact version of the Biswas, Mukhopadhyay and Ghosh (2014) test can be calculated. It uses the null distribution of the univariate run statistic (Wald and Wolfowitz, 1940) to calculate the p value. For N larger than 1030, the calculation of the exact version fails, so the asymptotic version should be used instead.
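As a rough illustration of the exact approach, the classical null distribution of the number of runs in a random arrangement of m and n labels can be tabulated in base R. The function name run_pmf is hypothetical, the sketch is intended for small samples only, and the use of the lower tail for the p value is an assumption made here rather than a statement about the package internals.

```r
# Exact null pmf of the number of runs R for sample sizes m and n
# (classical run-count formulas; run_pmf is a hypothetical helper).
run_pmf <- function(m, n) {
  N <- m + n
  r <- 2:N                         # possible numbers of runs
  k <- r %/% 2
  even <- r %% 2 == 0
  p <- ifelse(
    even,
    # R = 2k: both samples contribute k runs each
    2 * choose(m - 1, k - 1) * choose(n - 1, k - 1),
    # R = 2k + 1: one sample contributes k + 1 runs, the other k
    choose(m - 1, k) * choose(n - 1, k - 1) +
      choose(m - 1, k - 1) * choose(n - 1, k)
  ) / choose(N, m)
  names(p) <- r
  p
}

pmf <- run_pmf(m = 20, n = 30)

# Lower-tail probability for an illustrative observed statistic
# (assumption: few runs are taken as evidence against the null)
T_obs <- 15
p_lower <- sum(pmf[as.integer(names(pmf)) <= T_obs])
p_lower
```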

If an asymptotic test is performed, the asymptotic null distribution is given by $$T_{m, n}^{*} \sim \mathcal{N}(0, 4\lambda^2(1-\lambda)^2)$$ where \(T_{m, n}^{*}= \sqrt{N} (T_{m, n} / N - 2 \lambda (1 - \lambda))\) is the asymptotic test statistic, \(\lambda = m/N\), and \(m\) is the sample size of the first dataset. Therefore, low absolute values of the asymptotic test statistic indicate similarity of the two datasets, whereas high absolute values indicate differences between the datasets.
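The asymptotic statistic and its null standard deviation follow directly from the formulas above. The sketch below uses a hypothetical helper bmg_asymptotic and made-up inputs; the lower-tail normal probability in the last line is shown only as an illustration, since which tail the package reports is not stated here.

```r
# Asymptotic statistic T* and its null standard deviation 2 * lambda * (1 - lambda)
# (bmg_asymptotic is a hypothetical helper, not a package function).
bmg_asymptotic <- function(T_obs, m, n) {
  N <- m + n
  lambda <- m / N
  T_star <- sqrt(N) * (T_obs / N - 2 * lambda * (1 - lambda))
  sd_null <- 2 * lambda * (1 - lambda)   # square root of 4 * lambda^2 * (1 - lambda)^2
  list(T_star = T_star, sd_null = sd_null, z = T_star / sd_null)
}

# Illustrative inputs: 40 runs observed with m = n = 50
res <- bmg_asymptotic(T_obs = 40, m = 50, n = 50)
res$T_star
res$z

# Lower-tail probability under the normal approximation
# (assumption: the choice of tail is illustrative only)
pnorm(res$z)
```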

References

Biswas, M., Mukhopadhyay, M. and Ghosh, A. K. (2014). A distribution-free two-sample run test applicable to high-dimensional data. Biometrika 101 (4), 913-926. doi:10.1093/biomet/asu045

Wald, A. and Wolfowitz, J. (1940). On a test whether two samples are from the same distribution. Annals of Mathematical Statistics 11, 147-162.

Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statistics Surveys 18, 163-298. doi:10.1214/24-SS149

See Also

HamiltonPath

Examples

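# Load the package that provides BMG()
library(DataSimilarity)
# Draw some data (fix the seed so the simulated data are reproducible)
set.seed(1234)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform the asymptotic BMG test (default)
BMG(X1, X2)
# Perform the exact BMG test (possible here because N = 200 <= 1030)
BMG(X1, X2, asymptotic = FALSE)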