DataSimilarity (version 0.1.1)

BQS: Barakat et al. (1996) Two-Sample Test

Description

Performs the nearest-neighbor-based multivariate two-sample test of Barakat et al. (1996).

Usage

BQS(X1, X2, dist.fun = stats::dist, n.perm = 0, dist.args = NULL, seed = 42)

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

Permutation p-value (only computed if n.perm > 0)

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

dist.fun

Function for calculating a distance matrix on the pooled dataset (default: stats::dist, Euclidean distance).

n.perm

Number of permutations for permutation test (default: 0, no test is performed).

dist.args

Named list of further arguments passed to dist.fun (default: NULL).

seed

Random seed (default: 42)
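
The interplay of dist.fun and dist.args can be illustrated with base R alone. The following is a minimal sketch that assumes the named list is simply appended to the call of dist.fun on the pooled data; that mechanism is an assumption for illustration, not taken from the package source, and the toy matrices are made up.

```r
# Toy data: two small samples with two columns each (illustrative only)
X1 <- matrix(c(0, 0, 1, 1), ncol = 2, byrow = TRUE)
X2 <- matrix(c(3, 4, 6, 8), ncol = 2, byrow = TRUE)
pooled <- rbind(X1, X2)

dist.fun  <- stats::dist
dist.args <- list(method = "manhattan")

# Assumed forwarding mechanism: extra arguments are appended to the call
D_manhattan <- as.matrix(do.call(dist.fun, c(list(pooled), dist.args)))

# With dist.args = NULL the call reduces to dist.fun(pooled), i.e. Euclidean
D_euclidean <- as.matrix(do.call(dist.fun, c(list(pooled), NULL)))

D_manhattan[1, 3]  # distance between (0, 0) and (3, 4) under Manhattan
D_euclidean[1, 3]  # the same pair under the Euclidean default
```

Any function that accepts the pooled data as its first argument and returns a distance object could be substituted for stats::dist under this assumption.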

Applicability

Target variable?   Numeric?   Categorical?   K-sample?
No                 Yes        No             No

Details

The test is an extension of the Schilling (1986) and Henze (1988) nearest-neighbor test that avoids having to choose the number of nearest neighbors. The Schilling-Henze test statistic is the proportion of edges connecting points from the same dataset in a K-nearest-neighbor graph calculated on the pooled sample, standardized using its expectation and standard deviation under the null hypothesis. Barakat et al. (1996) take a weighted sum of the Schilling-Henze test statistics for \(K = 1,\dots,N-1\), where \(N\) denotes the pooled sample size.
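
To make the building block concrete, the following base R sketch computes the raw (unstandardized) proportion of same-sample neighbor edges for a fixed K. The function name same_sample_prop is hypothetical, and the sketch deliberately omits both the standardization and the weighting over K used by Barakat et al., so it is not the statistic returned by BQS.

```r
# Hypothetical helper: share of K-nearest-neighbor edges that connect
# points from the same sample (no standardization, no weighting over K)
same_sample_prop <- function(X1, X2, K = 1) {
  pooled <- rbind(as.matrix(X1), as.matrix(X2))
  labels <- rep(c(1L, 2L), times = c(nrow(X1), nrow(X2)))
  D <- as.matrix(stats::dist(pooled))
  diag(D) <- Inf  # a point is never its own neighbor
  same <- vapply(seq_len(nrow(D)), function(i) {
    nn <- order(D[i, ])[seq_len(K)]  # indices of the K nearest neighbors
    sum(labels[nn] == labels[i])     # how many share the label of point i
  }, integer(1))
  sum(same) / (nrow(D) * K)
}

set.seed(1)
A <- matrix(rnorm(40), ncol = 2)              # 20 points around 0
B <- matrix(rnorm(40, mean = 100), ncol = 2)  # 20 points far away
same_sample_prop(A, B, K = 1)   # well-separated samples: all neighbors match
same_sample_prop(A, B, K = 39)  # K = N - 1: every other point is a neighbor
```

For K = N - 1 the proportion no longer depends on the data, only on the two sample sizes, which is one reason the individual statistics are standardized before they are combined.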

As for the Schilling-Henze test, low values of the test statistic indicate similarity of the datasets, so the null hypothesis of equal distributions is rejected for high values. A permutation test is performed only if n.perm is set to a positive number; with the default n.perm = 0, no p-value is computed.
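
The permutation logic described above can be sketched generically in base R. This is an illustration under stated assumptions, not the package's implementation: nn1_stat and perm_p_value are hypothetical names, the statistic is a simple 1-nearest-neighbor same-sample proportion standing in for the BQS statistic, and the p-value uses the common (1 + count) / (n.perm + 1) convention, which may differ from what BQS does internally.

```r
# Hypothetical stand-in statistic: share of points whose nearest
# neighbor in the pooled sample carries the same sample label
nn1_stat <- function(pooled, labels) {
  D <- as.matrix(stats::dist(pooled))
  diag(D) <- Inf
  nn <- apply(D, 1, which.min)
  mean(labels[nn] == labels)
}

# Hypothetical helper: permutation p-value that rejects for high values
perm_p_value <- function(X1, X2, stat = nn1_stat, n.perm = 200, seed = 42) {
  set.seed(seed)
  pooled <- rbind(as.matrix(X1), as.matrix(X2))
  labels <- rep(c(1L, 2L), times = c(nrow(X1), nrow(X2)))
  observed <- stat(pooled, labels)
  # Re-assign sample labels at random and recompute the statistic
  permuted <- replicate(n.perm, stat(pooled, sample(labels)))
  (1 + sum(permuted >= observed)) / (n.perm + 1)
}

set.seed(1)
X1 <- matrix(rnorm(200), ncol = 2)
X2 <- matrix(rnorm(200, mean = 3), ncol = 2)
p <- perm_p_value(X1, X2)
p
```

Because the datasets are compared through labels on a fixed pooled sample, only the labels are shuffled; the points themselves stay in place across permutations.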

References

Barakat, A.S., Quade, D. and Salama, I.A. (1996), Multivariate Homogeneity Testing Using an Extended Concept of Nearest Neighbors. Biom. J., 38: 605-612. doi:10.1002/bimj.4710380509

Schilling, M. F. (1986). Multivariate Two-Sample Tests Based on Nearest Neighbors. Journal of the American Statistical Association, 81(395), 799-806. doi:10.2307/2289012

Henze, N. (1988). A Multivariate Two-Sample Test Based on the Number of Nearest Neighbor Type Coincidences. The Annals of Statistics, 16(2), 772-783.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

See Also

SH, FR, CF, CCS, and ZC for other graph-based tests; FR_cat, CF_cat, CCS_cat, and ZC_cat for versions of these tests for categorical data.

Examples

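# Load the package that provides BQS()
library(DataSimilarity)
# Draw some data (fix the RNG state so the draws are reproducible)
set.seed(1234)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform Barakat et al. test (n.perm = 0 by default, so no p-value is computed)
BQS(X1, X2)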