YMRZL: Yu et al. (2007) Two-Sample Test

Description

Performs the Yu et al. (2007) two-sample test. This implementation uses classifier_test from the Ecume package.

Usage

YMRZL(X1, X2, n.perm = 0, split = 0.7, control = NULL, 
       train.args = NULL, seed = 42)

Value

An object of class htest with the following components:

statistic: Observed value of the test statistic
p.value: Asymptotic p value, or permutation p value if n.perm > 0
alternative: The alternative hypothesis
method: Description of the test
data.name: The dataset names
classifier: Chosen classification method (tree)

Arguments

X1: First dataset as matrix or data.frame
X2: Second dataset as matrix or data.frame
n.perm: Number of permutations for permutation test (default: 0, asymptotic test is performed).
split: Proportion of observations used for training (default: 0.7)
control: Control parameters for fitting. See trainControl. If control = NULL, defaults to caret::trainControl(method = "boot"), as recommended. The number of bootstrap samples defaults to 25 and can be changed via the number argument of caret::trainControl.
train.args: Further arguments passed to train as a named list.
seed: Random seed (default: 42)

Applicability

Target variable?	Numeric?	Categorical?	K-sample?
No	Yes	Yes	No

Details

The two-sample test proposed by Yu et al. (2007) first combines the datasets into a pooled dataset and creates a target variable that records the dataset membership of each observation. The pooled sample is then split into a training set and a test set, and a classification tree is trained on the training data. The classification error on the test set is used as the test statistic. If the distributions of the datasets do not differ, the classifier cannot distinguish between the datasets, so the test error will be close to chance level. The test rejects if the test error is significantly smaller than chance level.
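The steps above can be sketched in a few lines of R. This is a simplified illustration, not the code used by YMRZL or Ecume: it calls rpart directly with its default settings, skips the caret-based tuning, and uses made-up simulated data and variable names (pooled, membership, test.error).

```r
library(rpart)

set.seed(42)
# Two hypothetical samples that differ in their mean
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

# Pool the datasets and add the dataset membership as target variable
pooled <- data.frame(rbind(X1, X2))
pooled$membership <- factor(rep(c("X1", "X2"), c(nrow(X1), nrow(X2))))

# Split the pooled sample into training (70%) and test data
train.idx <- sample(nrow(pooled), size = floor(0.7 * nrow(pooled)))
train <- pooled[train.idx, ]
test <- pooled[-train.idx, ]

# Fit a classification tree on the training data (untuned, rpart defaults)
tree <- rpart(membership ~ ., data = train, method = "class")

# The classification error on the test set serves as the test statistic
pred <- predict(tree, newdata = test, type = "class")
test.error <- mean(pred != test$membership)
test.error
```

With equally sized samples, a test error well below 0.5 indicates that the tree can tell the two datasets apart.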

The tree model is fit by rpart, and the classification error used for tuning is by default estimated with the Bootstrap .632+ estimator, as recommended by Yu et al. (2007).

For n.perm > 0, a permutation test is conducted. Otherwise, an asymptotic binomial test is performed.
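The difference between the two p value variants can be illustrated with a rough sketch. It rests on several assumptions that are not taken from the package: the chance level is 0.5 (equal sample sizes), the permutation variant shuffles the membership labels and repeats the whole split, fit and predict step, and tree_test_error is a hypothetical helper written for this example. The exact computation inside YMRZL and Ecume may differ.

```r
library(rpart)

# Hypothetical helper: split the data, fit an untuned tree, return the test error
tree_test_error <- function(X, labels, split = 0.7) {
  dat <- data.frame(X, membership = factor(labels))
  train.idx <- sample(nrow(dat), size = floor(split * nrow(dat)))
  fit <- rpart(membership ~ ., data = dat[train.idx, ], method = "class")
  pred <- predict(fit, newdata = dat[-train.idx, ], type = "class")
  mean(pred != dat$membership[-train.idx])
}

set.seed(42)
# Simulated pooled data with 100 observations per dataset
X <- rbind(matrix(rnorm(500), ncol = 5),
           matrix(rnorm(500, mean = 1), ncol = 5))
labels <- rep(c("X1", "X2"), each = 100)

obs.error <- tree_test_error(X, labels)

# Binomial variant: under H0 the number of test errors is treated as
# Binomial(n.test, 0.5); few errors are evidence against H0
n.test <- nrow(X) - floor(0.7 * nrow(X))
n.errors <- round(obs.error * n.test)
p.asymptotic <- binom.test(n.errors, n.test, p = 0.5,
                           alternative = "less")$p.value

# Permutation variant: recompute the test error under shuffled labels
n.perm <- 100
perm.errors <- replicate(n.perm, tree_test_error(X, sample(labels)))
p.permutation <- (1 + sum(perm.errors <= obs.error)) / (n.perm + 1)

c(asymptotic = p.asymptotic, permutation = p.permutation)
```

The permutation variant refits the tree n.perm times, so it is considerably slower than the binomial variant, but it does not rely on a fixed chance level.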

References

Yu, K., Martin, R., Rothman, N., Zheng, T., Lan, Q. (2007). Two-sample Comparison Based on Prediction Error, with Applications to Candidate Gene Association Studies. Annals of Human Genetics, 71(1). doi:10.1111/j.1469-1809.2006.00306.x

Lopez-Paz, D., and Oquab, M. (2017). Revisiting classifier two-sample tests. ICLR 2017. https://openreview.net/forum?id=SJkXfE5xx

Roux de Bezieux, H. (2021). Ecume: Equality of 2 (or k) Continuous Univariate and Multivariate Distributions. R package version 0.9.1, https://CRAN.R-project.org/package=Ecume.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

Examples

# Draw some data (seed fixed so that the simulated data are reproducible)
set.seed(1234)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform the Yu et al. test
YMRZL(X1, X2)