Performs the two-sample test of Yu et al. (2007), using the classifier_test implementation
from the Ecume package.
YMRZL(X1, X2, n.perm = 0, split = 0.7, control = NULL,
train.args = NULL, seed = 42)
An object of class htest with the following components:
Observed value of the test statistic
Asymptotic p value
The alternative hypothesis
Description of the test
The dataset names
Chosen classification method (tree)
X1: First dataset as matrix or data.frame
X2: Second dataset as matrix or data.frame
n.perm: Number of permutations for permutation test (default: 0, asymptotic test is performed).
split: Proportion of observations used for training
control: Control parameters for fitting. See trainControl. Defaults to caret::trainControl(method = "boot") as recommended if control = NULL. The number of Bootstrap samples defaults to 25 and can be set by specifying the number argument of caret::trainControl.
train.args: Further arguments passed to train as a named list.
seed: Random seed (default: 42)
Target variable? | Numeric? | Categorical? | K-sample? |
No | Yes | Yes | No |
The two-sample test proposed by Yu et al. (2007) first combines the datasets into a pooled dataset and creates a target variable that records the dataset membership of each observation. The pooled sample is then split into a training set and a test set, and a classification tree is trained on the training data. The classification error on the test set is used as the test statistic. If the distributions of the datasets do not differ, the classifier cannot distinguish between the datasets, so the test error will be close to chance level. The test rejects if the test error is significantly smaller than chance level.
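The procedure described above can be sketched in a few lines of R. This is a minimal illustration only, not the implementation used by this function or by Ecume: it assumes the rpart package is installed, fits the tree with rpart's default settings (no caret tuning), and the helper name tree_test_error is hypothetical.

```r
library(rpart)  # assumption: rpart is installed

# Hypothetical helper: pooled-sample tree classifier and its held-out error.
tree_test_error <- function(X1, X2, split = 0.7, seed = 42) {
  set.seed(seed)
  # Pool the datasets and record dataset membership as the target variable
  pooled <- data.frame(rbind(X1, X2))
  pooled$membership <- factor(rep(c("X1", "X2"), c(nrow(X1), nrow(X2))))
  # Randomly split the pooled rows into training and test sets
  n <- nrow(pooled)
  train_idx <- sample(n, size = floor(split * n))
  train <- pooled[train_idx, , drop = FALSE]
  test <- pooled[-train_idx, , drop = FALSE]
  # Fit a classification tree and predict membership on the test set
  fit <- rpart(membership ~ ., data = train, method = "class")
  pred <- predict(fit, newdata = test, type = "class")
  list(error = mean(pred != test$membership), n.test = nrow(test))
}

set.seed(1)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
res <- tree_test_error(X1, X2)
res$error  # an error well below 0.5 suggests the distributions differ
```

The value 0.5 is the chance level only when both datasets contribute equally many observations, as in this example.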
The tree model is fit by rpart, and the classification error used for tuning is by default estimated with the Bootstrap .632+ estimator, as recommended by Yu et al. (2007).
For n.perm > 0, a permutation test is conducted. Otherwise, an asymptotic binomial test is performed.
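To illustrate the two ways of obtaining a p value, the following base R sketch computes a one-sided binomial p value from the number of test errors and, alternatively, a permutation p value. Both helpers (binom_p_value, perm_p_value) are hypothetical names. The chance level of 0.5 assumes equally sized datasets, and shuffling the true test labels against fixed predictions is only one possible permutation scheme; the scheme actually used by the underlying implementation is not described here.

```r
# Hypothetical helper: one-sided binomial p value for the number of
# misclassified test observations under a chance-level error rate.
binom_p_value <- function(n.errors, n.test, chance = 0.5) {
  # P(errors <= observed) if each prediction is wrong with probability `chance`
  pbinom(n.errors, size = n.test, prob = chance)
}

# Hypothetical helper: permutation p value obtained by shuffling the true
# test labels against fixed predictions (assumed scheme, see text above).
perm_p_value <- function(pred, truth, n.perm = 1000, seed = 42) {
  set.seed(seed)
  observed <- mean(pred != truth)
  permuted <- replicate(n.perm, mean(pred != sample(truth)))
  # Share of permutations with an error at least as small as the observed one
  (sum(permuted <= observed) + 1) / (n.perm + 1)
}

# 18 errors among 60 test observations is far fewer than the 30 expected by chance
binom_p_value(n.errors = 18, n.test = 60)
```

Small p values arise when the observed error is well below chance level, matching the rejection rule described above.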
Yu, K., Martin, R., Rothman, N., Zheng, T., Lan, Q. (2007). Two-sample Comparison Based on Prediction Error, with Applications to Candidate Gene Association Studies. Annals of Human Genetics, 71(1). doi:10.1111/j.1469-1809.2006.00306.x
Lopez-Paz, D., and Oquab, M. (2017). Revisiting classifier two-sample tests. ICLR 2017. https://openreview.net/forum?id=SJkXfE5xx
Roux de Bezieux, H. (2021). Ecume: Equality of 2 (or k) Continuous Univariate and Multivariate Distributions. R package version 0.9.1, https://CRAN.R-project.org/package=Ecume.
Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149
C2ST, HMN
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform the Yu et al. test
YMRZL(X1, X2)
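A further usage sketch, based only on the arguments documented above, requests a permutation test and a larger number of bootstrap samples for tuning. It assumes that YMRZL() and the caret package are available in the session, and it is guarded so that it does nothing otherwise. The values 100 and 50 are arbitrary illustrative choices.

```r
# Assumption: YMRZL() and the caret package are available in the session.
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
if (exists("YMRZL") && requireNamespace("caret", quietly = TRUE)) {
  # 50 bootstrap samples for tuning instead of the default 25
  ctrl <- caret::trainControl(method = "boot", number = 50)
  # n.perm > 0 requests a permutation test instead of the asymptotic test
  res <- YMRZL(X1, X2, n.perm = 100, split = 0.7, control = ctrl)
  print(res)
}
```

Because the result is an object of class htest, printing it shows the test statistic, the p value, and the description of the test in the usual format.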