The function implements the Classifier Two-Sample Test (C2ST) of Lopez-Paz & Oquab (2017). The test can also compare more than two samples. The implementation uses the classifier_test function from the Ecume package.
C2ST(X1, X2, ..., split = 0.7, thresh = 0, method = "knn", control = NULL,
train.args = NULL, seed = 42)
An object of class htest with the following components:
Observed value of the test statistic
Asymptotic p value
The alternative hypothesis
Description of the test
The dataset names
Chosen classification method
X1: First dataset as matrix or data.frame
X2: Second dataset as matrix or data.frame
...: Optionally more datasets as matrices or data.frames
split: Proportion of observations used for training (default: 0.7)
thresh: Value to add to the null hypothesis value (default: 0). The null hypothesis tested can be formulated as \(H_0: t = p_0 + \) thresh, where \(t\) denotes the test accuracy of the classifier and \(p_0\) is the chance level (proportion of the largest dataset in the pooled sample).
method: Classifier to use during training (default: "knn"). See details for possible options.
control: Control parameters for fitting. See trainControl. Defaults to NULL, in which case it is set to caret::trainControl(method = "cv").
train.args: Further arguments passed to train as a named list.
seed: Random seed (default: 42)
Target variable? | Numeric? | Categorical? | K-sample? |
No | Yes | Yes | Yes |
The classifier two-sample test first combines the datasets into a pooled dataset and creates a target variable that records the dataset membership of each observation. The pooled sample is then split into a training and a test set, and a classifier is trained on the training data. The classification accuracy on the test data serves as the test statistic. If the distributions of the datasets do not differ, the classifier cannot distinguish between the datasets, so the test accuracy will be close to chance level. The test rejects if the test accuracy is significantly greater than chance level.
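The steps described above can be sketched in base R. This is only an illustration under stated assumptions, not the code used by C2ST or classifier_test: logistic regression (glm) stands in for the caret classifier, and the one-sided binomial test used to obtain a p value is an assumption made for this sketch.

```r
# Illustrative sketch of the classifier two-sample idea (not the package's code)
set.seed(42)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

# Pool the datasets and record dataset membership as the target variable
pooled <- data.frame(rbind(X1, X2))
pooled$y <- rep(c(0, 1), times = c(nrow(X1), nrow(X2)))

# Split the pooled sample into a training set (70%) and a test set
n <- nrow(pooled)
train_idx <- sample(n, size = round(0.7 * n))
train <- pooled[train_idx, ]
test <- pooled[-train_idx, ]

# Train a classifier on the training data
# (assumption: logistic regression as a simple stand-in)
fit <- glm(y ~ ., data = train, family = binomial())

# The accuracy on the test data is the test statistic
pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
acc <- mean(pred == test$y)

# Chance level: proportion of the largest dataset in the pooled sample
p0 <- max(table(pooled$y)) / n

# Assumption: one-sided binomial test of accuracy against the chance level
pval <- binom.test(sum(pred == test$y), nrow(test), p = p0,
                   alternative = "greater")$p.value
```

Here acc plays the role of \(t\) and p0 the role of \(p_0\) in the null hypothesis with thresh = 0.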
All methods available for classification within the caret framework can be used as methods. A list of possible models can for example be retrieved via
names(caret::getModelInfo())[sapply(caret::getModelInfo(), function(x) "Classification" %in% x$type)]
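As a hedged usage sketch, the classifier and the fitting options might be set through the method, control, and train.args arguments as follows. It assumes that C2ST is available in the session and that the Ecume and caret packages are installed; the specific values (5-fold cross-validation, tuneLength = 3, split = 0.8) are arbitrary choices for illustration.

```r
set.seed(1)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

res <- NULL
# Only run if C2ST and its dependencies are available (assumption)
if (exists("C2ST") &&
    requireNamespace("Ecume", quietly = TRUE) &&
    requireNamespace("caret", quietly = TRUE)) {
  res <- C2ST(
    X1, X2,
    split = 0.8,                                   # 80% of observations for training
    method = "knn",                                # any caret classification method
    control = caret::trainControl(method = "cv", number = 5),
    train.args = list(tuneLength = 3)              # passed on to caret's train
  )
  print(res)
}
```

Methods other than "knn" may require additional packages to be installed.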
This implementation is a wrapper around the function classifier_test that adapts the input and output of that function to match the other functions provided in this package. For more details, see the documentation of classifier_test.
Lopez-Paz, D., and Oquab, M. (2017). Revisiting classifier two-sample tests. ICLR 2017. https://openreview.net/forum?id=SJkXfE5xx.
Roux de Bezieux, H. (2021). Ecume: Equality of 2 (or k) Continuous Univariate and Multivariate Distributions. R package version 0.9.1, https://CRAN.R-project.org/package=Ecume.
Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149
HMN, YMRZL
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform classifier two-sample test
if (requireNamespace("Ecume", quietly = TRUE)) {
  C2ST(X1, X2)
}