C2ST: Classifier Two-Sample Test

Description

Performs the Classifier Two-Sample Test (C2ST) of Lopez-Paz & Oquab (2017). Two or more ($\ge 2$) samples can be compared. The function is a wrapper around classifier_test from the Ecume package.

Usage

C2ST(X1, X2, ..., split = 0.7, thresh = 0, method = "knn", control = NULL, 
      train.args = NULL, seed = 42)

Value

An object of class htest with the following components:

statistic: Observed value of the test statistic
p.value: Asymptotic p value
alternative: The alternative hypothesis
method: Description of the test
data.name: The dataset names
classifier: Chosen classification method

Arguments

X1: First dataset as matrix or data.frame
X2: Second dataset as matrix or data.frame
...: Optionally more datasets as matrices or data.frames
split: Proportion of observations used for training
thresh: Value to add to the null hypothesis value (default: 0). The null hypothesis tested can be formulated as $H_0: t = p_0 + \text{thresh}$, where $t$ denotes the test accuracy of the classifier and $p_0$ is the chance level (the proportion of the largest dataset in the pooled sample).
method: Classifier to use during training (default: "knn"). See details for possible options.
control: Control parameters for fitting. See caret::trainControl. Defaults to NULL, in which case it is set to caret::trainControl(method = "cv").
train.args: Further arguments passed to train as a named list.
seed: Random seed (default: 42)
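
As an illustration of how the arguments above fit together, the following sketch calls C2ST with non-default settings. The concrete values (split = 0.8, thresh = 0.05, five-fold cross-validation, tuneLength = 3) are arbitrary choices for demonstration, not recommendations, and it is assumed here that the function is exported by the DataSimilarity package. The call only runs if DataSimilarity, Ecume and caret are installed.

```r
# Simulated data, as in the Examples section
set.seed(1)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

# Illustrative settings only (assumed values, not package defaults)
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("Ecume", quietly = TRUE) &&
    requireNamespace("caret", quietly = TRUE)) {
  # 5-fold cross-validation for tuning the classifier
  ctrl <- caret::trainControl(method = "cv", number = 5)
  res <- DataSimilarity::C2ST(
    X1, X2,
    split = 0.8,                        # 80% of observations used for training
    thresh = 0.05,                      # test H0: t = p0 + 0.05
    method = "knn",
    control = ctrl,
    train.args = list(tuneLength = 3),  # passed on to the training function
    seed = 1
  )
  # Components of the returned htest object
  print(res$statistic)
  print(res$p.value)
}
```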

Applicability

Target variable?	Numeric?	Categorical?	K-sample?
No	Yes	Yes	Yes

Details

The classifier two-sample test first combines the datasets into a pooled dataset and creates a target variable that records the dataset membership of each observation. The pooled sample is then split into a training set and a test set, and a classifier is trained on the training data. The classification accuracy on the test data serves as the test statistic. If the distributions of the datasets do not differ, the classifier cannot distinguish between the datasets, so the test accuracy will be close to chance level. The test rejects if the test accuracy is significantly greater than chance level.
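
The steps described above can be sketched in base R as follows. This is a minimal illustration, not the code used by C2ST or classifier_test: a logistic regression (glm) stands in for the caret classifier, and the one-sided normal approximation used for the p value is an assumption made here that need not match what classifier_test computes internally.

```r
set.seed(42)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

# 1. Pool the datasets and record dataset membership as the target
pooled <- data.frame(rbind(X1, X2))
pooled$y <- rep(c(0, 1), times = c(nrow(X1), nrow(X2)))

# 2. Split the pooled sample into training (70%) and test set
n <- nrow(pooled)
train_idx <- sample(n, size = round(0.7 * n))
train <- pooled[train_idx, ]
test <- pooled[-train_idx, ]

# 3. Train a classifier on the training data
#    (logistic regression as a stand-in for a caret model)
fit <- glm(y ~ ., data = train, family = binomial())
pred <- as.numeric(predict(fit, newdata = test, type = "response") > 0.5)

# 4. Test accuracy is the test statistic
acc <- mean(pred == test$y)

# 5. Compare with the chance level p0 (proportion of the largest dataset)
#    using a one-sided normal approximation (assumption of this sketch)
p0 <- max(table(pooled$y)) / n
z <- (acc - p0) / sqrt(p0 * (1 - p0) / nrow(test))
p_value <- pnorm(z, lower.tail = FALSE)

c(accuracy = acc, chance = p0, p.value = p_value)
```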

Any classification method available within the caret framework can be passed as method. A list of possible models can, for example, be retrieved via

names(caret::getModelInfo())[sapply(caret::getModelInfo(), function(x) "Classification" %in% x$type)]

This implementation is a wrapper around the function classifier_test that modifies the input and output of that function to match the other functions provided in this package. For more details, see the documentation of classifier_test.

References

Lopez-Paz, D., and Oquab, M. (2017). Revisiting classifier two-sample tests. ICLR 2017. https://openreview.net/forum?id=SJkXfE5xx.

Roux de Bezieux, H. (2021). Ecume: Equality of 2 (or k) Continuous Univariate and Multivariate Distributions. R package version 0.9.1, https://CRAN.R-project.org/package=Ecume.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform classifier two-sample test 
if(requireNamespace("Ecume", quietly = TRUE)) {
  C2ST(X1, X2)
}
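
Since more than two datasets can be supplied, a three-sample call might look like the sketch below. The third dataset X3 and its sample size are hypothetical additions for illustration, and it is assumed that the function is exported by the DataSimilarity package. The chance level $p_0$ is computed by hand first, following its definition as the proportion of the largest dataset in the pooled sample.

```r
set.seed(42)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Hypothetical third dataset with fewer observations (50 rows)
X3 <- matrix(rnorm(500, mean = 1), ncol = 10)

# Chance level: proportion of the largest dataset in the pooled sample
sizes <- c(nrow(X1), nrow(X2), nrow(X3))
p0 <- max(sizes) / sum(sizes)
p0

# Three-sample test, run only if the required packages are installed
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("Ecume", quietly = TRUE)) {
  DataSimilarity::C2ST(X1, X2, X3)
}
```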