Performs the generalized permutation-based kernel (GPK) two-sample tests proposed by Song and Chen (2021). The implementation here uses the kertests function from the kerTests package.
This function is intended to be used e.g. in comparison studies where all four test statistics need to be calculated at the same time. Since large parts of the calculation coincide, using this function should be faster than computing all four statistics individually.
kerTests(X1, X2, n.perm = 0, sigma = findSigma(X1, X2), r1 = 1.2,
r2 = 0.8, seed = 42)
A list with the following components:
Observed values of the test statistics
Asymptotic or permutation p values
Needed for pretty printing of results
Needed for pretty printing of results
Description of the test
Needed for pretty printing of results
First dataset as matrix or data.frame
Second dataset as matrix or data.frame
Number of permutations for permutation test (default: 0, fast test is performed). For fast = FALSE, only the permutation test and no asymptotic test is available. For fast = TRUE, either an asymptotic test (set n.perm = 0) or a permutation test (set n.perm > 0) can be performed.
Bandwidth parameter of the kernel. By default the median heuristic is used to choose sigma.
Constant in the test statistic \(Z_{W, r1}\) for the fast test (default: 1.2, proposed in original article)
Constant in the test statistic \(Z_{W, r2}\) for the fast test (default: 0.8, proposed in original article)
Random seed (default: 42)
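The median heuristic mentioned for sigma can be illustrated with a minimal base R sketch. It assumes one common variant of the heuristic, namely the median of all pairwise Euclidean distances in the pooled sample. The helper name median_heuristic is hypothetical, and findSigma may use a different variant (for example, one based on squared distances).

```r
# Hypothetical helper: one common form of the median heuristic.
# Assumption: sigma = median pairwise Euclidean distance of the pooled sample.
median_heuristic <- function(X1, X2) {
  Z <- rbind(as.matrix(X1), as.matrix(X2))  # pool both datasets row-wise
  median(stats::dist(Z))                    # median over all pairwise distances
}

set.seed(1)
A <- matrix(rnorm(200), ncol = 2)
B <- matrix(rnorm(200, mean = 0.5), ncol = 2)
sigma_hat <- median_heuristic(A, B)
```

A value obtained this way could be passed as the sigma argument instead of relying on the default.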
Target variable? | Numeric? | Categorical? | K-sample? |
No | Yes | No | No |
The GPK test is motivated by the observation that the MMD test performs poorly for detecting differences in variances. The unbiased MMD\(^2\) estimator for a given kernel function \(k\) can be written as $$\text{MMD}_u^2 = \alpha + \beta - 2\gamma, \text{ where}$$ $$\alpha = \frac{1}{n_1^2 - n_1}\sum_{i=1}^{n_1}\sum_{j=1, j\ne i}^{n_1} k(X_{1i}, X_{1j}),$$ $$\beta = \frac{1}{n_2^2 - n_2}\sum_{i=1}^{n_2}\sum_{j=1, j\ne i}^{n_2} k(X_{2i}, X_{2j}),$$ $$\gamma = \frac{1}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} k(X_{1i}, X_{2j}).$$ The GPK test statistic is defined as $$\text{GPK} = (\alpha - \text{E}(\alpha), \beta - \text{E}(\beta))\Sigma^{-1} \binom{\alpha - \text{E}(\alpha)}{\beta - \text{E}(\beta)}$$ $$= Z_{W,1}^2 + Z_D^2\text{ with}$$ $$Z_{W,r} = \frac{W_r - \text{E}(W_r)}{\sqrt{\text{Var}(W_r)}}, \quad W_r = r\frac{n_1 \alpha}{n_1 + n_2} + \frac{n_2 \beta}{n_1 + n_2},$$ $$Z_D = \frac{D - \text{E}(D)}{\sqrt{\text{Var}(D)}}, \quad D = n_1(n_1 - 1)\alpha - n_2(n_2 - 1)\beta,$$ where the expectations and variances are calculated under the null and \(\Sigma\) is the covariance matrix of \(\alpha\) and \(\beta\) under the null.
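The quantities \(\alpha\), \(\beta\), \(\gamma\) and \(\text{MMD}_u^2\) can be computed directly from kernel matrices. The following base R sketch is for illustration only. It assumes a Gaussian kernel of the form \(k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))\), which may be parametrized differently in kerTests, and the helper names gauss_kernel and mmd_components are hypothetical and not part of any package.

```r
# Hypothetical helper: Gaussian kernel matrix between the rows of A and B.
# Assumption: k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).
gauss_kernel <- function(A, B, sigma) {
  sq <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  exp(-pmax(sq, 0) / (2 * sigma^2))  # pmax guards against tiny negative values
}

# Hypothetical helper: alpha, beta, gamma and the unbiased MMD^2 estimate.
mmd_components <- function(X1, X2, sigma) {
  n1 <- nrow(X1)
  n2 <- nrow(X2)
  K11 <- gauss_kernel(X1, X1, sigma)
  K22 <- gauss_kernel(X2, X2, sigma)
  K12 <- gauss_kernel(X1, X2, sigma)
  # Within-sample averages exclude the diagonal terms (j != i)
  alpha <- (sum(K11) - sum(diag(K11))) / (n1^2 - n1)
  beta  <- (sum(K22) - sum(diag(K22))) / (n2^2 - n2)
  # Between-sample average uses all n1 * n2 pairs
  gamma <- sum(K12) / (n1 * n2)
  c(alpha = alpha, beta = beta, gamma = gamma,
    mmd2 = alpha + beta - 2 * gamma)
}

set.seed(1)
S1 <- matrix(rnorm(200), ncol = 2)
S2 <- matrix(rnorm(200, mean = 0.5), ncol = 2)
res <- mmd_components(S1, S2, sigma = 1)
```

The sketch stops at the raw components; the standardization to \(Z_{W,r}\) and \(Z_D\) requires the null expectations and variances, which are handled by kertests.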
The asymptotic null distribution for GPK is unknown. Therefore, only a permutation test can be performed.
For \(r \ne 1\), the asymptotic null distribution of \(Z_{W,r}\) is normal, but for \(r\) further away from 1, the test performance decreases. Therefore, \(r_1 = 1.2\) and \(r_2 = 0.8\) are proposed as a compromise.
For the fast GPK test, three (asymptotic or permutation) tests based on \(Z_{W, r1}\), \(Z_{W, r2}\) and \(Z_{D}\) are conducted and the overall p value is calculated as 3 times the minimum of the three p values.
For the fast MMD test, only the two asymptotic tests based on \(Z_{W, r1}\) and \(Z_{W, r2}\) are used and the p value is 2 times the minimum of the two p values. This is an approximation of the MMD permutation test, see MMD.
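The way the individual p values are combined for the fast tests can be sketched in a few lines of R. The helper name combine_min_p and the input p values are hypothetical. Capping the result at 1 is an assumption added here so that the output remains a valid p value; the text above only states the multiplication by the number of tests.

```r
# Hypothetical helper: multiply the smallest p value by the number of tests.
# Assumption: the result is capped at 1 to remain a valid p value.
combine_min_p <- function(p) min(1, length(p) * min(p))

# Hypothetical p values for the tests based on Z_{W, r1}, Z_{W, r2} and Z_D
p_fast_gpk <- combine_min_p(c(ZW1 = 0.04, ZW2 = 0.20, ZD = 0.01))  # 3 * 0.01 = 0.03
# Fast MMD uses only the two tests based on Z_{W, r1} and Z_{W, r2}
p_fast_mmd <- combine_min_p(c(ZW1 = 0.04, ZW2 = 0.20))             # 2 * 0.04 = 0.08
```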
This implementation is a wrapper around the function kertests that modifies the input and output of that function to match the other functions provided in this package. For more details see the documentation of kertests.
Song, H. and Chen, H. (2021). Generalized Kernel Two-Sample Tests. arXiv preprint. doi:10.1093/biomet/asad068.
Song H, Chen H (2023). kerTests: Generalized Kernel Two-Sample Tests. R package version 0.1.4, https://CRAN.R-project.org/package=kerTests
Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149
GPK, MMD
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform GPK tests
if (requireNamespace("kerTests", quietly = TRUE)) {
  kerTests(X1, X2, n.perm = 100)
}