Calculates the kernel measure of multi-sample dissimilarity (KMD) and performs a multi-sample permutation test (Huang and Sen, 2023). The implementation here wraps the functions KMD and KMD_test from the KMD package.
KMD(X1, X2, ..., n.perm = 0, graph = "knn", k = ceiling(N/10),
    kernel = "discrete", seed = 42)
An object of class htest with the following components:

statistic: Observed value of the test statistic
p.value: Permutation / asymptotic p value
estimate: Estimated KMD value
alternative: The alternative hypothesis
method: Description of the test
data.name: The dataset names
graph: Graph used for calculation
k: Number of neighbors used if graph is the KNN graph
kernel: Kernel used for calculation
X1: First dataset as matrix or data.frame
X2: Second dataset as matrix or data.frame
...: Optionally, more datasets as matrices or data.frames
n.perm: Number of permutations for the permutation test (default: 0, no permutation test performed)
graph: Graph used in the calculation of the KMD. Possible options are "knn" (default) and "mst".
k: Number of neighbors for construction of the k-nearest neighbor graph. Ignored for graph = "mst".
kernel: Kernel used in the calculation of the KMD. Either "discrete" (default) for the discrete kernel, or a kernel matrix whose numbers of rows and columns equal the number of datasets. In the latter case, the entry in the \(i\)-th row and \(j\)-th column gives the kernel value \(k(i,j)\).
seed: Random seed (default: 42)
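As a sketch of the kernel argument, the following constructs a custom kernel matrix for three samples in which samples 2 and 3 are treated as more alike than either is to sample 1. The matrix values are made up for illustration, and the KMD() call is guarded so it only runs when the wrapper is available:

```r
# Illustrative custom kernel matrix for three datasets: the entry in row i,
# column j is the kernel value k(i, j). Values here are purely illustrative.
ker <- rbind(c(1.0, 0.0, 0.0),
             c(0.0, 1.0, 0.5),   # samples 2 and 3 are considered similar
             c(0.0, 0.5, 1.0))

X1 <- matrix(rnorm(300), ncol = 3)
X2 <- matrix(rnorm(300, mean = 1), ncol = 3)
X3 <- matrix(rnorm(300, mean = 1.2), ncol = 3)

# Run only if the backend package and the KMD wrapper documented here exist
if (requireNamespace("KMD", quietly = TRUE) &&
    exists("KMD", mode = "function")) {
  KMD(X1, X2, X3, kernel = ker)
}
```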
Target variable? | Numeric? | Categorical? | K-sample? |
No | Yes | No | Yes |
Given the pooled sample \(Z_1, \dots, Z_N\) and the corresponding sample memberships \(\Delta_1,\dots, \Delta_N\) let \(\mathcal{G}\) be a geometric graph on \(\mathcal{X}\) such that an edge between two points \(Z_i\) and \(Z_j\) in the pooled sample implies that \(Z_i\) and \(Z_j\) are close, e.g. \(K\)-nearest neighbor graph with \(K\ge 1\) or MST. Denote by \((Z_i,Z_j)\in\mathcal{E}(\mathcal{G})\) that there is an edge in \(\mathcal{G}\) connecting \(Z_i\) and \(Z_j\). Moreover, let \(o_i\) be the out-degree of \(Z_i\) in \(\mathcal{G}\). Then an estimator for the KMD \(\eta\) is defined as $$\hat{\eta} := \frac{\frac{1}{N} \sum_{i=1}^N \frac{1}{o_i} \sum_{j:(Z_i,Z_j)\in\mathcal{E}(\mathcal{G})} K(\Delta_i, \Delta_j) - \frac{1}{N(N-1)} \sum_{i\ne j} K(\Delta_i, \Delta_j)}{\frac{1}{N}\sum_{i=1}^N K(\Delta_i, \Delta_i) - \frac{1}{N(N-1)} \sum_{i\ne j} K(\Delta_i, \Delta_j)}.$$
Euclidean distances are used for computing the KNN graph (ties broken at random) and the MST.
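The estimator above can be sketched directly in base R. This is a minimal illustration with the discrete kernel and a deterministic k-NN graph (unlike the package, which breaks distance ties at random); kmd_hat and all variable names are illustrative, not part of the package:

```r
# Minimal sketch of the KMD estimator with the discrete kernel
# K(Delta_i, Delta_j) = 1{Delta_i = Delta_j} and a k-NN graph.
# For the k-NN graph every out-degree o_i equals k, so the first term
# is the mean kernel value over each point's k nearest neighbors.
kmd_hat <- function(Z, delta, k = 1) {
  N <- nrow(Z)
  D <- as.matrix(dist(Z))              # Euclidean inter-point distances
  diag(D) <- Inf                       # a point is not its own neighbor
  K <- outer(delta, delta, "==") * 1   # discrete kernel matrix
  # first term: average kernel value over the k nearest neighbors
  knn_term <- mean(sapply(seq_len(N), function(i) {
    nn <- order(D[i, ])[seq_len(k)]
    mean(K[i, nn])
  }))
  # off-diagonal mean: 1/(N(N-1)) * sum_{i != j} K(Delta_i, Delta_j)
  off_diag <- (sum(K) - sum(diag(K))) / (N * (N - 1))
  (knn_term - off_diag) / (mean(diag(K)) - off_diag)
}

set.seed(1)
Z <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))
delta <- rep(1:2, each = 50)
eta_hat <- kmd_hat(Z, delta, k = 5)  # close to 1: samples well separated
```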
For n.perm == 0, an asymptotic test based on the normal approximation of the null distribution is performed; for this, the KMD is standardized by its null mean and standard deviation. For n.perm > 0, a permutation test is performed, i.e. the observed KMD statistic is compared to the permutation KMD statistics.
The theoretical KMD of two distributions is zero if and only if the distributions coincide, and it is bounded above by one. Therefore, low values of the empirical KMD indicate similarity, and the test rejects for high values.
Huang and Sen (2023) recommend the \(k\)-NN graph for its flexibility, but the choice of \(k\) is not obvious. Based on the simulation results in the original article, the recommended values are \(k = 0.1 N\) for testing and \(k = 1\) for estimation. For increasing power, it is beneficial to choose large values of \(k\); for consistency of the test, \(k = o(N / \log(N))\) together with a continuous distribution of the inter-point distances is sufficient, i.e. \(k\) cannot be chosen too large compared to \(N\). In the context of estimating the KMD, on the other hand, choosing \(k\) is a bias-variance trade-off, with small values of \(k\) decreasing the bias and larger values of \(k\) decreasing the variance (for more details, see the discussion in Appendix D.3 of Huang and Sen, 2023).
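The two recommended choices of \(k\) can be sketched as follows; the variable names are illustrative, and the KMD() calls are guarded so they only run when the wrapper is available:

```r
# Two samples of 50 observations each, so the pooled sample size is N = 100
X1 <- matrix(rnorm(500), ncol = 10)
X2 <- matrix(rnorm(500, mean = 0.5), ncol = 10)
N <- nrow(X1) + nrow(X2)

k_test <- ceiling(0.1 * N)  # recommended for testing: k = 0.1 N
k_est  <- 1                 # recommended for estimation: k = 1

# Run only if the backend package and the KMD wrapper documented here exist
if (requireNamespace("KMD", quietly = TRUE) &&
    exists("KMD", mode = "function")) {
  KMD(X1, X2, n.perm = 100, k = k_test)  # permutation test
  KMD(X1, X2, k = k_est)                 # estimation only, no test
}
```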
This implementation is a wrapper around the functions KMD and KMD_test that adapts their input and output to match the other functions provided in this package. For more details, see KMD and KMD_test.
Huang, Z. and Sen, B. (2023). A Kernel Measure of Dissimilarity between \(M\) Distributions. Journal of the American Statistical Association, 0, 1-27. doi:10.1080/01621459.2023.2298036.
Huang, Z. (2022). KMD: Kernel Measure of Multi-Sample Dissimilarity. R package version 0.1.0, https://CRAN.R-project.org/package=KMD.
Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statistics Surveys, 18, 163-298. doi:10.1214/24-SS149.
MMD
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform KMD test
if(requireNamespace("KMD", quietly = TRUE)) {
KMD(X1, X2, n.perm = 100)
}