
DataSimilarity (version 0.1.1)

KMD: Kernel Measure of Multi-Sample Dissimilarity

Description

Calculates the kernel measure of multi-sample dissimilarity (KMD) and performs a permutation or asymptotic multi-sample test (Huang and Sen, 2023). The implementation here wraps the KMD and KMD_test functions from the KMD package.

Usage

KMD(X1, X2, ..., n.perm = 0, graph = "knn", k = ceiling(N/10), 
    kernel = "discrete", seed = 42)

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

Permutation or asymptotic p-value

estimate

Estimated KMD value

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

graph

Graph used for calculation

k

Number of neighbors used if graph is the KNN graph.

kernel

Kernel used for calculation

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

...

Optionally more datasets as matrices or data.frames

n.perm

Number of permutations for permutation test (default: 0, no permutation test performed).

graph

Graph used in calculation of KMD. Possible options are "knn" (default) and "mst".

k

Number of neighbors for construction of the k-nearest neighbor graph (default: ceiling(N/10), where N is the pooled sample size). Ignored for graph = "mst".

kernel

Kernel used in calculation of KMD. Can either be "discrete" (default) for use of the discrete kernel, or a kernel matrix with numbers of rows and columns corresponding to the number of datasets. In the latter case, the entry in the \(i\)-th row and \(j\)-th column is the kernel value \(k(i,j)\).

seed

Random seed (default: 42)

Applicability

Target variable?  Numeric?  Categorical?  K-sample?
No                Yes       No            Yes

Details

Given the pooled sample \(Z_1, \dots, Z_N\) and the corresponding sample memberships \(\Delta_1,\dots, \Delta_N\), let \(\mathcal{G}\) be a geometric graph on \(\mathcal{X}\) such that an edge between two points \(Z_i\) and \(Z_j\) in the pooled sample implies that \(Z_i\) and \(Z_j\) are close, e.g. the \(k\)-nearest neighbor graph with \(k\ge 1\) or the MST. Denote by \((Z_i,Z_j)\in\mathcal{E}(\mathcal{G})\) that there is an edge in \(\mathcal{G}\) connecting \(Z_i\) and \(Z_j\). Moreover, let \(o_i\) be the out-degree of \(Z_i\) in \(\mathcal{G}\). Then an estimator for the KMD \(\eta\) is defined as $$\hat{\eta} := \frac{\frac{1}{N} \sum_{i=1}^N \frac{1}{o_i} \sum_{j:(Z_i,Z_j)\in\mathcal{E}(\mathcal{G})} K(\Delta_i, \Delta_j) - \frac{1}{N(N-1)} \sum_{i\ne j} K(\Delta_i, \Delta_j)}{\frac{1}{N}\sum_{i=1}^N K(\Delta_i, \Delta_i) - \frac{1}{N(N-1)} \sum_{i\ne j} K(\Delta_i, \Delta_j)}.$$
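For the discrete kernel and a kNN graph, the estimator above can be computed by hand in a few lines of base R. The following is a minimal sketch: the function name kmd_hat and its arguments are illustrative and not part of the package, and ties are broken by index order rather than at random.

```r
# Minimal base-R sketch of the estimator above, using the discrete kernel
# K(a, b) = 1{a == b} and a kNN graph (out-degree o_i = k for every point).
kmd_hat <- function(Z, Delta, k = 1) {
  N <- nrow(Z)
  D <- as.matrix(dist(Z))   # Euclidean distances, as in the package
  diag(D) <- Inf            # exclude self-neighbors
  # k nearest neighbors of each point (ties broken by index order here)
  nn <- lapply(seq_len(N), function(i) order(D[i, ])[seq_len(k)])
  # Kernel matrix over sample memberships: entry (i, j) = K(Delta_i, Delta_j)
  Kmat <- outer(Delta, Delta, function(a, b) as.numeric(a == b))
  # Numerator terms: graph average minus all-pairs average
  t1 <- mean(vapply(seq_len(N), function(i) mean(Kmat[i, nn[[i]]]), numeric(1)))
  t2 <- (sum(Kmat) - sum(diag(Kmat))) / (N * (N - 1))  # mean over i != j
  # Denominator term: mean of K(Delta_i, Delta_i) (= 1 for the discrete kernel)
  t3 <- mean(diag(Kmat))
  (t1 - t2) / (t3 - t2)
}

set.seed(1)
Z <- rbind(matrix(rnorm(50), ncol = 2),
           matrix(rnorm(50, mean = 3), ncol = 2))
Delta <- rep(1:2, each = 25)
kmd_hat(Z, Delta, k = 3)  # close to 1 for well-separated samples
```

For well-separated samples nearly all graph neighbors share a membership, so the first numerator term approaches the denominator and \(\hat{\eta}\) approaches one.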

Euclidean distances are used for computing the KNN graph (ties broken at random) and the MST.

For n.perm == 0, an asymptotic test based on a normal approximation of the null distribution is performed: the KMD is standardized by its null mean and standard deviation. For n.perm > 0, a permutation test is performed, i.e. the observed KMD statistic is compared to the permutation distribution of the statistic.
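The two modes can be contrasted directly. This is a sketch only, assuming the DataSimilarity package is installed; the p-values will generally differ somewhat between the asymptotic approximation and the permutation distribution.

```r
# Sketch: asymptotic test (n.perm = 0) vs. permutation test (n.perm > 0).
if (requireNamespace("DataSimilarity", quietly = TRUE)) {
  set.seed(1)
  X1 <- matrix(rnorm(200), ncol = 2)
  X2 <- matrix(rnorm(200, mean = 1), ncol = 2)
  asym <- DataSimilarity::KMD(X1, X2, n.perm = 0)    # normal approximation
  perm <- DataSimilarity::KMD(X1, X2, n.perm = 100)  # permutation test
  c(asymptotic = asym$p.value, permutation = perm$p.value)
}
```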

The theoretical KMD is zero if and only if all distributions coincide, and it is bounded above by one. Therefore, low values of the empirical KMD indicate similarity, and the test rejects for high values.

Huang and Sen (2023) recommend the \(k\)-NN graph for its flexibility, but the choice of \(k\) is not obvious. Based on the simulation results in the original article, the recommended values are \(k = 0.1 N\) for testing and \(k = 1\) for estimation. Power increases with larger values of \(k\), but for consistency of the tests \(k = o(N / \log(N))\) together with a continuous distribution of the inter-point distances is sufficient, i.e. \(k\) cannot be chosen too large relative to \(N\). In the context of estimating the KMD, on the other hand, choosing \(k\) is a bias-variance trade-off: small values of \(k\) decrease the bias and larger values decrease the variance (for details, see the discussion in Appendix D.3 of Huang and Sen (2023)).
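The two recommended choices of \(k\) can be sketched as follows, assuming the DataSimilarity package is installed; the sample sizes and shift are illustrative.

```r
# Sketch: k = 1 for estimation vs. k = ceiling(0.1 * N) for testing,
# following the recommendations above.
if (requireNamespace("DataSimilarity", quietly = TRUE)) {
  set.seed(1)
  X1 <- matrix(rnorm(200), ncol = 2)        # 100 observations each
  X2 <- matrix(rnorm(200, mean = 0.5), ncol = 2)
  N <- nrow(X1) + nrow(X2)                  # pooled sample size
  est <- DataSimilarity::KMD(X1, X2, k = 1)$estimate   # estimation
  tst <- DataSimilarity::KMD(X1, X2, k = ceiling(0.1 * N))  # testing
  list(estimate_k1 = est, p_value_k0.1N = tst$p.value)
}
```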

This implementation is a wrapper around the functions KMD and KMD_test that modifies their input and output to match the other functions provided in this package. For more details, see KMD and KMD_test.

References

Huang, Z. and Sen, B. (2023). A Kernel Measure of Dissimilarity between \(M\) Distributions. Journal of the American Statistical Association, 1-27. doi:10.1080/01621459.2023.2298036.

Huang, Z. (2022). KMD: Kernel Measure of Multi-Sample Dissimilarity. R package version 0.1.0, https://CRAN.R-project.org/package=KMD.

Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statistics Surveys, 18, 163-298. doi:10.1214/24-SS149.

See Also

MMD

Examples

# Draw some data (seed set for reproducibility of the data draw)
set.seed(1234)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform KMD test 
if(requireNamespace("KMD", quietly = TRUE)) {
  KMD(X1, X2, n.perm = 100)
}
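Since additional samples can be passed via the ... argument, the test extends to more than two datasets. The following sketch redraws three samples and also passes a custom kernel matrix; the values in Kmat are an illustrative (positive semi-definite) choice, not a package default.

```r
# Three-sample comparison: first with the default discrete kernel, then with
# a custom 3 x 3 kernel matrix whose entry (i, j) is the kernel value k(i, j).
if (requireNamespace("KMD", quietly = TRUE)) {
  set.seed(42)
  X1 <- matrix(rnorm(1000), ncol = 10)
  X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
  X3 <- matrix(rnorm(1000, mean = 1), ncol = 10)
  res3 <- KMD(X1, X2, X3, n.perm = 100)
  Kmat <- matrix(c(1,   0.5, 0,
                   0.5, 1,   0.5,
                   0,   0.5, 1), nrow = 3, byrow = TRUE)
  resK <- KMD(X1, X2, X3, n.perm = 100, kernel = Kmat)
}
```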
