SC: Graph-Based Multi-Sample Test

Description

Performs the graph-based multi-sample test for high-dimensional data proposed by Song and Chen (2022). This function is a wrapper around gtestsmulti from the gTestsMulti package.

Usage

SC(X1, X2, ..., n.perm = 0, dist.fun = stats::dist, graph.fun = MST, 
    dist.args = NULL, graph.args = NULL, type = "S", seed = 42)

Value

An object of class htest with the following components:

statistic: Observed value of the test statistic
p.value: Permutation p value (only if n.perm > 0)
estimate: Estimated KMD value
alternative: The alternative hypothesis
method: Description of the test
data.name: The dataset names

Arguments

X1: First dataset as matrix or data.frame
X2: Second dataset as matrix or data.frame
...: Optionally more datasets as matrices or data.frames
n.perm: Number of permutations for permutation test (default: 0, no permutation test performed)
dist.fun: Function for calculating a distance matrix on the pooled dataset (default: stats::dist, Euclidean distance).
graph.fun: Function for calculating a similarity graph using the distance matrix on the pooled sample (default: MST, Minimum Spanning Tree).
dist.args: Named list of further arguments passed to dist.fun (default: NULL).
graph.args: Named list of further arguments passed to graph.fun (default: NULL).
type: Character specifying the test statistic to use. Possible options are "S" (default) and "SA". See details.
seed: Random seed (default: 42)
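
The interplay of dist.fun and dist.args can be pictured with the following minimal sketch. It assumes that the pooled sample is passed as the first argument and that dist.args is appended to the call; the exact internal hand-off in SC is an assumption and is not taken from the package source.

```r
# Sketch of the assumed hand-off: pool the samples, then call dist.fun
# with the entries of dist.args appended as named arguments.
set.seed(42)
X1 <- matrix(rnorm(20), ncol = 2)
X2 <- matrix(rnorm(20, mean = 0.5), ncol = 2)
pooled <- rbind(X1, X2)

dist.fun <- stats::dist
dist.args <- list(method = "manhattan")

# Equivalent to stats::dist(pooled, method = "manhattan");
# with dist.args = NULL this reduces to stats::dist(pooled)
D <- do.call(dist.fun, c(list(pooled), dist.args))
dim(as.matrix(D))
```

Under this reading, calling SC with dist.args = list(method = "manhattan") would switch the default Euclidean distance to the Manhattan distance.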

Applicability

Target variable?	Numeric?	Categorical?	K-sample?
No	Yes	No	Yes

Details

Two multi-sample test statistics are defined by Song and Chen (2022) based on a similarity graph. The first one is defined as $$S = S_W + S_B, \text{ where}$$ $$S_W = (R_W - \text{E}(R_W))^T \Sigma_W^{-1}(R_W - \text{E}(R_W)),$$ $$S_B = (R_B - \text{E}(R_B))^T \Sigma_B^{-1}(R_B - \text{E}(R_B)),$$ with $R_W$ denoting the vector of within-sample edge counts and $R_B$ the vector of between-sample edge counts. $\Sigma_W$ and $\Sigma_B$ are the covariance matrices of $R_W$ and $R_B$, respectively. Expectations and covariance matrices are calculated under the null hypothesis.

The second statistic is defined as $$S_A = (R_A - \text{E}(R_A))^T \Sigma_A^{-1}(R_A - \text{E}(R_A)), $$ where $R_A$ is the vector of all linearly independent edge counts, i.e., the edge counts for all pairs of samples except the last pair $k-1$ and $k$, and $\Sigma_A$ is the covariance matrix of $R_A$ under the null hypothesis.

This implementation is a wrapper around the function gtestsmulti that modifies the input and output of that function to match the other functions provided in this package. For more details, see the documentation of gtestsmulti.
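
To make the edge counts concrete, the following sketch builds a minimum spanning tree on a pooled sample of three datasets and tabulates within-sample and between-sample edges. It is an illustration only: prim_mst is a hypothetical base-R helper written for this example (a plain Prim's algorithm), not the MST function used by this package, and the standardization by the null expectation and covariance is not shown.

```r
# Hypothetical helper (not part of the package): Prim's algorithm on a
# full distance matrix D; returns an (n - 1) x 2 matrix of edge endpoints.
prim_mst <- function(D) {
  n <- nrow(D)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  edges <- matrix(NA_integer_, nrow = n - 1, ncol = 2)
  for (e in seq_len(n - 1)) {
    # Distances from tree nodes (rows) to non-tree nodes (columns)
    cand <- D[in_tree, !in_tree, drop = FALSE]
    idx <- which(cand == min(cand), arr.ind = TRUE)[1, ]
    from <- which(in_tree)[idx[1]]
    to <- which(!in_tree)[idx[2]]
    edges[e, ] <- c(from, to)
    in_tree[to] <- TRUE
  }
  edges
}

set.seed(1)
X1 <- matrix(rnorm(60), ncol = 2)
X2 <- matrix(rnorm(60, mean = 1), ncol = 2)
X3 <- matrix(rnorm(60, mean = 2), ncol = 2)
labels <- rep(1:3, times = c(nrow(X1), nrow(X2), nrow(X3)))

D <- as.matrix(stats::dist(rbind(X1, X2, X3)))
edges <- prim_mst(D)

# Sample labels of both endpoints, ordered so that (1, 2) and (2, 1) coincide
a <- pmin(labels[edges[, 1]], labels[edges[, 2]])
b <- pmax(labels[edges[, 1]], labels[edges[, 2]])
counts <- table(factor(a, levels = 1:3), factor(b, levels = 1:3))

R_W <- diag(counts)               # within-sample edge counts, one per sample
R_B <- counts[upper.tri(counts)]  # between-sample counts for pairs (1,2), (1,3), (2,3)
R_W
R_B
```

Every edge of the tree is either a within-sample or a between-sample edge, so the counts in R_W and R_B sum to the number of edges, which is the pooled sample size minus one.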

References

Song, H. and Chen, H. (2022). New graph-based multi-sample tests for high-dimensional and non-Euclidean data. doi:10.48550/arXiv.2205.13787

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform Song and Chen test 
if(requireNamespace("gTestsMulti", quietly = TRUE)) {
  SC(X1, X2, n.perm = 100)
  SC(X1, X2, n.perm = 100, type = "SA")
}
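
Since the test is designed for more than two samples, a three-sample call may be a useful complement to the example above. The sketch below assumes that SC is exported by the DataSimilarity package (the package name is not stated on this page) and only runs if both that package and gTestsMulti are installed. The components accessed are those listed under Value.

```r
set.seed(42)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
X3 <- matrix(rnorm(1000, mean = 1), ncol = 10)

# Assumption: SC is provided by the DataSimilarity package
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("gTestsMulti", quietly = TRUE)) {
  # Further datasets are passed through the ... argument
  res <- DataSimilarity::SC(X1, X2, X3, n.perm = 100)
  print(res$statistic)  # observed test statistic
  print(res$p.value)    # permutation p value, available because n.perm > 0
}
```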
