CCS: Weighted Edge-Count Two-Sample Test

Description

Performs the weighted edge-count two-sample test for multivariate data proposed by Chen, Chen and Su (2018). The test is designed for comparing two samples with unequal sample sizes. This function is a wrapper around g.tests from the gTests package.

Usage

CCS(X1, X2, dist.fun = stats::dist, graph.fun = MST, n.perm = 0, 
    dist.args = NULL, graph.args = NULL, seed = 42)

Value

An object of class htest with the following components:

statistic: Observed value of the test statistic
p.value: Asymptotic or permutation p value
alternative: The alternative hypothesis
method: Description of the test
data.name: The dataset names

Arguments

X1: First dataset as matrix or data.frame
X2: Second dataset as matrix or data.frame
dist.fun: Function for calculating a distance matrix on the pooled dataset (default: stats::dist, Euclidean distance).
graph.fun: Function for calculating a similarity graph using the distance matrix on the pooled sample (default: MST, Minimum Spanning Tree).
n.perm: Number of permutations for permutation test (default: 0, asymptotic test is performed).
dist.args: Named list of further arguments passed to dist.fun (default: NULL).
graph.args: Named list of further arguments passed to graph.fun (default: NULL).
seed: Random seed (default: 42)
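
As an illustration of how dist.args might be used, the sketch below switches the distance from Euclidean to Manhattan. It assumes that the named list is forwarded to dist.fun roughly like a do.call on the pooled sample (an assumption about the internals, not something this page states) and that CCS is exported by the DataSimilarity package. The method argument itself belongs to stats::dist.

```r
set.seed(1)
X1 <- matrix(rnorm(200), ncol = 4)              # 50 observations
X2 <- matrix(rnorm(120, mean = 0.5), ncol = 4)  # 30 observations

# Named list of extra arguments for dist.fun (here: stats::dist)
dist.args <- list(method = "manhattan")

# Assumed equivalent of what happens internally: a distance object
# computed on the pooled sample with the extra arguments applied
D <- do.call(stats::dist, c(list(rbind(X1, X2)), dist.args))

# The same choice passed through CCS (package name is an assumption)
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("gTests", quietly = TRUE)) {
  res <- DataSimilarity::CCS(X1, X2, dist.args = dist.args)
  print(res)
}
```

graph.args works the same way for graph.fun: any named arguments the graph function accepts can be supplied as a list.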

Applicability

Target variable?	Numeric?	Categorical?	K-sample?
No	Yes	No	No

Details

The test is an enhancement of the Friedman-Rafsky test (original edge-count test) that aims at improving the test's power for unequal sample sizes by weighting. The test statistic is given as $$Z_w = \frac{R_w - \text{E}_{H_0}(R_w)}{\sqrt{\text{Var}_{H_0}(R_w)}}, \text{ where}$$ $$R_w = \frac{n_1}{n_1+n_2} R_1 + \frac{n_2}{n_1+n_2} R_2$$ and $R_1$ and $R_2$ denote the number of edges in the similarity graph connecting points within the first and second sample $X_1$ and $X_2$, respectively.

High values of the test statistic indicate dissimilarity of the datasets: when many edges connect points within the same sample, points are more similar within each dataset than between the datasets.
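
To make the edge counts concrete, the following base R sketch builds a minimum spanning tree on the pooled sample and counts the within-sample edges $R_1$ and $R_2$. The helper prim_mst is hypothetical and written only for this illustration; it is not the MST function used by CCS, and the sketch stops at the raw counts rather than reproducing the standardized statistic computed by g.tests.

```r
set.seed(1)
X1 <- matrix(rnorm(60), ncol = 3)            # n1 = 20
X2 <- matrix(rnorm(30, mean = 1), ncol = 3)  # n2 = 10

pooled <- rbind(X1, X2)
labels <- rep(c(1L, 2L), c(nrow(X1), nrow(X2)))
D <- as.matrix(stats::dist(pooled))          # Euclidean distances

# Hypothetical helper: Prim's algorithm on a full distance matrix.
# Returns an (N - 1) x 2 matrix of node indices, one row per tree edge.
prim_mst <- function(D) {
  N <- nrow(D)
  in_tree <- c(TRUE, rep(FALSE, N - 1))      # start the tree at node 1
  best_dist <- D[1, ]                        # cheapest link of each node to the tree
  best_from <- rep(1L, N)                    # tree node realizing that link
  edges <- matrix(NA_integer_, nrow = N - 1, ncol = 2)
  for (k in seq_len(N - 1)) {
    cand <- which(!in_tree)
    j <- cand[which.min(best_dist[cand])]    # closest node outside the tree
    edges[k, ] <- c(best_from[j], j)
    in_tree[j] <- TRUE
    closer <- !in_tree & D[j, ] < best_dist  # links improved by adding j
    best_dist[closer] <- D[j, closer]
    best_from[closer] <- j
  }
  edges
}

edges <- prim_mst(D)
lab_a <- labels[edges[, 1]]
lab_b <- labels[edges[, 2]]

R1 <- sum(lab_a == 1L & lab_b == 1L)         # edges within the first sample
R2 <- sum(lab_a == 2L & lab_b == 2L)         # edges within the second sample
R_between <- sum(lab_a != lab_b)             # edges connecting the two samples

c(R1 = R1, R2 = R2, between = R_between)
```

Every tree edge falls into exactly one of the three groups, so the counts sum to the number of pooled observations minus one.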

For n.perm = 0, an asymptotic test using the asymptotic normal approximation of the null distribution is performed. For n.perm > 0, a permutation test is performed.
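
The two modes can be compared with a call like the one below. This is a sketch that assumes CCS is exported by the DataSimilarity package; the number of permutations (500) is an arbitrary choice for illustration. The first lines show how an upper-tail p value follows from a standardized statistic under a normal approximation, assuming a one-sided test that rejects for large values (consistent with the interpretation above, but not confirmed here against g.tests); the value of z is hypothetical.

```r
# Upper-tail normal p value for a hypothetical standardized statistic
z <- 2.1
p_upper <- pnorm(z, lower.tail = FALSE)

set.seed(1)
X1 <- matrix(rnorm(300), ncol = 3)             # n1 = 100
X2 <- matrix(rnorm(90, mean = 0.5), ncol = 3)  # n2 = 30

if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("gTests", quietly = TRUE)) {
  # Asymptotic test (default)
  res_asym <- DataSimilarity::CCS(X1, X2, n.perm = 0)
  # Permutation test; seed fixes the random permutations
  res_perm <- DataSimilarity::CCS(X1, X2, n.perm = 500, seed = 42)
  print(c(asymptotic = res_asym$p.value, permutation = res_perm$p.value))
}
```

A permutation test avoids relying on the normal approximation at the cost of extra computation, which may matter most when one of the samples is small.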

This implementation is a wrapper around the function g.tests that modifies the input and output of that function to match the other functions provided in this package. For more details, see the documentation of g.tests.

References

Chen, H., Chen, X. and Su, Y. (2018). A Weighted Edge-Count Two-Sample Test for Multivariate and Object Data. Journal of the American Statistical Association, 113(523), 1146-1155, doi:10.1080/01621459.2017.1307757

Chen, H., and Zhang, J. (2017). gTests: Graph-Based Two-Sample Tests. R package version 0.2, https://CRAN.R-project.org/package=gTests.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform weighted edge-count test
if(requireNamespace("gTests", quietly = TRUE)) {
  CCS(X1, X2)
}
