Performs the weighted edge-count two-sample test for multivariate data proposed by Chen, Chen and Su (2018). The test is intended for comparing two samples with unequal sample sizes. This implementation uses the g.tests function from the gTests package.
CCS_cat(X1, X2, dist.fun, agg.type, graph.type = "mstree", K = 1, n.perm = 0,
seed = 42)
An object of class htest with the following components:
statistic: Observed value of the test statistic
p.value: Asymptotic or permutation p value
alternative: The alternative hypothesis
method: Description of the test
data.name: The dataset names
X1: First dataset as matrix or data.frame
X2: Second dataset as matrix or data.frame
dist.fun: Function for calculating a distance matrix on the pooled dataset.
agg.type: Character giving the method for aggregating over possible similarity graphs. Options are "u" for the union of possible similarity graphs and "a" for averaging over test statistics calculated on possible similarity graphs.
graph.type: Character specifying which similarity graph to use. Possible options are "mstree" (default, Minimum Spanning Tree) and "nnlink" (Nearest Neighbor Graph).
K: Parameter for graph (default: 1). If graph.type = "mstree", a K-MST is constructed (K = 1 is the classical MST). If graph.type = "nnlink", K gives the number of neighbors considered in the K-NN graph.
n.perm: Number of permutations for permutation test (default: 0, asymptotic test is performed).
seed: Random seed (default: 42)
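The example at the end of this page passes dist.fun = function(x, y) sum(x != y), which counts the coordinates in which two observations differ (the Hamming distance). As a rough sketch of what such a function yields on a pooled dataset, the base R code below builds the full pairwise distance matrix by hand. The helper name pooled_dist and the toy matrices A and B are hypothetical, and how CCS_cat applies dist.fun internally is not documented here.

```r
# Hamming distance between two observations (rows), as in the example call
hamming <- function(x, y) sum(x != y)

# Hypothetical helper: apply a pairwise distance to all rows of the pooled data
pooled_dist <- function(X1, X2, dist.fun) {
  pooled <- rbind(as.matrix(X1), as.matrix(X2))
  n <- nrow(pooled)
  D <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      D[i, j] <- dist.fun(pooled[i, ], pooled[j, ])
    }
  }
  D
}

# Toy categorical data: two observations in A, one in B
A <- matrix(c(1, 1, 2,
              1, 3, 2), nrow = 2, byrow = TRUE)
B <- matrix(c(4, 3, 2), nrow = 1)

D <- pooled_dist(A, B, hamming)
D
```

With categorical data such a matrix typically contains many tied distances.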
Target variable? | Numeric? | Categorical? | K-sample? |
No | No | Yes | No |
The test is an enhancement of the Friedman-Rafsky test (original edge-count test) that aims at improving the test's power for unequal sample sizes by weighting. The test statistic is given as $$Z_w = \frac{R_w - \text{E}_{H_0}(R_w)}{\sqrt{\text{Var}_{H_0}(R_w)}}, \text{ where}$$ $$R_w = \frac{n_1}{n_1+n_2} R_1 + \frac{n_2}{n_1+n_2} R_2$$ and \(R_1\) and \(R_2\) denote the number of edges in the similarity graph connecting points within the first and second sample \(X_1\) and \(X_2\), respectively. For discrete data, the similarity graph used in the test is not necessarily unique. This can be solved by either taking a union of all optimal similarity graphs or averaging the test statistics over all optimal similarity graphs. For details, see Zhang and Chen (2022).
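To make the definitions of R_1, R_2 and R_w concrete, the following base R sketch counts within-sample edges for a small hand-made graph and combines them with the weights from the formula above. The edge list, the labels and the helper name edge_counts are made up for illustration. The standardization by the null mean and variance is not reproduced here.

```r
# Sample labels of the pooled observations: 1 = first sample, 2 = second sample
labels <- c(1, 1, 1, 2, 2, 2, 2, 2)

# Hypothetical similarity graph, one edge per row (indices into the pooled data)
edges <- rbind(c(1, 2), c(2, 3), c(3, 4), c(4, 5),
               c(5, 6), c(6, 7), c(7, 8))

# Hypothetical helper: count edges whose endpoints lie in the same sample
edge_counts <- function(edges, labels) {
  from <- labels[edges[, 1]]
  to <- labels[edges[, 2]]
  c(R1 = sum(from == 1 & to == 1),  # edges within the first sample
    R2 = sum(from == 2 & to == 2))  # edges within the second sample
}

counts <- edge_counts(edges, labels)
n1 <- sum(labels == 1)
n2 <- sum(labels == 2)

# Weighted edge count as written in the formula above
Rw <- n1 / (n1 + n2) * counts[["R1"]] + n2 / (n1 + n2) * counts[["R2"]]
Rw
```

In this toy graph only the edge between observations 3 and 4 connects the two samples, so it contributes to neither R_1 nor R_2.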
For n.perm = 0, an asymptotic test using the asymptotic normal approximation of the null distribution is performed. For n.perm > 0, a permutation test is performed.
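The two modes can be compared side by side. The sketch below assumes that CCS_cat is provided by the DataSimilarity package and that gTests is installed; both checks are guarded so the code is skipped otherwise. The choice of 1000 permutations is arbitrary, and the p value is read from the p.value component that htest objects conventionally carry.

```r
# Assumption: CCS_cat is exported by the DataSimilarity package
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("gTests", quietly = TRUE)) {
  set.seed(1)
  X1cat <- matrix(sample(1:4, 300, replace = TRUE), ncol = 3)
  X2cat <- matrix(sample(1:4, 300, replace = TRUE, prob = 1:4), ncol = 3)
  hamming <- function(x, y) sum(x != y)

  # n.perm = 0 (default): asymptotic normal approximation
  res_asym <- DataSimilarity::CCS_cat(X1cat, X2cat, dist.fun = hamming,
                                      agg.type = "a")

  # n.perm > 0: permutation test; seed fixes the random permutations
  res_perm <- DataSimilarity::CCS_cat(X1cat, X2cat, dist.fun = hamming,
                                      agg.type = "a", n.perm = 1000, seed = 42)

  print(c(asymptotic = res_asym$p.value, permutation = res_perm$p.value))
}
```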
This implementation is a wrapper around the function g.tests that modifies the input and output of that function to match the other functions provided in this package. For more details, see the documentation of g.tests.
Chen, H., Chen, X. and Su, Y. (2018). A Weighted Edge-Count Two-Sample Test for Multivariate and Object Data. Journal of the American Statistical Association, 113(523), 1146-1155, doi:10.1080/01621459.2017.1307757.
Zhang, J. and Chen, H. (2022). Graph-Based Two-Sample Tests for Data with Repeated Observations. Statistica Sinica, 32, 391-415, doi:10.5705/ss.202019.0116.
Chen, H. and Zhang, J. (2017). gTests: Graph-Based Two-Sample Tests. R package version 0.2, https://CRAN.R-project.org/package=gTests.
Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv., 18, 163-298, doi:10.1214/24-SS149.
FR_cat for the original edge-count test, CF_cat for the generalized edge-count test, ZC_cat for the maxtype edge-count test, gTests_cat for performing all these edge-count tests at once, CCS, FR, CF, ZC, and gTests for versions of the tests for continuous data, and SH for performing the Schilling-Henze nearest neighbor test.
# Draw some data
X1cat <- matrix(sample(1:4, 300, replace = TRUE), ncol = 3)
X2cat <- matrix(sample(1:4, 300, replace = TRUE, prob = 1:4), ncol = 3)
# Perform weighted edge-count test
if (requireNamespace("gTests", quietly = TRUE)) {
  CCS_cat(X1cat, X2cat, dist.fun = function(x, y) sum(x != y), agg.type = "a")
}