Performs the generalized edge-count two-sample test for multivariate data proposed by Chen and Friedman (2017). This implementation relies on the g.tests function from the gTests package.
CF(X1, X2, dist.fun = stats::dist, graph.fun = MST, n.perm = 0,
dist.args = NULL, graph.args = NULL, seed = NULL)
An object of class htest with the following components:
statistic: Observed value of the test statistic
parameter: Degrees of freedom for \(\chi^2\) distribution under \(H_0\) (only for asymptotic test)
p.value: Asymptotic or permutation p value
alternative: The alternative hypothesis
method: Description of the test
data.name: The dataset names
X1: First dataset as matrix or data.frame
X2: Second dataset as matrix or data.frame
dist.fun: Function for calculating a distance matrix on the pooled dataset (default: stats::dist, Euclidean distance).
graph.fun: Function for calculating a similarity graph using the distance matrix on the pooled sample (default: MST, Minimum Spanning Tree).
n.perm: Number of permutations for permutation test (default: 0, asymptotic test is performed).
dist.args: Named list of further arguments passed to dist.fun (default: NULL).
graph.args: Named list of further arguments passed to graph.fun (default: NULL).
seed: Random seed (default: NULL). A random seed will only be set if one is provided.
Target variable? | Numeric? | Categorical? | K-sample? |
No | Yes | No | No |
The test is an enhancement of the Friedman-Rafsky test (original edge-count test) that aims at detecting both location and scale alternatives. The test statistic is given as $$S = (R_1 - \mu_1, R_2 - \mu_2)\Sigma^{-1} \binom{R_1 - \mu_1}{R_2 - \mu_2}, \text{ where}$$ \(R_1\) and \(R_2\) denote the number of edges in the similarity graph connecting points within the first and second sample \(X_1\) and \(X_2\), respectively, \(\mu_1 = \text{E}_{H_0}(R_1)\), \(\mu_2 = \text{E}_{H_0}(R_2)\) and \(\Sigma\) is the covariance matrix of \(R_1\) and \(R_2\) under the null.
High values of the test statistic indicate dissimilarity of the datasets: a high number of edges connecting points within the same sample means that points are more similar within the datasets than between them.
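The quantities entering \(S\) can be illustrated with a short base R sketch. This is a minimal illustration under stated assumptions, not the g.tests implementation: prim_mst and count_within are hypothetical helpers written only for this example, and \(\mu_1\), \(\mu_2\) and \(\Sigma\) are estimated by permuting the sample labels, whereas Chen and Friedman (2017) derive closed-form expressions for them.

```r
set.seed(1)
X1 <- matrix(rnorm(40), ncol = 2)            # 20 points from N(0, I)
X2 <- matrix(rnorm(40, mean = 1), ncol = 2)  # 20 points, shifted mean
pooled <- rbind(X1, X2)
labels <- rep(1:2, times = c(nrow(X1), nrow(X2)))
D <- as.matrix(stats::dist(pooled))          # Euclidean distances

# Hypothetical helper: minimum spanning tree via Prim's algorithm.
# Returns an (n - 1) x 2 matrix of node indices, one row per edge.
prim_mst <- function(D) {
  n <- nrow(D)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  edges <- matrix(NA_integer_, nrow = n - 1, ncol = 2)
  for (k in seq_len(n - 1)) {
    sub <- D[in_tree, !in_tree, drop = FALSE]
    idx <- which(sub == min(sub), arr.ind = TRUE)[1, ]
    edges[k, ] <- c(which(in_tree)[idx[1]], which(!in_tree)[idx[2]])
    in_tree[edges[k, 2]] <- TRUE
  }
  edges
}

# Hypothetical helper: number of edges within sample 1 (R1) and sample 2 (R2).
count_within <- function(edges, labels) {
  a <- labels[edges[, 1]]
  b <- labels[edges[, 2]]
  c(R1 = sum(a == 1 & b == 1), R2 = sum(a == 2 & b == 2))
}

edges <- prim_mst(D)
R_obs <- count_within(edges, labels)

# Assumption of this sketch: estimate the null mean and covariance of
# (R1, R2) by permuting the labels on the fixed graph.
perm <- t(replicate(2000, count_within(edges, sample(labels))))
mu <- colMeans(perm)
Sigma <- stats::cov(perm)

# Quadratic form S = (R - mu)' Sigma^{-1} (R - mu)
d <- R_obs - mu
S <- as.numeric(t(d) %*% solve(Sigma, d))
S
```

Because the graph is kept fixed and only the labels are permuted, the estimated \(\mu\) and \(\Sigma\) describe the null distribution of the edge counts for this particular graph.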
For n.perm = 0, an asymptotic test using the asymptotic \(\chi^2\) approximation of the null distribution is performed. For n.perm > 0, a permutation test is performed.
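The two ways of obtaining a p value can be sketched as follows. The numbers are purely illustrative: S_obs is a hypothetical observed statistic, and S_perm is simulated as a stand-in for statistics recomputed under permuted sample labels. The sketch assumes the \(\chi^2\) limit with 2 degrees of freedom given by Chen and Friedman (2017), and it uses one common convention for the permutation p value, which need not match what g.tests does internally.

```r
# Hypothetical observed test statistic (illustration only)
S_obs <- 7.3

# Asymptotic test: upper tail of a chi-square distribution with 2 df
p_asym <- stats::pchisq(S_obs, df = 2, lower.tail = FALSE)

# Permutation test: proportion of permuted statistics at least as large
# as the observed one. S_perm is simulated here as a placeholder for
# statistics computed on label-permuted data.
set.seed(1)
S_perm <- stats::rchisq(999, df = 2)
p_perm <- mean(S_perm >= S_obs)

c(asymptotic = p_asym, permutation = p_perm)
```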
This implementation is a wrapper around the function g.tests that modifies the input and output of that function to match the other functions provided in this package. For more details, see the documentation of g.tests.
Chen, H. and Friedman, J.H. (2017). A New Graph-Based Two-Sample Test for Multivariate and Object Data. Journal of the American Statistical Association, 112(517), 397-409. doi:10.1080/01621459.2016.1147356
Chen, H., and Zhang, J. (2017). gTests: Graph-Based Two-Sample Tests. R package version 0.2, https://CRAN.R-project.org/package=gTests.
Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149
FR for the original edge-count test, CCS for the weighted edge-count test, ZC for the maxtype edge-count test, gTests for performing all these edge-count tests at once, SH for performing the Schilling-Henze nearest neighbor test, and CCS_cat, FR_cat, CF_cat, ZC_cat, and gTests_cat for versions of the test for categorical data
set.seed(1234)
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform generalized edge-count test
if (requireNamespace("gTests", quietly = TRUE)) {
  # Using MST
  CF(X1, X2)
  # Using 5-MST
  CF(X1, X2, graph.args = list(K = 5))
}
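The remaining arguments can be combined in the same way. The following sketch requests a permutation test with Manhattan distance and a fixed seed. It assumes that CF is already available in the session (the exists() guard is an addition for this sketch, not part of the package examples) and that method = "manhattan" is forwarded to stats::dist through dist.args, as the argument description suggests.

```r
set.seed(1234)
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

res <- NULL
# Guard added for this sketch: run only if CF and gTests are available
if (exists("CF", mode = "function") &&
    requireNamespace("gTests", quietly = TRUE)) {
  # Permutation test (100 permutations) on Manhattan distances
  res <- CF(X1, X2, n.perm = 100,
            dist.args = list(method = "manhattan"), seed = 42)
  # htest objects store the p value in the p.value component
  print(res$p.value)
}
```

Supplying seed makes the permutation p value reproducible across calls; with seed = NULL no seed is set.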