DataSimilarity (version 0.1.1)

CF: Generalized Edge-Count Test

Description

Performs the generalized edge-count two-sample test for multivariate data proposed by Chen and Friedman (2017). The implementation here uses the g.tests implementation from the gTests package.

Usage

CF(X1, X2, dist.fun = stats::dist, graph.fun = MST, n.perm = 0, 
    dist.args = NULL, graph.args = NULL, seed = 42)

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

parameter

Degrees of freedom for the $\chi^2$ distribution under H0 (only for asymptotic test)

p.value

Asymptotic or permutation p value

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

dist.fun

Function for calculating a distance matrix on the pooled dataset (default: stats::dist, Euclidean distance).

graph.fun

Function for calculating a similarity graph using the distance matrix on the pooled sample (default: MST, Minimum Spanning Tree).

n.perm

Number of permutations for permutation test (default: 0, asymptotic test is performed).

dist.args

Named list of further arguments passed to dist.fun (default: NULL).

graph.args

Named list of further arguments passed to graph.fun (default: NULL).

seed

Random seed (default: 42)
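
As a brief illustration of the dist.args argument, the following sketch switches the distance from the Euclidean default to the Manhattan distance. It assumes that dist.args is forwarded to stats::dist as described above, so that method = "manhattan" (a standard argument of stats::dist) takes effect. The simulated data and the name res_manhattan are purely illustrative.

```r
# Sketch: pass extra arguments to the distance function via dist.args.
# Assumes the DataSimilarity and gTests packages are installed.
set.seed(1)
X1 <- matrix(rnorm(200), ncol = 4)
X2 <- matrix(rnorm(200, sd = 2), ncol = 4)  # same location, larger scale

res_manhattan <- NULL
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("gTests", quietly = TRUE)) {
  res_manhattan <- DataSimilarity::CF(
    X1, X2,
    dist.fun = stats::dist,
    dist.args = list(method = "manhattan")  # forwarded to stats::dist
  )
  print(res_manhattan)
}
```

The same pattern should apply to graph.args, which is forwarded to graph.fun.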

Applicability

Target variable?   Numeric?   Categorical?   K-sample?
No                 Yes        No             No

Details

The test is an enhancement of the Friedman-Rafsky test (the original edge-count test) that aims at detecting both location and scale alternatives. The test statistic is given as

$$S = (R_1 - \mu_1,\ R_2 - \mu_2)\, \Sigma^{-1} \begin{pmatrix} R_1 - \mu_1 \\ R_2 - \mu_2 \end{pmatrix},$$

where $R_1$ and $R_2$ denote the numbers of edges in the similarity graph connecting points within the first and second sample, X1 and X2, respectively, $\mu_1 = E_{H_0}(R_1)$, $\mu_2 = E_{H_0}(R_2)$, and $\Sigma$ is the covariance matrix of $R_1$ and $R_2$ under the null hypothesis.

High values of the test statistic indicate dissimilarity of the datasets: a large number of edges connecting points within the same sample means that points are more similar within the datasets than between the datasets.
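
To make the construction of the statistic concrete, here is a minimal base R sketch. It is not the gTests implementation: the helpers prim_mst and count_within are hypothetical names introduced for illustration, and the null mean and covariance of (R1, R2) are estimated by permuting the sample labels instead of using analytic formulas. The p value uses a $\chi^2$ distribution with 2 degrees of freedom, the asymptotic null distribution given by Chen and Friedman (2017).

```r
# Illustrative sketch only; helper names are hypothetical.
set.seed(1)
n1 <- 30; n2 <- 30
X1 <- matrix(rnorm(n1 * 2), ncol = 2)
X2 <- matrix(rnorm(n2 * 2, mean = 1), ncol = 2)
labels <- rep(c(1, 2), c(n1, n2))
D <- as.matrix(dist(rbind(X1, X2)))  # Euclidean distances on the pooled sample

# Prim's algorithm: returns the MST as a two-column matrix of node indices
prim_mst <- function(D) {
  N <- nrow(D)
  in_tree <- c(TRUE, rep(FALSE, N - 1))
  best_dist <- D[1, ]        # cheapest known connection of each node to the tree
  best_from <- rep(1L, N)    # tree node realizing that connection
  edges <- matrix(NA_integer_, nrow = N - 1, ncol = 2)
  for (k in seq_len(N - 1)) {
    cand <- which(!in_tree)
    j <- cand[which.min(best_dist[cand])]
    edges[k, ] <- c(best_from[j], j)
    in_tree[j] <- TRUE
    closer <- !in_tree & D[j, ] < best_dist
    best_dist[closer] <- D[j, closer]
    best_from[closer] <- j
  }
  edges
}

# Number of edges with both endpoints in sample 1 (R1) or in sample 2 (R2)
count_within <- function(edges, labels) {
  a <- labels[edges[, 1]]
  b <- labels[edges[, 2]]
  c(R1 = sum(a == 1 & b == 1), R2 = sum(a == 2 & b == 2))
}

edges <- prim_mst(D)
R_obs <- count_within(edges, labels)

# Permutation estimates of the null mean and covariance of (R1, R2)
perm <- t(replicate(2000, count_within(edges, sample(labels))))
mu <- colMeans(perm)
Sigma <- cov(perm)

# Quadratic form S = (R - mu)' Sigma^{-1} (R - mu)
d <- R_obs - mu
S <- sum(d * solve(Sigma, d))
p_asym <- pchisq(S, df = 2, lower.tail = FALSE)
```

Because the location shift pulls the two samples apart, most tree edges join points of the same sample, so R1 and R2 both exceed their null means and S becomes large.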

For n.perm = 0, an asymptotic test using the asymptotic $\chi^2$ approximation of the null distribution is performed. For n.perm > 0, a permutation test is performed.
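
The two modes can be compared on the same data, as in the following sketch. It assumes the DataSimilarity and gTests packages are installed; the choice of 200 permutations and the object names res_asym and res_perm are arbitrary illustrations.

```r
# Sketch: asymptotic test (n.perm = 0) versus permutation test (n.perm > 0).
set.seed(1)
X1 <- matrix(rnorm(500), ncol = 5)
X2 <- matrix(rnorm(500, mean = 0.5), ncol = 5)

res_asym <- NULL
res_perm <- NULL
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("gTests", quietly = TRUE)) {
  res_asym <- DataSimilarity::CF(X1, X2)                # default n.perm = 0
  res_perm <- DataSimilarity::CF(X1, X2, n.perm = 200)  # permutation test
  # Both results are htest objects, so the components listed under Value apply
  print(c(asymptotic = res_asym$p.value, permutation = res_perm$p.value))
}
```

The permutation test avoids relying on the asymptotic approximation, at the cost of recomputing the statistic for every permutation.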

This implementation is a wrapper around the function g.tests that adapts the input and output of that function to match the other functions provided in this package. For more details, see the documentation of g.tests.

References

Chen, H. and Friedman, J.H. (2017). A New Graph-Based Two-Sample Test for Multivariate and Object Data. Journal of the American Statistical Association, 112(517), 397-409. doi:10.1080/01621459.2016.1147356

Chen, H., and Zhang, J. (2017). gTests: Graph-Based Two-Sample Tests. R package version 0.2, https://CRAN.R-project.org/package=gTests.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

See Also

FR for the original edge-count test, CCS for the weighted edge-count test, ZC for the max-type edge-count test, gTests for performing all these edge-count tests at once, SH for performing the Schilling-Henze nearest neighbor test, and CCS_cat, FR_cat, CF_cat, ZC_cat, and gTests_cat for versions of the tests for categorical data.

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform generalized edge-count test
if(requireNamespace("gTests", quietly = TRUE)) {
  CF(X1, X2)
}
