DataSimilarity (version 0.2.0)

Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing

Description

A collection of methods for quantifying the similarity of two or more datasets, many of which can be used for two- or k-sample testing. It provides newly implemented methods as well as wrapper functions for existing methods that enable calling many different methods in a unified framework. The methods were selected from the review and comparison of Stolte et al. (2024).

Install

install.packages('DataSimilarity')
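
Once the package is installed, a quick way to get oriented is to load it and inspect what it exports. The following is a minimal sketch that assumes only base R tooling (`requireNamespace`, `ls`, `args`, `help`) and the function names that appear in the index on this page. It deliberately prints the argument list of the `DataSimilarity` wrapper instead of assuming a particular signature.

```r
# Install only if the package is not already available.
if (!requireNamespace("DataSimilarity", quietly = TRUE)) {
  install.packages("DataSimilarity")
}
library(DataSimilarity)

# List every exported object; this should include the names in the
# function index (e.g. "DataSimilarity", "findSimilarityMethod").
exported <- ls("package:DataSimilarity")
print(exported)

# Show the arguments of the unified wrapper rather than guessing them.
print(args(DataSimilarity))

# Open the help page of the method-selection helper.
help("findSimilarityMethod", package = "DataSimilarity")
```

The same pattern (`args(<name>)` followed by `help("<name>", package = "DataSimilarity")`) applies to any other entry in the function index.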

Monthly Downloads

168

Version

0.2.0

License

GPL (>= 3)

Maintainer

Marieke Stolte

Last Published

June 16th, 2025

Functions in DataSimilarity (0.2.0)

DataSimilarity

Dataset Similarity
CMDistance

Constrained Minimum Distance
CF_cat

Generalized Edge-Count Test for Discrete Data
DataSimilarity-package

Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing
DISCOF

Distance Components (DISCO) Tests
DS

Rank-Based Energy Test (Deb and Sen, 2021)
DiProPerm

Direction-Projection-Permutation (DiProPerm) Test
DISCOB

Distance Components (DISCO) Tests
CF

Generalized Edge-Count Test
Cramer

Cramér Two-Sample Test
HamiltonPath

Shortest Hamilton path
GPK

Generalized Permutation-Based Kernel (GPK) Two-Sample Test
Jeffreys

Jeffreys Divergence
Energy

Energy Statistic and Test
GGRL

Decision-Tree Based Measure of Dataset Distance and Two-Sample Test
FStest

Multisample FS Test
HMN

Random Forest Based Two-Sample Test
FR_cat

Friedman-Rafsky Test for Discrete Data
KMD

Kernel Measure of Multi-Sample Dissimilarity (KMD)
FR

Friedman-Rafsky Test
LHZStatistic

Calculation of the Li et al. (2022) Empirical Characteristic Distance
MST

Minimum Spanning Tree (MST)
Petrie

Multisample Crossmatch (MCM) Test
MMCM

Multisample Mahalanobis Crossmatch (MMCM) Test
OTDD

Optimal Transport Dataset Distance
MMD

Maximum Mean Discrepancy (MMD) Test
NKT

Decision-Tree Based Measure of Dataset Similarity (Ntoutsi et al., 2008)
LHZ

Empirical Characteristic Distance
RItest

Multisample RI Test
MW

Nonparametric Graph-Based LP (GLP) Test
ZC

Maxtype Edge-Count Test
findSimilarityMethod

Selection of Appropriate Methods for Quantifying the Similarity of Datasets
Wasserstein

Wasserstein Distance Based Test
YMRZL

Yu et al. (2007) Two-Sample Test
engineerMetric

Engineer Metric
dipro.fun

Direction-Projection Functions for DiProPerm Test
ZC_cat

Maxtype Edge-Count Test for Discrete Data
SH

Schilling-Henze Nearest Neighbor Test
Rosenbaum

Rosenbaum Crossmatch Test
SC

Graph-Based Multi-Sample Test
gTests

Graph-Based Tests
kerTests

Generalized Permutation-Based Kernel (GPK) Two-Sample Test
gTests_cat

Graph-Based Tests for Discrete Data
rectPartition

Calculate a Rectangular Partition
method.table

List of Methods Included in the Package
knn

K-Nearest Neighbor Graph
stat.fun

Univariate Two-Sample Statistics for DiProPerm Test
gTestsMulti

Graph-Based Multi-Sample Test
C2ST

Classifier Two-Sample Test
BG

Biau and Gyorfi (2005) Two-sample Homogeneity Test
BQS

Barakat et al. (1996) Two-Sample Test
BG2

Biswas and Ghosh (2014) Two-Sample Test
Bahr

Bahr (1996) Multivariate Two-sample Test
CCS_cat

Weighted Edge-Count Two-Sample Test for Discrete Data
BallDivergence

Ball Divergence Based Two- or k-sample Test
BMG

Biswas et al. (2014) Two-sample Runs Test
BF

Baringhaus and Franz (2010) Rigid Motion Invariant Multivariate Two-sample Test
CCS

Weighted Edge-Count Two-Sample Test