DataSimilarity (version 0.3.0)

Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing

Description

A collection of methods for quantifying the similarity of two or more datasets, many of which can be used for two- or k-sample testing. It provides newly implemented methods as well as wrapper functions for existing methods that enable calling many different methods in a unified framework. The methods were selected from the review and comparison of Stolte et al. (2024). An empirical comparison of the methods for categorical data was performed in Stolte et al. (2025).
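To make the idea of two-sample testing concrete, the following is a minimal base-R sketch of a permutation test built on the energy statistic. It is an illustration only and not the package's implementation: the function names `energy_stat` and `perm_energy_test` and the argument `n_perm` are hypothetical and do not belong to DataSimilarity.

```r
# Illustration only: hypothetical helper functions, not part of DataSimilarity.

# Energy statistic (V-statistic form) between two samples given as matrices:
# 2 * mean distance between samples minus the mean distances within each sample.
energy_stat <- function(x, y) {
  n <- nrow(x)
  m <- nrow(y)
  d <- as.matrix(dist(rbind(x, y)))  # pairwise Euclidean distances of pooled data
  ix <- seq_len(n)
  iy <- n + seq_len(m)
  2 * mean(d[ix, iy]) - mean(d[ix, ix]) - mean(d[iy, iy])
}

# Permutation test: reshuffle the pooled rows and recompute the statistic
# to approximate its distribution under the hypothesis of equal distributions.
perm_energy_test <- function(x, y, n_perm = 199) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  pooled <- rbind(x, y)
  n <- nrow(x)
  N <- nrow(pooled)
  observed <- energy_stat(x, y)
  perm_stats <- replicate(n_perm, {
    idx <- sample(N)
    energy_stat(pooled[idx[1:n], , drop = FALSE],
                pooled[idx[-(1:n)], , drop = FALSE])
  })
  # The +1 terms count the observed statistic as one of the permutations
  p_value <- (1 + sum(perm_stats >= observed)) / (n_perm + 1)
  list(statistic = observed, p.value = p_value)
}

# Usage: two bivariate normal samples that differ by a mean shift
set.seed(1)
x <- matrix(rnorm(60), ncol = 2)
y <- matrix(rnorm(60, mean = 3), ncol = 2)
res <- perm_energy_test(x, y)
print(res)
```

A small p-value indicates that the two samples are unlikely to come from the same distribution. The functions listed below cover this and many other statistics, with their own interfaces and options.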

Install

install.packages('DataSimilarity')
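A slightly more defensive setup might look like the sketch below. It uses only base R calls. The help topic name `method.table` is taken from the function index on this page, and nothing is assumed about its contents.

```r
# Install from CRAN only if the package is not already available
if (!requireNamespace("DataSimilarity", quietly = TRUE)) {
  install.packages("DataSimilarity")
}

# Attach the package and list the objects it exports
library(DataSimilarity)
ls("package:DataSimilarity")

# Open the package help index and the overview of included methods
help(package = "DataSimilarity")
help("method.table", package = "DataSimilarity")
```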

Monthly Downloads

159

Version

0.3.0

License

GPL (>= 3)

Maintainer

Marieke Stolte

Last Published

February 27th, 2026

Functions in DataSimilarity (0.3.0)

HMN

Random Forest Based Two-Sample Test
Cramer

Cramér Two-Sample Test
HamiltonPath

Shortest Hamilton path
DISCOF

Distance Components (DISCO) Tests
GGRL

Decision-Tree Based Measure of Dataset Distance and Two-Sample Test
MST

Minimum Spanning Tree (MST)
GPK

Generalized Permutation-Based Kernel (GPK) Two-Sample Test
CMDistance

Constrained Minimum Distance
DS

Rank-Based Energy Test (Deb and Sen, 2021)
CF_cat

Generalized Edge-Count Test for Discrete Data
CF

Generalized Edge-Count Test
FR

Friedman-Rafsky Test
Energy

Energy Statistic and Test
MW

Nonparametric Graph-Based LP (GLP) Test
DataSimilarity-package

Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing
findSimilarityMethod

Selection of Appropriate Methods for Quantifying the Similarity of Datasets
MMD

Maximum Mean Discrepancy (MMD) Test
MMCM

Multisample Mahalanobis Crossmatch (MMCM) Test
engineerMetric

Engineer Metric
SH

Schilling-Henze Nearest Neighbor Test
KMD

Kernel Measure of Multi-Sample Dissimilarity (KMD)
OTDD

Optimal Transport Dataset Distance
Petrie

Multisample Crossmatch (MCM) Test
NKT

Decision-Tree Based Measure of Dataset Similarity (Ntoutsi et al., 2008)
RItest

Multisample RI Test
LHZStatistic

Calculation of the Li et al. (2022) Empirical Characteristic Distance
Jeffreys

Jeffreys Divergence
FStest

Multisample FS Test
FR_cat

Friedman-Rafsky Test for Discrete Data
LHZ

Empirical Characteristic Distance
method.table

List of Methods Included in the Package
ZC_cat

Maxtype Edge-Count Test for Discrete Data
Wasserstein

Wasserstein Distance Based Test
knn

K-Nearest Neighbor Graph
SC

Graph-Based Multi-Sample Test
Rosenbaum

Rosenbaum Crossmatch Test
gTestsMulti

Graph-Based Multi-Sample Test
dipro.fun

Direction-Projection Functions for DiProPerm Test
gTests

Graph-Based Tests
rectPartition

Calculate a Rectangular Partition
gTests_cat

Graph-Based Tests for Discrete Data
ZC

Maxtype Edge-Count Test
stat.fun

Univariate Two-Sample Statistics for DiProPerm Test
YMRZL

Yu et al. (2007) Two-Sample Test
kerTests

Generalized Permutation-Based Kernel (GPK) Two-Sample Test
CCS

Weighted Edge-Count Two-Sample Test
BMG

Biswas et al. (2014) Two-sample Runs Test
C2ST

Classifier Two-Sample Test
BQS

Barakat et al. (1996) Two-Sample Test
BallDivergence

Ball Divergence Based Two- or k-sample Test
BG2

Biswas and Ghosh (2014) Two-Sample Test
BF

Baringhaus and Franz (2010) Rigid Motion Invariant Multivariate Two-sample Test
Bahr

Bahr (1996) Multivariate Two-sample Test
CCS_cat

Weighted Edge-Count Two-Sample Test for Discrete Data
BG

Biau and Gyorfi (2005) Two-sample Homogeneity Test
DataSimilarity

Dataset Similarity
DISCOB

Distance Components (DISCO) Tests
DiProPerm

Direction-Projection-Permutation (DiProPerm) Test