Learn R Programming

DataSimilarity (version 0.1.1)

CMDistance: Constrained Minimum Distance

Description

Calculates the Constrained Minimum Distance (Tatti, 2007) between two datasets.

Usage

CMDistance(X1, X2, binary = NULL, cov = FALSE,
            S.fun = function(x) as.numeric(as.character(x)), 
            cov.S = NULL, Omega = NULL, seed = 42)

Value

An object of class htest with the following components:

statistic

Observed value of the CM Distance

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

binary, cov, S.fun, cov.S, Omega

Input parameters

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

binary

Should the simplified form for binary data be used? (default: NULL, it is checked internally if each variable in the pooled dataset takes on exactly two distinct values)

cov

If the the binary version is used, should covariances in addition to means be used as features? (default: FALSE, corresponds to example 3 in Tatti (2007), TRUE corresponds to example 4). Ignored if binary = FALSE.

S.fun

Feature function (default: NULL). Should be supplied as a function that takes one observation vector as its input. Ignored if binary = TRUE (default: NULL).

cov.S

Covariance matix of feature function (default: NULL). Ignored if binary = TRUE.

Omega

Sample space as matrix (default: NULL, the sample space is derived from the data internally). Each row represents one value in the sample space. Used for calculating the covariance matrix if cov.S = NULL. Either cov.S or Omega must be given. Ignored if binary = TRUE.

seed

Random seed (default: 42)

Applicability

Target variable?Numeric?Categorical?K-sample?
NoNoYesNo

Details

The constrained minimum (CM) distance is not a distance between distributions but rather a distance based on summaries. These summaries, called frequencies and denoted by \(\theta\), are averages of feature functions \(S\) taken over the dataset. The constrained minimum distance of two datasets \(X_1\) and \(X_2\) can be calculated as $$d_{CM}(X_1, X_2 |S)^2 = (\theta_1 - \theta_2)^T\text{Cov}^{-1}(S)(\theta_1 - \theta_2), $$ where \(\theta_i = S(X_i)\) is the frequency with respect to the \(i\)-th dataset, \(i = 1, 2\), and $$\text{Cov}(S) = \frac{1}{|\Omega|}\sum_{\omega\in\Omega} S(\omega)S(\omega)^T - \left(\frac{1}{|\Omega|}\sum_{\omega\in\Omega} S(\omega)\right)\left(\frac{1}{|\Omega|}\sum_{\omega\in\Omega} S(\omega)\right)^T,$$ where \(\Omega\) denotes the sample space.

Note that the implementation can only handle limited dimensions of the sample space. The error message

"Error in rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) : invalid 'times' value"

occurs when the sample space becomes too large to enumerate all its elements. In case of binary data and \(S\) chosen as a conjunction or parity function \(T_{F}\) on a family of itemsets, the calculation of the CMD simplifies to $$d_{CM}(D_1, D_2 | S_{F}) = 2 ||\theta_1 - \theta_2||_2,$$ where \(\theta_i = T_{F}(X_i), i = 1, 2,\) as the sample space and covariance matrix are known. In case of more than two categories, either the sample space or the covariance matrix of the feature function must be supplied.

Small values of the CM Distance indicate similarity between the datasets. No test is conducted.

References

Tatti, N. (2007). Distances between Data Sets Based on Summary Statistics. JMRL 8, 131-154.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163 - 298. tools:::Rd_expr_doi("10.1214/24-SS149")

Examples

Run this code
# Test example 2 in Tatti (2007)
CMDistance(X1 = data.frame(c("C", "C", "C", "A")), 
           X2 = data.frame(c("C", "A", "B", "A")),
           binary = FALSE, S.fun = function(x) as.numeric(x == "C"),
           Omega = data.frame(c("A", "B", "C")))

# Demonstration of corrected calculation
X1bin <- matrix(sample(0:1, 100 * 3, replace = TRUE), ncol = 3)
X2bin <- matrix(sample(0:1, 100 * 3, replace = TRUE, prob = 1:2), ncol = 3)
CMDistance(X1bin, X2bin, binary = TRUE, cov = FALSE)
Omega <- expand.grid(0:1, 0:1, 0:1)
S.fun <- function(x) x
CMDistance(X1bin, X2bin, binary = FALSE, S.fun = S.fun, Omega = Omega)
CMDistance(X1bin, X2bin, binary = FALSE, S.fun = S.fun, cov.S = 0.5 * diag(3))
CMDistance(X1bin, X2bin, binary = FALSE, S.fun = S.fun, 
            cov.S = 0.5 * diag(3))$statistic * sqrt(2)

# Example for non-binary data
X1cat <- matrix(sample(1:4, 300, replace = TRUE), ncol = 3)
X2cat <- matrix(sample(1:4, 300, replace = TRUE, prob = 1:4), ncol = 3)
CMDistance(X1cat, X2cat, binary = FALSE, S.fun = S.fun, 
           Omega = expand.grid(1:4, 1:4, 1:4))
CMDistance(X1cat, X2cat, binary = FALSE, S.fun = function(x) as.numeric(x == 1), 
           Omega = expand.grid(1:4, 1:4, 1:4))
CMDistance(X1cat, X2cat, binary = FALSE, S.fun = function(x){ 
           c(x, x[1] * x[2], x[1] * x[3], x[2] * x[3])}, 
           Omega = expand.grid(1:4, 1:4, 1:4))

Run the code above in your browser using DataLab