CMDistance: Constrained Minimum Distance

Description

Calculates the Constrained Minimum Distance (Tatti, 2007) between two datasets.

Usage

CMDistance(X1, X2, binary = NULL, cov = FALSE,
            S.fun = function(x) as.numeric(as.character(x)), 
            cov.S = NULL, Omega = NULL, seed = 42)

Value

An object of class htest with the following components:

statistic: Observed value of the CM Distance
alternative: The alternative hypothesis
method: Description of the test
data.name: The dataset names
binary, cov, S.fun, cov.S, Omega: Input parameters

Arguments

X1: First dataset as matrix or data.frame
X2: Second dataset as matrix or data.frame
binary: Should the simplified form for binary data be used? (default: NULL, it is checked internally if each variable in the pooled dataset takes on exactly two distinct values)
cov: If the binary version is used, should covariances in addition to means be used as features? (default: FALSE, corresponds to example 3 in Tatti (2007), TRUE corresponds to example 4). Ignored if binary = FALSE.
S.fun: Feature function (default: function(x) as.numeric(as.character(x)), as shown under Usage). Should be supplied as a function that takes one observation vector as its input. Ignored if binary = TRUE.
cov.S: Covariance matrix of feature function (default: NULL). Ignored if binary = TRUE.
Omega: Sample space as matrix (default: NULL, the sample space is derived from the data internally). Each row represents one value in the sample space. Used for calculating the covariance matrix if cov.S = NULL. Either cov.S or Omega must be given. Ignored if binary = TRUE.
seed: Random seed (default: 42)
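
As an illustration of the Omega argument, the following base R sketch builds a sample space with one row per possible observation and checks how many rows it has. The variable name levels_per_variable and the choice of three variables with four levels each are assumptions made for this illustration only.

```r
# Hypothetical setting: three categorical variables with levels 1 to 4 each.
levels_per_variable <- list(1:4, 1:4, 1:4)

# expand.grid() enumerates every combination; each row is one element of Omega.
Omega <- expand.grid(levels_per_variable)

# The number of rows is the product of the numbers of levels per variable.
n_elements <- nrow(Omega)
stopifnot(n_elements == prod(lengths(levels_per_variable)))

# The size grows exponentially with the number of variables, e.g. for
# 3, 10 and 20 variables with four levels each:
sizes <- 4^c(3, 10, 20)
print(sizes)
```

Because the number of rows grows exponentially in the number of variables, supplying cov.S directly may be the more practical option for wide datasets.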

Applicability

Target variable?	Numeric?	Categorical?	K-sample?
No	No	Yes	No

Details

The constrained minimum (CM) distance is not a distance between distributions but rather a distance based on summaries. These summaries, called frequencies and denoted by $\theta$, are averages of feature functions $S$ taken over the dataset. The constrained minimum distance of two datasets $X_1$ and $X_2$ can be calculated as

$$d_{CM}(X_1, X_2 | S)^2 = (\theta_1 - \theta_2)^T\text{Cov}^{-1}(S)(\theta_1 - \theta_2),$$

where $\theta_i = S(X_i)$ is the frequency with respect to the $i$-th dataset, $i = 1, 2$, and

$$\text{Cov}(S) = \frac{1}{|\Omega|}\sum_{\omega\in\Omega} S(\omega)S(\omega)^T - \left(\frac{1}{|\Omega|}\sum_{\omega\in\Omega} S(\omega)\right)\left(\frac{1}{|\Omega|}\sum_{\omega\in\Omega} S(\omega)\right)^T,$$

where $\Omega$ denotes the sample space.
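
The definition can be traced step by step in base R. The following is a minimal sketch under stated assumptions, not the package's implementation: cm_distance_sketch and feature_matrix are hypothetical helpers introduced here, the feature function is applied row by row, and the covariance uses the $1/|\Omega|$ normalization from the formula. The data are those of example 2 in Tatti (2007), as used in the Examples section.

```r
# Hypothetical helper (not part of the package): CM distance from the definition.
cm_distance_sketch <- function(X1, X2, S, Omega) {
  # Apply the feature function S to each row; returns one row per observation.
  feature_matrix <- function(D) {
    D <- as.matrix(D)
    do.call(rbind, lapply(seq_len(nrow(D)), function(i) S(D[i, ])))
  }
  # Frequencies: averages of the feature function over each dataset.
  theta1 <- colMeans(feature_matrix(X1))
  theta2 <- colMeans(feature_matrix(X2))
  # Covariance of S over the sample space, normalized by |Omega|.
  S_omega <- feature_matrix(Omega)
  m <- colMeans(S_omega)
  cov_S <- crossprod(S_omega) / nrow(S_omega) - tcrossprod(m)
  # Quadratic form (theta1 - theta2)^T Cov^{-1}(S) (theta1 - theta2).
  delta <- theta1 - theta2
  sqrt(drop(t(delta) %*% solve(cov_S, delta)))
}

X1 <- matrix(c("C", "C", "C", "A"), ncol = 1)
X2 <- matrix(c("C", "A", "B", "A"), ncol = 1)
Omega <- matrix(c("A", "B", "C"), ncol = 1)
d <- cm_distance_sketch(X1, X2, S = function(x) as.numeric(x == "C"), Omega = Omega)
print(d)
```

Here $\theta_1 = 0.75$, $\theta_2 = 0.25$ and $\text{Cov}(S) = 1/3 - 1/9 = 2/9$, so the squared distance is $0.25 / (2/9) = 9/8$.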

Note that the implementation can only handle limited dimensions of the sample space. The error message

"Error in rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) : invalid 'times' value"

occurs when the sample space becomes too large to enumerate all its elements.

In case of binary data and $S$ chosen as a conjunction or parity function $T_{F}$ on a family of itemsets, the sample space and the covariance matrix are known, so the calculation of the CM distance simplifies to

$$d_{CM}(X_1, X_2 | S_{F}) = 2 ||\theta_1 - \theta_2||_2,$$

where $\theta_i = T_{F}(X_i), i = 1, 2$. In case of more than two categories, either the sample space or the covariance matrix of the feature function must be supplied.
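
The simplification can be checked numerically for the simplest case. The sketch below assumes the item means as features (identity feature function) on the sample space $\{0, 1\}^K$; it uses only base R and does not claim to reproduce what the package does internally for binary = TRUE. Over this sample space the covariance matrix is $0.25 \cdot I_K$, so the general quadratic form reduces to $2 ||\theta_1 - \theta_2||_2$.

```r
set.seed(1)
K <- 3
X1 <- matrix(sample(0:1, 100 * K, replace = TRUE), ncol = K)
X2 <- matrix(sample(0:1, 100 * K, replace = TRUE, prob = c(1, 2)), ncol = K)

# Full sample space {0, 1}^K, one row per element.
Omega <- as.matrix(expand.grid(rep(list(0:1), K)))

# Covariance of the identity feature function over Omega (1/|Omega| normalization).
m <- colMeans(Omega)
cov_S <- crossprod(Omega) / nrow(Omega) - tcrossprod(m)

# General formula versus the simplified binary formula.
delta <- colMeans(X1) - colMeans(X2)
d_general <- sqrt(drop(t(delta) %*% solve(cov_S, delta)))
d_simplified <- 2 * sqrt(sum(delta^2))
print(all.equal(d_general, d_simplified))
```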

Small values of the CM Distance indicate similarity between the datasets. No test is conducted.

References

Tatti, N. (2007). Distances between Data Sets Based on Summary Statistics. JMLR 8, 131-154.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

Examples

# Test example 2 in Tatti (2007)
CMDistance(X1 = data.frame(c("C", "C", "C", "A")), 
           X2 = data.frame(c("C", "A", "B", "A")),
           binary = FALSE, S.fun = function(x) as.numeric(x == "C"),
           Omega = data.frame(c("A", "B", "C")))

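# Binary data: compare the simplified form with the general calculation
X1bin <- matrix(sample(0:1, 100 * 3, replace = TRUE), ncol = 3)
X2bin <- matrix(sample(0:1, 100 * 3, replace = TRUE, prob = 1:2), ncol = 3)
# Simplified form for binary data, using means only as features
CMDistance(X1bin, X2bin, binary = TRUE, cov = FALSE)
# General form with the full sample space {0, 1}^3 and the identity as feature function
Omega <- expand.grid(0:1, 0:1, 0:1)
S.fun <- function(x) x
CMDistance(X1bin, X2bin, binary = FALSE, S.fun = S.fun, Omega = Omega)
# General form with a user-supplied covariance matrix of the features
CMDistance(X1bin, X2bin, binary = FALSE, S.fun = S.fun, cov.S = 0.5 * diag(3))
# With cov.S = 0.5 * diag(3) the statistic is sqrt(2) * ||theta1 - theta2||_2,
# so multiplying by sqrt(2) gives 2 * ||theta1 - theta2||_2
CMDistance(X1bin, X2bin, binary = FALSE, S.fun = S.fun, 
            cov.S = 0.5 * diag(3))$statistic * sqrt(2)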

# Example for non-binary data
X1cat <- matrix(sample(1:4, 300, replace = TRUE), ncol = 3)
X2cat <- matrix(sample(1:4, 300, replace = TRUE, prob = 1:4), ncol = 3)
CMDistance(X1cat, X2cat, binary = FALSE, S.fun = S.fun, 
           Omega = expand.grid(1:4, 1:4, 1:4))
CMDistance(X1cat, X2cat, binary = FALSE, S.fun = function(x) as.numeric(x == 1), 
           Omega = expand.grid(1:4, 1:4, 1:4))
CMDistance(X1cat, X2cat, binary = FALSE, S.fun = function(x){ 
           c(x, x[1] * x[2], x[1] * x[3], x[2] * x[3])}, 
           Omega = expand.grid(1:4, 1:4, 1:4))
