DataSimilarity (version 0.1.1)

GGRL: Decision-Tree Based Measure of Dataset Distance and Two-Sample Test

Description

Calculates Decision-Tree Based Measure of Dataset Distance by Ganti et al. (2002).

Usage

GGRL(X1, X2, target1 = "y", target2 = "y", n.perm = 0, m = 1, diff.fun = f.a, 
      agg.fun = sum, tune = TRUE, k = 5, n.eval = 100, seed = 42, ...)
GGRLCat(X1, X2, target1 = "y", target2 = "y", n.perm = 0, m = 1, diff.fun = f.aCat, 
        agg.fun = sum, tune = TRUE, k = 5, n.eval = 100, seed = 42, ...)
f.a(sec.parti, X1, X2)
f.s(sec.parti, X1, X2)
f.aCat(sec.parti, X1, X2)
f.sCat(sec.parti, X1, X2)

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

Permutation p-value

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

target1

Character specifying the column name of the class variable in the first dataset (default: "y")

target2

Character specifying the column name of the class variable in the second dataset (default: "y")

n.perm

Number of permutations for the permutation test (default: 0, i.e. no permutation test is performed)

m

Subsampling rate for the bootstrap test (default: 1). Ganti et al. (2002) suggest that 0.2-0.3 is sufficient in many cases. Ignored if n.perm <= 0.

diff.fun

Difference function (default: f.a, absolute difference). Other options: f.s (scaled difference), or a user-specified function that takes the greatest common refinement (GCR) partition and both datasets as input and returns a vector of difference values, one for each section of the partition.

agg.fun

Aggregate function (default: sum). Other options are max, or a user-specified function that takes the output of diff.fun and aggregates it into a single value. Note that the GCR has been shown to be optimal only for sum.

tune

Should the decision tree parameters be tuned? (default: TRUE)

k

Number of folds used in cross-validation for parameter tuning (default: 5). Ignored if tune = FALSE.

n.eval

Number of evaluations for random search used for parameter tuning (default: 100). Ignored if tune = FALSE.

seed

Random seed (default: 42)

...

Further arguments passed to rpart. Ignored if tune = TRUE.

sec.parti

Intersected partition as output by calculateGCR, i.e. a list containing the intersected partition and each individual partition as data.frames with the limits for each variable.

Applicability

Target variable?  Numeric?  Categorical?  K-sample?
No                Yes       Yes           No

Details

The method first calculates the greatest common refinement (GCR), that is, the intersection of the sample-space partitions induced by a decision tree fit to the first dataset and a decision tree fit to the second dataset. The proportion of samples falling into each section of the GCR is calculated for each dataset. These proportions are compared using a difference function, and the results are aggregated by the aggregate function.
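As a rough illustration of this pipeline, the sketch below uses a hypothetical one-dimensional GCR with cut points at 0 and 1 (in practice the sections come from the fitted trees); per-section proportions are compared with an absolute difference and aggregated by sum:

```r
set.seed(1)
x1 <- rnorm(200)               # first sample
x2 <- rnorm(200, mean = 0.5)   # second, shifted sample

# Hypothetical 1-D GCR sections: (-Inf, 0], (0, 1], (1, Inf)
breaks <- c(-Inf, 0, 1, Inf)
p1 <- table(cut(x1, breaks)) / length(x1)   # per-section proportions, dataset 1
p2 <- table(cut(x2, breaks)) / length(x2)   # per-section proportions, dataset 2

# Absolute differences per section, aggregated by sum (an f.a + sum analogue)
stat <- sum(abs(p1 - p2))
stat
```

The real GCR is multivariate and derived from the tree splits, but the counting-and-comparing step has this shape.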

The implementation uses rpart for fitting classification trees to each dataset.

best.rpart is used for hyperparameter tuning if tune = TRUE. The parameters are tuned via cross-validation and random search: minsplit is tuned over 2^(1:7), minbucket over 2^(0:6), and cp over 10^seq(-4, -1, by = 0.001).
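The search space described above can be written out as plain R vectors (illustration only; the random search over n.eval candidate combinations is performed by best.rpart):

```r
# Tuning grids as described above; the random search draws n.eval
# combinations from the cross product of these candidate values.
grids <- list(minsplit  = 2^(1:7),                     # 2, 4, ..., 128
              minbucket = 2^(0:6),                     # 1, 2, ..., 64
              cp        = 10^seq(-4, -1, by = 0.001))  # 10^-4 ... 10^-1
sapply(grids, length)   # 7, 7 and 3001 candidate values
```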

Pre-implemented methods for the difference function are $$f_a(\kappa_1, \kappa_2, n_1, n_2) = \left|\frac{\kappa_1}{n_1} - \frac{\kappa_2}{n_2}\right|$$ and $$f_s(\kappa_1, \kappa_2, n_1, n_2) = \begin{cases} \dfrac{\left|\frac{\kappa_1}{n_1} - \frac{\kappa_2}{n_2}\right|}{\left(\frac{\kappa_1}{n_1} + \frac{\kappa_2}{n_2}\right) / 2}, & \text{if } \kappa_1 + \kappa_2 > 0,\\ 0, & \text{otherwise,} \end{cases}$$ where \(\kappa_i\) is the number of observations from dataset \(i\) in the respective region of the greatest common refinement and \(n_i\) is the sample size of dataset \(i\), \(i = 1, 2\).
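Written on per-region counts, the two formulas look as follows (a toy transcription for illustration; the package's f.a and f.s instead take the GCR partition and both datasets as arguments, as shown in the Usage section):

```r
# f_a: absolute difference of the two per-region proportions
f_a_counts <- function(k1, k2, n1, n2) abs(k1 / n1 - k2 / n2)

# f_s: absolute difference scaled by the mean proportion, 0 for empty regions
f_s_counts <- function(k1, k2, n1, n2) {
  p1 <- k1 / n1
  p2 <- k2 / n2
  ifelse(k1 + k2 > 0, abs(p1 - p2) / ((p1 + p2) / 2), 0)
}

# Counts per GCR region for two datasets with 10 observations each
k1 <- c(4, 6, 0)
k2 <- c(2, 8, 0)
f_a_counts(k1, k2, 10, 10)        # per-region differences: 0.2 0.2 0.0
sum(f_a_counts(k1, k2, 10, 10))   # aggregated with the default agg.fun = sum
```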

The aggregate function aggregates the results of the difference function over all regions in the greatest common refinement.

References

Ganti, V., Gehrke, J., Ramakrishnan, R. and Loh, W.-Y. (2002). A Framework for Measuring Differences in Data Characteristics. Journal of Computer and System Sciences, 64(3). doi:10.1006/jcss.2001.1808

Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statistics Surveys, 18, 163-298. doi:10.1214/24-SS149

See Also

NKT

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
y1 <- rbinom(100, 1, 1 / (1 + exp(1 - X1 %*% rep(0.5, 10))))
y2 <- rbinom(100, 1, 1 / (1 + exp(1 - X2 %*% rep(0.7, 10))))
X1 <- data.frame(X = X1, y = y1)
X2 <- data.frame(X = X2, y = y2)
# Calculate Ganti et al. statistic (without tuning and testing due to runtime)
if(requireNamespace("rpart", quietly = TRUE)) {
  GGRL(X1, X2, "y", "y", tune = FALSE)
}

# Categorical case
set.seed(1234)
X1 <- data.frame(X1 = factor(sample(letters[1:5], 1000, TRUE)), 
                 X2 = factor(sample(letters[1:4], 1000, TRUE)), 
                 X3 = factor(sample(letters[1:3], 1000, TRUE)), 
                 y = sample(0:1, 1000, TRUE))
X2 <- data.frame(X1 = factor(sample(letters[1:5], 1000, TRUE, 1:5)), 
                 X2 = factor(sample(letters[1:4], 1000, TRUE, 1:4)), 
                 X3 = factor(sample(letters[1:3], 1000, TRUE, 1:3)), 
                 y = sample(0:1, 1000, TRUE))
# Calculate Ganti et al. statistic (without tuning and testing due to runtime)
if(requireNamespace("rpart", quietly = TRUE)) {
  GGRLCat(X1, X2, "y", "y", tune = FALSE)
}
