DataSimilarity (version 0.1.1)

GGRL: Decision-Tree Based Measure of Dataset Distance and Two-Sample Test

Description

Calculates Decision-Tree Based Measure of Dataset Distance by Ganti et al. (2002).

Usage

GGRL(X1, X2, target1 = "y", target2 = "y", n.perm = 0, m = 1, diff.fun = f.a, 
      agg.fun = sum, tune = TRUE, k = 5, n.eval = 100, seed = 42, ...)
GGRLCat(X1, X2, target1 = "y", target2 = "y", n.perm = 0, m = 1, diff.fun = f.aCat, 
        agg.fun = sum, tune = TRUE, k = 5, n.eval = 100, seed = 42, ...)
f.a(sec.parti, X1, X2)
f.s(sec.parti, X1, X2)
f.aCat(sec.parti, X1, X2)
f.sCat(sec.parti, X1, X2)

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

Permutation p-value

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

target1

Character specifying the column name of the class variable in the first dataset (default: "y")

target2

Character specifying the column name of the class variable in the second dataset (default: "y")

n.perm

Number of permutations for the permutation test (default: 0, i.e. no permutation test is performed)

m

Subsampling rate for the bootstrap test (default: 1). Ganti et al. (2002) suggest that 0.2-0.3 is sufficient in many cases. Ignored if n.perm <= 0.

diff.fun

Difference function (default: f.a, absolute difference). Other options: f.s (scaled difference), or a user-specified function that takes the greatest common refinement (GCR) partition and both datasets as input and returns a vector of difference values, one for each section of the partition.

agg.fun

Aggregate function (default: sum). Other options are max, or a user-specified function that takes the output of diff.fun and aggregates it into a single value. Note that the GCR has been shown to be optimal only for sum.

tune

Should the decision tree parameters be tuned? (default: TRUE)

k

Number of folds used in cross-validation for parameter tuning (default: 5). Ignored if tune = FALSE.

n.eval

Number of evaluations for random search used for parameter tuning (default: 100). Ignored if tune = FALSE.

seed

Random seed (default: 42)

...

Further arguments passed to rpart. Ignored if tune = TRUE.

sec.parti

Intersected partition as output by calculateGCR, i.e. a list containing the intersected partition and each individual partition as data.frames with the limits for each variable.

Applicability

Target variable?  Numeric?  Categorical?  K-sample?
No                Yes       Yes           No

Details

The method first calculates the greatest common refinement (GCR), that is, the intersection of the sample-space partitions induced by a decision tree fit to the first dataset and a decision tree fit to the second dataset. The proportion of samples falling into each section of the GCR is calculated for each dataset. These proportions are compared using a difference function, and the results are aggregated by the aggregate function.
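As a rough illustration of this pipeline, the sketch below uses a hypothetical one-dimensional GCR with cut points at 0 and 1 (in practice the sections come from the fitted trees); per-section proportions are compared with an absolute difference and aggregated by sum:

```r
set.seed(1)
x1 <- rnorm(200)               # first sample
x2 <- rnorm(200, mean = 0.5)   # second, shifted sample

# Hypothetical 1-D GCR sections: (-Inf, 0], (0, 1], (1, Inf)
breaks <- c(-Inf, 0, 1, Inf)
p1 <- table(cut(x1, breaks)) / length(x1)   # per-section proportions, dataset 1
p2 <- table(cut(x2, breaks)) / length(x2)   # per-section proportions, dataset 2

# Absolute differences per section, aggregated by sum (an f.a + sum analogue)
stat <- sum(abs(p1 - p2))
stat
```

The real GCR is multivariate and derived from the tree splits, but the counting-and-comparing step has this shape.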

The implementation uses rpart for fitting classification trees to each dataset.

best.rpart is used for hyperparameter tuning if tune = TRUE. The parameters are tuned via cross-validation and random search: minsplit is tuned over 2^(1:7), minbucket over 2^(0:6), and cp over 10^seq(-4, -1, by = 0.001).
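The search space described above can be written out as plain R vectors (illustration only; the random search over n.eval candidate combinations is performed by best.rpart):

```r
# Tuning grids as described above; the random search draws n.eval
# combinations from the cross product of these candidate values.
grids <- list(minsplit  = 2^(1:7),                     # 2, 4, ..., 128
              minbucket = 2^(0:6),                     # 1, 2, ..., 64
              cp        = 10^seq(-4, -1, by = 0.001))  # 10^-4 ... 10^-1
sapply(grids, length)   # 7, 7 and 3001 candidate values
```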

Pre-implemented methods for the difference function are $$f_a(\kappa_1, \kappa_2, n_1, n_2) = \left|\frac{\kappa_1}{n_1} - \frac{\kappa_2}{n_2}\right|$$ and $$f_s(\kappa_1, \kappa_2, n_1, n_2) = \begin{cases} \dfrac{\left|\frac{\kappa_1}{n_1} - \frac{\kappa_2}{n_2}\right|}{\left(\frac{\kappa_1}{n_1} + \frac{\kappa_2}{n_2}\right) / 2}, & \text{if } \kappa_1 + \kappa_2 > 0,\\ 0, & \text{otherwise,} \end{cases}$$ where \(\kappa_i\) is the number of observations from dataset \(i\) in the respective region of the greatest common refinement and \(n_i\) is the sample size of dataset \(i\), \(i = 1, 2\).
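Written on per-region counts, the two formulas look as follows (a toy transcription for illustration; the package's f.a and f.s instead take the GCR partition and both datasets as arguments, as shown in the Usage section):

```r
# f_a: absolute difference of the two per-region proportions
f_a_counts <- function(k1, k2, n1, n2) abs(k1 / n1 - k2 / n2)

# f_s: absolute difference scaled by the mean proportion, 0 for empty regions
f_s_counts <- function(k1, k2, n1, n2) {
  p1 <- k1 / n1
  p2 <- k2 / n2
  ifelse(k1 + k2 > 0, abs(p1 - p2) / ((p1 + p2) / 2), 0)
}

# Counts per GCR region for two datasets with 10 observations each
k1 <- c(4, 6, 0)
k2 <- c(2, 8, 0)
f_a_counts(k1, k2, 10, 10)        # per-region differences: 0.2 0.2 0.0
sum(f_a_counts(k1, k2, 10, 10))   # aggregated with the default agg.fun = sum
```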

The aggregate function aggregates the results of the difference function over all regions in the greatest common refinement.

References

Ganti, V., Gehrke, J., Ramakrishnan, R. and Loh, W.-Y. (2002). A Framework for Measuring Differences in Data Characteristics. Journal of Computer and System Sciences, 64(3). doi:10.1006/jcss.2001.1808

Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statistics Surveys, 18, 163-298. doi:10.1214/24-SS149

See Also

NKT

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
y1 <- rbinom(100, 1, 1 / (1 + exp(1 - X1 %*% rep(0.5, 10))))
y2 <- rbinom(100, 1, 1 / (1 + exp(1 - X2 %*% rep(0.7, 10))))
X1 <- data.frame(X = X1, y = y1)
X2 <- data.frame(X = X2, y = y2)
# Calculate Ganti et al. statistic (without tuning and testing due to runtime)
if(requireNamespace("rpart", quietly = TRUE)) {
  GGRL(X1, X2, "y", "y", tune = FALSE)
}

# Categorical case
set.seed(1234)
X1 <- data.frame(X1 = factor(sample(letters[1:5], 1000, TRUE)), 
                 X2 = factor(sample(letters[1:4], 1000, TRUE)), 
                 X3 = factor(sample(letters[1:3], 1000, TRUE)), 
                 y = sample(0:1, 1000, TRUE))
X2 <- data.frame(X1 = factor(sample(letters[1:5], 1000, TRUE, 1:5)), 
                 X2 = factor(sample(letters[1:4], 1000, TRUE, 1:4)), 
                 X3 = factor(sample(letters[1:3], 1000, TRUE, 1:3)), 
                 y = sample(0:1, 1000, TRUE))
# Calculate Ganti et al. statistic (without tuning and testing due to runtime)
if(requireNamespace("rpart", quietly = TRUE)) {
  GGRLCat(X1, X2, "y", "y", tune = FALSE)
}
