HMN: Random Forest Based Two-Sample Test

Description

Performs the random forest based two-sample test proposed by Hediger et al. (2022). The implementation here uses the hypoRF implementation from the hypoRF package.

Usage

HMN(X1, X2, n.perm = 0, statistic = "PerClassOOB", normal.approx = FALSE, 
    seed = 42, ...)

Value

An object of class htest with the following components:

statistic: Observed value of the test statistic
parameter: Paremeter(s) of the null distribution
p.value: Asymptotic p value
alternative: The alternative hypothesis
method: Description of the test
data.name: The dataset names
val: The OOB statistic values for the permuted data (for n.perm > 0)
varest: The estimated variance of the OOB statistic values for the permuted data (for n.perm > 0)
importance_ranking: Variable importance (for importance = "impurity")
cutoff: The quantile of the importance distribution at level \(\alpha\)

Arguments

X1: First dataset as matrix or data.frame
X2: Second dataset as matrix or data.frame
n.perm: Number of permutations for permutation test (default: 0, binomial test is performed).
statistic: Character specifying the test statistic. Possible options are "PerClassOOB" (default) corresponding to the sum of out-of-bag (OOB) per class errors, and "OverallOOB" corresponding to the overall OOB error.
normal.approx: Should a normal approximation be used in the permutation test procedure? (default: FALSE)
seed: Random seed (default: 42)
...: Arguments passed to ranger

Applicability

Target variable?	Numeric?	Categorical?	K-sample?
No	Yes	Yes	No

Details

For the test, a random forest is fitted to the pooled dataset where the target variable is the original dataset membership. The test statistic is either the overall out-of-bag classification accuracy or the sum or mean of the per-class out-of-bag errors for the permutation test. For the asymptotic test (n.perm = 0), the pooled dataset is split into a training and test set and the test statistic is either the overall classification error on the test set or the mean of the per-class classification errors on the test set. In the former case, a binomial test is performed, in the latter case, a Wald test is performed. If the underlying distributions coincide, classification errors close to chance level are expected. The test rejects for small classification errors.

Note that the per class OOB statistic differs for the permutation test and approximate test: for the permutation test, the sum of the per class OOB errors is returned, for the asymptotic version, the standardized sum is returned.

This implementation is a wrapper function around the function hypoRF that modifies the in- and output of that function to match the other functions provided in this package. For more details see hypoRF.

References

Hediger, S., Michel, L., Näf, J. (2022). On the use of random forest for two-sample testing. Computational Statistics & Data Analysis, 170, 107435, tools:::Rd_expr_doi("10.1016/j.csda.2022.107435").

Simon, H., Michel, L., Näf, J. (2021). hypoRF: Random Forest Two-Sample Tests. R package version 1.0.0,https://CRAN.R-project.org/package=hypoRF.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163 - 298. tools:::Rd_expr_doi("10.1214/24-SS149")

Examples

Run this code

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform random forest based test (low number of permutations due to runtime, 
# should be chosen considerably higher in practice) 
if(requireNamespace("hypoRF", quietly = TRUE)) {
  HMN(X1, X2, n.perm = 10)
}

Run the code above in your browser using DataLab