Performs the random forest based two-sample test proposed by Hediger et al. (2022). This implementation uses the hypoRF function from the hypoRF package.
HMN(X1, X2, n.perm = 0, statistic = "PerClassOOB", normal.approx = FALSE,
seed = 42, ...)
An object of class htest with the following components:
Observed value of the test statistic
Parameter(s) of the null distribution
Asymptotic p value
The alternative hypothesis
Description of the test
The dataset names
The OOB statistic values for the permuted data (for n.perm > 0)
The estimated variance of the OOB statistic values for the permuted data (for n.perm > 0)
Variable importance (for importance = "impurity")
The quantile of the importance distribution at level \(\alpha\)
First dataset as matrix or data.frame
Second dataset as matrix or data.frame
Number of permutations for permutation test (default: 0, binomial test is performed).
Character specifying the test statistic. Possible options are "PerClassOOB"
(default) corresponding to the sum of out-of-bag (OOB) per class errors, and "OverallOOB"
corresponding to the overall OOB error.
Should a normal approximation be used in the permutation test procedure? (default: FALSE)
Random seed (default: 42)
Arguments passed to ranger
Target variable? | Numeric? | Categorical? | K-sample? |
No | Yes | Yes | No |
For the test, a random forest is fitted to the pooled dataset where the target variable is the original dataset membership. For the permutation test, the test statistic is either the overall out-of-bag (OOB) classification error or the sum or mean of the per-class OOB errors. For the asymptotic test (n.perm = 0), the pooled dataset is split into a training and a test set, and the test statistic is either the overall classification error on the test set or the mean of the per-class classification errors on the test set. In the former case a binomial test is performed; in the latter case a Wald test is performed. If the underlying distributions coincide, classification errors close to chance level are expected. The test rejects for small classification errors.
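The asymptotic variant with the overall classification error can be sketched as follows. This is a minimal illustration of the idea, not the hypoRF source code: the 50/50 split, the number of trees, and the chance level of 0.5 (reasonable for equal sample sizes) are assumptions made here, and the forest is fitted directly with ranger.

```r
# Sketch only: illustrates the idea, not the hypoRF source code.
set.seed(1)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

# Pool both datasets; the class label is the dataset membership
pooled <- data.frame(rbind(X1, X2))
pooled$y <- factor(rep(c("first", "second"), c(nrow(X1), nrow(X2))))

# Assumption: 50/50 split into training and test set
train.idx <- sample(nrow(pooled), size = floor(nrow(pooled) / 2))
train <- pooled[train.idx, ]
test <- pooled[-train.idx, ]

if (requireNamespace("ranger", quietly = TRUE)) {
  # Fit the forest on the training part only
  fit <- ranger::ranger(y ~ ., data = train, num.trees = 500, seed = 1)

  # Count misclassified observations in the held-out test set
  pred <- predict(fit, data = test)$predictions
  n.errors <- sum(as.character(pred) != as.character(test$y))

  # Assumed chance level 0.5; few errors are evidence against equal
  # distributions, hence the one-sided alternative "less"
  res <- binom.test(n.errors, nrow(test), p = 0.5, alternative = "less")
  print(res$p.value)
}
```

Because the test set is not used for fitting, the number of test errors can be treated as binomial under the null hypothesis, which is what makes the test usable without permutations.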
Note that the per-class OOB statistic differs between the permutation test and the asymptotic test: for the permutation test, the sum of the per-class OOB errors is returned, whereas for the asymptotic version, the standardized sum is returned.
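To make the permutation-test statistic concrete, the following sketch computes the sum of the per-class OOB errors with ranger and compares it with the values obtained after permuting the labels. The helper per.class.oob is hypothetical, and the number of trees and the p value formula (a standard permutation p value) are assumptions, so the numbers need not match those returned by hypoRF.

```r
# Sketch only: illustrates the idea, not the hypoRF source code.
set.seed(1)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
pooled <- data.frame(rbind(X1, X2))
labels <- factor(rep(c("first", "second"), c(nrow(X1), nrow(X2))))

# Hypothetical helper: sum of the per-class OOB errors of a ranger forest
per.class.oob <- function(x, y, seed) {
  fit <- ranger::ranger(y ~ ., data = data.frame(x, y = y),
                        num.trees = 200, seed = seed)
  # For a classification forest, fit$predictions holds the OOB predictions
  wrong <- as.character(fit$predictions) != as.character(y)
  # Error rate within each class, then summed over the classes
  sum(tapply(wrong, y, mean, na.rm = TRUE))
}

if (requireNamespace("ranger", quietly = TRUE)) {
  observed <- per.class.oob(pooled, labels, seed = 1)

  # Permuting the labels mimics the null hypothesis of equal distributions
  n.perm <- 20
  permuted <- vapply(seq_len(n.perm), function(b) {
    per.class.oob(pooled, sample(labels), seed = b)
  }, numeric(1))

  # Small errors are extreme, so count permuted values at or below the
  # observed one
  p.value <- (1 + sum(permuted <= observed)) / (n.perm + 1)
  print(c(statistic = observed, p.value = p.value))
}
```

With two classes the statistic lies between 0 and 2, and values near 1 correspond to chance-level classification.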
This implementation is a wrapper around the function hypoRF that modifies the input and output of that function to match the other functions provided in this package. For more details see hypoRF.
Hediger, S., Michel, L., Näf, J. (2022). On the use of random forest for two-sample testing. Computational Statistics & Data Analysis, 170, 107435, doi:10.1016/j.csda.2022.107435.
Hediger, S., Michel, L., Näf, J. (2021). hypoRF: Random Forest Two-Sample Tests. R package version 1.0.0, https://CRAN.R-project.org/package=hypoRF.
Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv., 18, 163-298, doi:10.1214/24-SS149.
ranger, C2ST, YMRZL
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform random forest based test (low number of permutations due to runtime,
# should be chosen considerably higher in practice)
if (requireNamespace("hypoRF", quietly = TRUE)) {
  HMN(X1, X2, n.perm = 10)
}