test.gen: Generate the Test Statistic or Null Distribution Using Permutation

Description

This function generates the test statistic or a null distribution through permutation for conditional independence testing. It supports several machine learning methods, including random forests and extreme gradient boosting, and accepts custom metric functions and custom model-fitting functions.

Usage

test.gen(
  formula,
  data,
  method = "rf",
  metric = "RMSE",
  nperm = 160,
  subsample = 1,
  p = 0.5,
  nrounds = 600,
  mtry = NULL,
  nthread = 1,
  permutation = FALSE,
  robust = TRUE,
  metricfunc = NULL,
  mlfunc = NULL,
  progress = TRUE,
  center = TRUE,
  scale = TRUE,
  eps = 1e-15,
  k = 15,
  positive = NULL,
  kernel = "optimal",
  distance = 2,
  ...
)

Value

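A list containing the generated distribution: Monte Carlo samples of the test statistic, or of the null distribution when permutation = TRUE. The example below reads it from the element distribution.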

Arguments

formula: Formula specifying the relationship between dependent and independent variables.
data: Data frame. The data containing the variables used.
method: Character. The modeling method to be used. Options include "rf" for random forests, "xgboost" for gradient boosting, and "svm" for support vector machines. The arguments eps, k, positive, kernel and distance additionally refer to a "KNN" method. Default is "rf".
metric: Character. The performance metric: "RMSE", "Kappa" or "LogLoss". Default is "RMSE".
nperm: Integer. The number of generated Monte Carlo samples. Default is 160.
subsample: Numeric. The proportion of the data to be used for subsampling. Default is 1 (no subsampling).
p: Numeric. The proportion of the data to be used for training. The remaining data will be used for testing. Default is 0.5.
nrounds: Integer. The number of rounds (trees) for methods like 'xgboost' and 'rf'. Default is 600.
mtry: Integer. The number of variables to possibly split at in each node for method 'rf'. Default is NULL, in which case the rounded-down square root of the number of columns in data is used.
nthread: Integer. The number of threads to use for parallel processing. Only relevant for methods 'rf' and 'xgboost'. Default is 1.
permutation: Logical. Whether to perform permutation of the 'X' variable. Used to generate a null distribution. Default is FALSE.
robust: Logical. If TRUE, stratified permutation is performed automatically when all conditioning variables are factors or otherwise categorical. Default is TRUE.
metricfunc: Function. A custom metric function provided by the user. It must take the arguments actual, predictions, and optionally ..., and return a single numeric performance value. Default is NULL.
mlfunc: Function. A custom machine learning function provided by the user. The function must have the arguments: formula, data, train_indices, test_indices, and ..., and return a single value performance metric. Default is NULL.
progress: Logical. Whether to show a progress bar while building the null distribution. Default is TRUE.
center: Logical. If TRUE, the data is centered before model fitting. Default is TRUE.
scale: Logical. If TRUE, the data is scaled before model fitting. Default is TRUE.
eps: Numeric. A small value added to avoid division by zero. Only relevant for method 'KNN'. Default is 1e-15.
k: Integer. The number of nearest neighbors for the "KNN" method. Default is 15.
positive: Character vector. Only relevant for method 'KNN'. Specifies which levels of a factor variable should be treated as positive class in classification tasks. Default is NULL.
kernel: Character. Only relevant for method 'KNN'. Specifies the kernel type. Possible choices are "rectangular" (standard unweighted KNN), "triangular", "epanechnikov" (or beta(2,2)), "biweight" (or beta(3,3)), "triweight" (or beta(4,4)), "cos", "inv", "gaussian" and "optimal". Default is "optimal".
distance: Numeric. Parameter of Minkowski distance for the "KNN" method. Default is 2.
...: Additional arguments to pass to the machine learning wrapper functions wrapper_xgboost, wrapper_ranger, wrapper_knn and wrapper_svm, or to a custom-built wrapper function.
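
The descriptions of metricfunc and mlfunc above specify argument names but show no concrete function. The following is a minimal sketch of what such functions might look like, assuming only the signatures stated above. The names mae_metric and lm_wrapper are hypothetical, lm() is used purely as a stand-in learner, and the sketch is checked on toy data without calling test.gen. How the conditional formula (y ~ x1 | x2 + ...) is translated before it reaches mlfunc is not stated here, so an ordinary formula is used.

```r
# Hypothetical custom metric: mean absolute error.
# Signature follows the description of `metricfunc`: actual, predictions, ...
mae_metric <- function(actual, predictions, ...) {
  mean(abs(actual - predictions))
}

# Hypothetical custom learner following the description of `mlfunc`.
# It fits a linear model on the training rows and returns the test-set RMSE
# as a single numeric value.
lm_wrapper <- function(formula, data, train_indices, test_indices, ...) {
  fit <- stats::lm(formula, data = data[train_indices, , drop = FALSE])
  test <- data[test_indices, , drop = FALSE]
  pred <- stats::predict(fit, newdata = test)
  actual <- test[[all.vars(formula)[1]]]  # response is the first variable
  sqrt(mean((actual - pred)^2))
}

# Stand-alone check on simulated data (no call to test.gen is made here).
set.seed(1)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- toy$x1 + rnorm(50)
train <- sample(nrow(toy), size = 25)
test <- setdiff(seq_len(nrow(toy)), train)

mae_value <- mae_metric(c(1, 2, 3), c(1, 2, 5))
rmse_value <- lm_wrapper(y ~ x1 + x2, toy, train, test)
```

Functions of this shape would then be supplied through the metricfunc or mlfunc arguments of test.gen.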

Examples

set.seed(123)
data <- data.frame(x1 = rnorm(100),
                   x2 = rnorm(100),
                   x3 = rnorm(100),
                   x4 = rnorm(100),
                   y = rnorm(100))
result <- test.gen(formula = y ~ x1 | x2 + x3 + x4,
                   metric = "RMSE",
                   data = data)
hist(result$distribution)
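
To make the roles of p, nperm and permutation more concrete, here is a base-R sketch of the general idea behind a permutation-based null distribution. It is an illustration under stated assumptions, not the package's implementation: lm() stands in for the learner, x1 is shuffled with a plain (unstratified) permutation, and the helper one_draw and the empirical p-value at the end are hypothetical additions.

```r
# Illustrative only: lm() stands in for the learner, and this is not the
# package's implementation.
set.seed(123)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                  x4 = rnorm(n), y = rnorm(n))

# One Monte Carlo draw: optionally permute x1, split the rows into a
# training share p and a test share 1 - p, fit the model, score it.
one_draw <- function(data, p = 0.5, permutation = FALSE) {
  if (permutation) {
    data$x1 <- sample(data$x1)  # break any link between x1 and y
  }
  train <- sample(nrow(data), size = floor(p * nrow(data)))
  fit <- lm(y ~ x1 + x2 + x3 + x4, data = data[train, ])
  test <- data[-train, ]
  sqrt(mean((test$y - predict(fit, newdata = test))^2))  # test-set RMSE
}

nperm <- 160
null_dist <- replicate(nperm, one_draw(dat, permutation = TRUE))
test_stat <- one_draw(dat, permutation = FALSE)

# Share of null draws whose RMSE is at least as small as the observed one
# (a simple empirical p-value; a smaller RMSE means better prediction).
p_value <- (sum(null_dist <= test_stat) + 1) / (nperm + 1)
```

Because y is simulated independently of x1 here, the observed RMSE should usually fall well inside the null distribution; hist(null_dist) shows its shape.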
