CCI (version 0.3.6)

test.gen: Generate the Test Statistic or Null Distribution Using Permutation

Description

This function generates the test statistic or a null distribution through permutation for conditional independence testing. It supports several machine learning methods, including random forests and extreme gradient boosting, and accepts custom metric functions and custom model-fitting functions.

Usage

test.gen(
  formula,
  data,
  method = "rf",
  metric = "RMSE",
  nperm = 160,
  subsample = 1,
  p = 0.5,
  nrounds = 600,
  mtry = NULL,
  nthread = 1,
  permutation = FALSE,
  robust = TRUE,
  metricfunc = NULL,
  mlfunc = NULL,
  progress = TRUE,
  center = TRUE,
  scale = TRUE,
  eps = 1e-15,
  k = 15,
  positive = NULL,
  kernel = "optimal",
  distance = 2,
  ...
)

Value

A list containing the test distribution.

Arguments

formula

Formula specifying the relationship between dependent and independent variables.

data

Data frame. The data containing the variables used.

method

Character. The modeling method to use. Options include "rf" for random forests, "xgboost" for gradient boosting, and "svm" for support vector machines. Default is "rf".

metric

Character. The type of metric: "RMSE", "Kappa" or "LogLoss". Default is "RMSE".

nperm

Integer. The number of generated Monte Carlo samples. Default is 160.

subsample

Numeric. The proportion of the data to be used for subsampling. Default is 1 (no subsampling).

p

Numeric. The proportion of the data to be used for training. The remaining data will be used for testing. Default is 0.5.
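
To illustrate how subsample and p might interact, the base R sketch below first draws a subsample of the rows and then splits it into training and test indices. The helper name split_indices is hypothetical and is not taken from the CCI source; the package's internal logic may differ.

```r
# Hypothetical helper (not part of CCI): subsample rows, then split train/test.
split_indices <- function(n, subsample = 1, p = 0.5) {
  # Keep a random fraction `subsample` of the n rows
  kept <- sample.int(n, size = floor(subsample * n))
  # Use a fraction `p` of the kept rows for training; `kept` is already shuffled
  n_train <- floor(p * length(kept))
  train <- kept[seq_len(n_train)]
  # Everything kept but not used for training goes to the test set
  test <- setdiff(kept, train)
  list(train_indices = train, test_indices = test)
}

set.seed(1)
idx <- split_indices(100, subsample = 0.8, p = 0.5)
lengths(idx)
```

With subsample = 1 every row is used, and p alone decides the train/test ratio.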

nrounds

Integer. The number of rounds (trees) for methods like 'xgboost' and 'rf'. Default is 600.

mtry

Integer. The number of variables to possibly split at in each node for method 'rf'. Default is NULL, in which case the square root of the number of columns in data, rounded down, is used.

nthread

Integer. The number of threads to use for parallel processing. Only relevant for methods 'rf' and 'xgboost'. Default is 1.

permutation

Logical. Whether to perform permutation of the 'X' variable. Used to generate a null distribution. Default is FALSE.

robust

Logical. If TRUE, automatically performs stratified permutation when all conditioning variables are factors (categorical). Default is TRUE.
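
As a rough illustration of what stratified permutation means, the sketch below shuffles a variable only within the levels of a single factor, so values never move between strata. The helper permute_within is hypothetical and written in base R; it is not the CCI implementation, which may handle several conditioning variables differently.

```r
# Hypothetical helper (not part of CCI): permute x within the levels of factor z.
permute_within <- function(x, z) {
  # ave() applies FUN to the values of x inside each group defined by z.
  # Indexing with sample.int() avoids the sample(v) pitfall for length-1 groups.
  ave(x, z, FUN = function(v) v[sample.int(length(v))])
}

set.seed(42)
z <- factor(rep(c("a", "b"), each = 5))
x <- c(1:5, 101:105)
x_perm <- permute_within(x, z)

# Values stay inside their stratum: group "a" still holds 1..5, group "b" 101..105
split(x_perm, z)
```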

metricfunc

Function. A custom metric function provided by the user. It must take the arguments actual and predictions, and optionally ..., and return a single numeric performance value. Default is NULL.
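
A minimal sketch of a function with the shape described above, using mean absolute error as the metric. The name mae_metric is hypothetical; only the argument names actual, predictions and ... come from this page.

```r
# Hypothetical custom metric: mean absolute error.
# Takes `actual` and `predictions`, ignores extra arguments, returns one number.
mae_metric <- function(actual, predictions, ...) {
  mean(abs(actual - predictions))
}

# Stand-alone check on toy vectors
mae_metric(actual = c(1, 2, 3), predictions = c(1, 2, 5))

# Assumed usage with test.gen (requires the CCI package):
# result <- test.gen(y ~ x1 | x2 + x3, data = data, metricfunc = mae_metric)
```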

mlfunc

Function. A custom machine learning function provided by the user. The function must accept the arguments formula, data, train_indices, test_indices, and ..., and return a single numeric performance metric. Default is NULL.
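
The sketch below shows one way a function with this signature could look, fitting a linear model on the training rows and returning the test-set RMSE. The name lm_wrapper is hypothetical, and the code assumes that formula arrives as a standard R formula such as y ~ x1 + x2; how test.gen passes the formula to a custom function is not stated on this page.

```r
# Hypothetical custom learner: linear regression scored by test-set RMSE.
# Assumption: `formula` is a standard R formula (e.g. y ~ x1 + x2).
lm_wrapper <- function(formula, data, train_indices, test_indices, ...) {
  # Fit on the training rows only
  fit <- lm(formula, data = data[train_indices, , drop = FALSE])
  # Predict on the held-out rows
  test <- data[test_indices, , drop = FALSE]
  predictions <- predict(fit, newdata = test)
  # The response is the first variable named in the formula
  actual <- test[[all.vars(formula)[1]]]
  # Return a single numeric performance value
  sqrt(mean((actual - predictions)^2))
}

set.seed(123)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y <- df$x1 + rnorm(100)
lm_wrapper(y ~ x1 + x2, df, train_indices = 1:50, test_indices = 51:100)
```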

progress

Logical. Whether to show a progress bar while building the null distribution. Default is TRUE.

center

Logical. If TRUE, the data is centered before model fitting. Default is TRUE.

scale

Logical. If TRUE, the data is scaled before model fitting. Default is TRUE.

eps

Numeric. A small value added to avoid division by zero. Only relevant for method 'KNN'. Default is 1e-15.

k

Integer. The number of nearest neighbors for the "KNN" method. Default is 15.

positive

Character vector. Only relevant for method 'KNN'. Specifies which levels of a factor variable should be treated as positive class in classification tasks. Default is NULL.

kernel

Character. Only relevant for method 'KNN'. Specifies the kernel type. Possible choices are "rectangular" (standard unweighted knn), "triangular", "epanechnikov" (or beta(2,2)), "biweight" (or beta(3,3)), "triweight" (or beta(4,4)), "cos", "inv", "gaussian" and "optimal". Default is "optimal".

distance

Numeric. Parameter of Minkowski distance for the "KNN" method. Default is 2.
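
For reference, the Minkowski distance between two numeric vectors can be written in a few lines of base R. The helper minkowski below is hypothetical and only illustrates the role of the distance parameter; it is not the routine the KNN backend uses.

```r
# Hypothetical helper: Minkowski distance of order `distance` between a and b.
minkowski <- function(a, b, distance = 2) {
  sum(abs(a - b)^distance)^(1 / distance)
}

minkowski(c(0, 0), c(3, 4), distance = 2)  # 5, the Euclidean distance
minkowski(c(0, 0), c(3, 4), distance = 1)  # 7, the Manhattan distance
```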

...

Additional arguments to pass to the machine learning wrapper functions wrapper_xgboost, wrapper_ranger, wrapper_knn and wrapper_svm, or to a custom-built wrapper function.

Examples

set.seed(123)
data <- data.frame(x1 = rnorm(100),
                   x2 = rnorm(100),
                   x3 = rnorm(100),
                   x4 = rnorm(100),
                   y = rnorm(100))
result <- test.gen(formula = y ~ x1 | x2 + x3 + x4,
                   metric = "RMSE",
                   data = data)
hist(result$distribution)
