audit_leakage: Audit leakage and confounding

Description

Computes a post-hoc leakage audit for a resampled model fit. The audit (1) compares observed cross-validated performance to a label-permutation null (by default refitting when data are available; otherwise using fixed predictions), (2) tests whether fold assignments are associated with batch or study metadata (confounding by design), (3) scans features for unusually strong outcome proxies, and (4) flags duplicate or near-duplicate samples in a reference feature matrix.

The returned [LeakAudit] summarizes these diagnostics. It relies on the stored predictions, splits, and optional metadata; it does not refit models unless `perm_refit = TRUE`, or `perm_refit = "auto"` with a valid `perm_refit_spec` and `B <= perm_refit_auto_max`. Results are conditional on the chosen metric and the supplied metadata and features, and should be read as diagnostics, not as proof of leakage or of its absence.

Usage

audit_leakage(
  fit,
  metric = c("auc", "pr_auc", "accuracy", "macro_f1", "log_loss", "rmse", "cindex"),
  B = 200,
  perm_stratify = FALSE,
  perm_refit = "auto",
  perm_refit_auto_max = 200,
  perm_refit_spec = NULL,
  perm_mode = NULL,
  time_block = c("circular", "stationary"),
  block_len = NULL,
  include_z = TRUE,
  ci_method = c("if", "bootstrap"),
  boot_B = 400,
  parallel = FALSE,
  seed = 1,
  return_perm = TRUE,
  batch_cols = NULL,
  coldata = NULL,
  X_ref = NULL,
  target_scan = TRUE,
  target_scan_multivariate = TRUE,
  target_scan_multivariate_B = 100,
  target_scan_multivariate_components = 10,
  target_scan_multivariate_interactions = TRUE,
  target_threshold = 0.9,
  feature_space = c("raw", "rank"),
  sim_method = c("cosine", "pearson"),
  sim_threshold = 0.995,
  nn_k = 50,
  max_pairs = 5000,
  duplicate_scope = c("train_test", "all"),
  learner = NULL
)

Value

A LeakAudit S4 object containing:

fit: The LeakFit object that was audited.
permutation_gap: One-row data.frame with columns: metric_obs (observed cross-validated metric), perm_mean (mean of permuted metrics), perm_sd (standard deviation), gap (observed minus permuted mean, or vice versa for loss metrics), z (standardized gap), p_value (permutation p-value), and n_perm (number of permutations). A large positive gap and small p-value suggest the model captures signal beyond random label assignment.
perm_values: Numeric vector of length B containing the metric value from each permutation. Useful for plotting the null distribution. Empty if return_perm = FALSE.
batch_assoc: Data.frame of chi-square association tests between fold assignment and batch/study metadata, with columns: variable, stat (chi-square statistic), df (degrees of freedom), pval, and cramer_v (effect size). Small p-values indicate potential confounding by design.
target_assoc: Data.frame of per-feature outcome associations with columns: feature, type ("numeric" or "categorical"), metric (AUC, correlation, eta_sq, or Cramer's V depending on task), value, score (scaled effect size), p_value, n, and flag (TRUE if score >= target_threshold). Flagged features may indicate target leakage.
duplicates: Data.frame of near-duplicate sample pairs with columns: i, j (row indices in X_ref), sim (similarity value), and in_train_test (whether the two samples fall on opposite sides of a train/test split). Duplicates split across train and test can inflate performance estimates.
trail: List capturing audit parameters and intermediate results for reproducibility, including metric, B, seed, perm_stratify, perm_refit, and timing info.
info: List with additional metadata including multivariate scan results when target_scan_multivariate = TRUE.

Use summary() to print a human-readable report, or access slots directly with @.

Arguments

fit: A [LeakFit] object from [fit_resample()] containing cross-validated predictions and split metadata. If predictions include learner IDs for multiple models, you must supply `learner` to select one; if learner IDs are absent, the audit uses all predictions and may mix learners.
metric: Character scalar. One of `"auc"`, `"pr_auc"`, `"accuracy"`, `"macro_f1"`, `"log_loss"`, `"rmse"`, or `"cindex"`. Defaults to `"auc"`. This controls the observed performance statistic, the permutation null, and the sign of the reported gap.
B: Integer scalar. Number of permutations used to build the null distribution (default 200). Larger values reduce Monte Carlo error but increase runtime.
perm_stratify: Logical scalar or `"auto"`. If TRUE, permutations are stratified within each fold (by factor level; numeric outcomes are binned into quantiles when enough non-missing values are available). If FALSE (default, as shown in Usage), no stratification is used. Stratification only applies when `coldata` supplies the outcome; otherwise labels are shuffled within each fold.
perm_refit: Logical scalar or `"auto"`. If FALSE, permutations keep predictions fixed and shuffle labels (association test). If TRUE, each permutation refits the model on permuted outcomes using `perm_refit_spec`. Refit-based permutations are slower but better approximate a full null distribution. The default is `"auto"`, which refits only when `perm_refit_spec` is provided and `B` is less than or equal to `perm_refit_auto_max`; otherwise it falls back to fixed-prediction permutations.
perm_refit_auto_max: Integer scalar. Maximum `B` allowed for `perm_refit = "auto"` to trigger refitting. Defaults to 200.
perm_refit_spec: List of inputs used when `perm_refit = TRUE`. Required elements: `x` (data used for fitting) and `learner` (parsnip model_spec, workflow, or legacy learner). Optional elements: `outcome` (defaults to `fit@outcome`), `preprocess`, `learner_args`, `custom_learners`, `class_weights`, `positive_class`, and `parallel`. Survival outcomes are not supported for refit-based permutations.
perm_mode: Optional character scalar to override the permutation mode used for restricted shuffles. One of `"subject_grouped"`, `"batch_blocked"`, `"study_loocv"`, or `"time_series"`. Defaults to the split metadata when available (including rsample-derived modes).
time_block: Character scalar, `"circular"` or `"stationary"`. Controls block permutation for `time_series` splits; ignored for other split modes. Default is `"circular"`.
block_len: Integer scalar or NULL. Block length for time-series permutations. NULL selects `max(5, floor(0.1 * fold_size))`. Larger values preserve more temporal structure and yield a more conservative null.
include_z: Logical scalar. If TRUE (default), include the z-score for the permutation gap when a standard error is available; if FALSE, `z` is NA.
ci_method: Character scalar, `"if"` or `"bootstrap"`. Controls how the standard error and confidence interval for the permutation gap are estimated. Default is `"if"`. `"if"` uses an influence-function estimate when available; `"bootstrap"` resamples permutation values `boot_B` times. Failed estimates yield NA.
boot_B: Integer scalar. Number of bootstrap resamples when `ci_method = "bootstrap"` (default 400). Larger values are more stable but slower.
parallel: Logical scalar. If TRUE and `future.apply` is available, permutations run in parallel. Results should match sequential execution. Default is FALSE.
seed: Integer scalar. Random seed used for permutations and bootstrap resampling; changing it changes the randomization but not the observed metric. Default is 1.
return_perm: Logical scalar. If TRUE (default), stores the permutation distribution in `audit@perm_values`. Set FALSE to reduce memory use.
batch_cols: Character vector. Names of `coldata` columns to test for association with fold assignment. If NULL, defaults to any of `"batch"`, `"plate"`, `"center"`, `"site"`, `"study"` found in `coldata`. Changing this controls which batch tests appear in `batch_assoc`.
coldata: Optional data.frame of sample-level metadata. Rows must align to prediction ids via row names, a `row_id` column, or row order. Used to build restricted permutations (when the outcome column is present), compute batch associations, and supply outcomes for target scans. If NULL, uses `fit@splits@info$coldata` when available. If alignment fails, restricted permutations are disabled with a warning.
X_ref: Optional numeric matrix/data.frame (samples x features). Used for duplicate detection and the target leakage scan. If NULL, uses `fit@info$X_ref` when available. Rows must align to sample ids (split order) via row names, a `row_id` column, or row order; misalignment disables these checks.
target_scan: Logical scalar. If TRUE (default), computes per-feature outcome associations on `X_ref` and flags proxy features; if FALSE, or if `X_ref`/outcomes are unavailable, `target_assoc` is empty. Not available for survival outcomes.
target_scan_multivariate: Logical scalar. If TRUE (default), fits a simple multivariate/interaction model on `X_ref` using the stored splits and reports a permutation-based score/p-value. This is slower and only implemented for binomial and gaussian tasks.
target_scan_multivariate_B: Integer scalar. Number of permutations for the multivariate scan (default 100). Larger values stabilize the p-value.
target_scan_multivariate_components: Integer scalar. Maximum number of principal components used in the multivariate scan (default 10).
target_scan_multivariate_interactions: Logical scalar. If TRUE (default), adds pairwise interactions among the top components in the multivariate scan.
target_threshold: Numeric scalar in (0,1). Threshold applied to the association score used to flag proxy features. Higher values are stricter. Default is 0.9.
feature_space: Character scalar, `"raw"` or `"rank"`. If `"rank"`, each row of `X_ref` is rank-transformed before similarity calculations. This affects duplicate detection only. Default is `"raw"`.
sim_method: Character scalar, `"cosine"` or `"pearson"`. Similarity metric for duplicate detection. `"pearson"` row-centers before cosine. Default is `"cosine"`.
sim_threshold: Numeric scalar in (0,1). Similarity cutoff for reporting duplicate pairs (default 0.995). Higher values yield fewer pairs.
nn_k: Integer scalar. For large datasets (`n > 3000`) with `RANN` installed, checks only the nearest `nn_k` neighbors per row. Larger values increase sensitivity but slow the search. Ignored when full comparisons are used. Default is 50.
max_pairs: Integer scalar. Maximum number of duplicate pairs returned. If more pairs are found, only the most similar are kept. This does not affect permutation results. Default is 5000.
duplicate_scope: Character scalar. One of `"train_test"` (default) or `"all"`. `"train_test"` retains only near-duplicate pairs that appear in train vs test in at least one repeat; `"all"` reports all near-duplicate pairs in `X_ref` regardless of fold assignment.
learner: Optional character scalar. When predictions include multiple learner IDs, selects the learner to audit. If NULL and multiple learners are present, the function errors; if predictions lack learner IDs, this argument is ignored with a warning. Default is NULL.
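
The two argument defaults that depend on other inputs, `perm_refit = "auto"` and `block_len = NULL`, can be sketched in base R as follows. The helper names `resolve_perm_refit()` and `default_block_len()` are hypothetical and exist only for illustration; they restate the rules described above and are not the package's internal code.

```r
# Hypothetical helpers; names are illustrative and not part of the package.

# Decide whether permutations refit the model, following the documented rule:
# "auto" refits only when a spec is supplied and B <= perm_refit_auto_max.
resolve_perm_refit <- function(perm_refit, B, perm_refit_spec = NULL,
                               perm_refit_auto_max = 200) {
  if (identical(perm_refit, "auto")) {
    return(!is.null(perm_refit_spec) && B <= perm_refit_auto_max)
  }
  isTRUE(perm_refit)
}

# Default block length for time-series permutations when block_len is NULL.
default_block_len <- function(fold_size, block_len = NULL) {
  if (!is.null(block_len)) return(as.integer(block_len))
  as.integer(max(5, floor(0.1 * fold_size)))
}

# A placeholder spec with the two required elements, x and learner.
spec <- list(x = data.frame(x1 = 1:3), learner = "glm")

refit_small_B <- resolve_perm_refit("auto", B = 200, perm_refit_spec = spec)
refit_large_B <- resolve_perm_refit("auto", B = 500, perm_refit_spec = spec)
refit_no_spec <- resolve_perm_refit("auto", B = 10)

len_small_fold <- default_block_len(30)
len_large_fold <- default_block_len(200)
```

Under these rules, only the first call refits; raising `B` above `perm_refit_auto_max` or omitting the spec falls back to fixed-prediction permutations.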

Details

The `permutation_gap` slot reports `metric_obs`, `perm_mean`, `perm_sd`, `gap`, `z`, `p_value`, and `n_perm`. The gap is defined as `metric_obs - perm_mean` for metrics where higher is better (AUC, PR-AUC, accuracy, macro-F1, C-index) and `perm_mean - metric_obs` for RMSE/log-loss. By default, `perm_refit = "auto"` refits models when refit data are available and `B` is not too large; otherwise it keeps predictions fixed and shuffles labels. Fixed-prediction permutations quantify prediction-label association rather than a full refit null. Set `perm_refit = FALSE` to force fixed predictions, or `perm_refit = TRUE` (with `perm_refit_spec`) to always refit.
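
As a rough illustration of the fixed-prediction case, the base R sketch below keeps predictions fixed, shuffles labels within each fold, and assembles a one-row summary resembling `permutation_gap`. The functions `auc_rank()` and `perm_gap_fixed()` are hypothetical, the simulated data are made up, and the add-one p-value is an assumption; the package may compute its p-value, standard error, and `z` differently.

```r
# Rank-based AUC (Mann-Whitney formulation); y is 0/1, score is numeric.
auc_rank <- function(y, score) {
  r <- rank(score)                     # average ranks handle ties
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Fixed-prediction permutation gap: predictions stay put, labels are
# shuffled within each fold. Assumes a higher-is-better metric.
perm_gap_fixed <- function(y, score, fold, B = 200, seed = 1) {
  set.seed(seed)
  metric_obs <- auc_rank(y, score)
  perm_values <- replicate(B, {
    # ave() applies the shuffle within each fold and keeps row order
    y_perm <- ave(y, fold, FUN = function(v) v[sample.int(length(v))])
    auc_rank(y_perm, score)
  })
  data.frame(
    metric_obs = metric_obs,
    perm_mean  = mean(perm_values),
    perm_sd    = stats::sd(perm_values),
    gap        = metric_obs - mean(perm_values),
    # Assumption: add-one p-value; the package may compute this differently.
    p_value    = (1 + sum(perm_values >= metric_obs)) / (B + 1),
    n_perm     = B
  )
}

# Simulated example: scores that carry real signal about the labels.
set.seed(42)
n <- 120
fold  <- rep(1:4, length.out = n)
y     <- rbinom(n, 1, 0.5)
score <- y + rnorm(n)
res <- perm_gap_fixed(y, score, fold, B = 200)
res
```

For a loss metric such as RMSE or log-loss, the sign flips: the gap becomes `perm_mean - metric_obs`, and the tail count would use `perm_values <= metric_obs`.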

`batch_assoc` contains chi-square tests between fold assignment and each `batch_cols` variable (`stat`, `df`, `pval`, `cramer_v`). `target_assoc` reports feature-wise outcome associations on `X_ref`; numeric features use AUC (binomial), `eta_sq` (multiclass), or correlation (gaussian), while categorical features use Cramer's V (binomial/multiclass) or `eta_sq` from a one-way ANOVA (gaussian). The `score` column is the scaled effect size used for flagging (`flag = score >= target_threshold`). The univariate target leakage scan can miss multivariate proxies, interaction leakage, or features not included in `X_ref`. The multivariate scan (enabled by default for supported tasks) adds a model-based proxy check but still only covers features present in `X_ref`.
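
The fold-versus-batch test can be approximated in base R with `stats::chisq.test()` and the standard Cramer's V formula. This is a minimal sketch under the assumption that no small-sample or bias correction is applied; `fold_batch_assoc()` is a hypothetical helper, and the toy designs are made up.

```r
# Chi-square test and Cramer's V between fold assignment and a batch variable.
fold_batch_assoc <- function(fold, batch) {
  tab  <- table(fold, batch)
  test <- suppressWarnings(stats::chisq.test(tab, correct = FALSE))
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  data.frame(
    stat     = unname(test$statistic),
    df       = unname(test$parameter),
    pval     = test$p.value,
    cramer_v = sqrt(unname(test$statistic) / (n * k))
  )
}

# Confounded design: each fold holds exactly one batch.
fold  <- rep(1:3, each = 20)
batch <- rep(c("A", "B", "C"), each = 20)
confounded <- fold_batch_assoc(fold, batch)

# Near-balanced design: every batch is spread almost evenly over the folds.
batch_bal <- rep(c("A", "B", "C"), times = 20)
balanced <- fold_batch_assoc(fold, batch_bal)
```

In the confounded design Cramer's V reaches its maximum of 1, which is the pattern a small `pval` in `batch_assoc` is meant to surface; in the near-balanced design it stays close to 0.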

Duplicate detection compares rows of `X_ref` using the chosen `sim_method` (cosine on L2-normalized rows, or Pearson via row-centering), optionally after rank transformation (`feature_space = "rank"`). By default, `duplicate_scope = "train_test"` filters to pairs that appear in train vs test in at least one repeat; set `duplicate_scope = "all"` to include within-fold duplicates. The `duplicates` slot returns index pairs and similarity values for near-duplicate samples. Only duplicates present in `X_ref` can be detected, and checks are skipped if inputs cannot be aligned to splits.
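
A small base R sketch of the similarity step may help make this concrete. It computes all pairwise similarities (no nearest-neighbor shortcut) and uses a single train/test indicator instead of repeated folds; both are simplifications. The helpers `row_similarity()` and `find_duplicates()` and the `is_test` vector are hypothetical.

```r
# Pairwise row similarity: cosine on L2-normalized rows, or Pearson by
# centering each row first. Optional per-row rank transform.
row_similarity <- function(X, sim_method = c("cosine", "pearson"),
                           feature_space = c("raw", "rank")) {
  sim_method <- match.arg(sim_method)
  feature_space <- match.arg(feature_space)
  X <- as.matrix(X)
  if (feature_space == "rank") X <- t(apply(X, 1, rank))
  if (sim_method == "pearson") X <- X - rowMeans(X)
  X <- X / sqrt(rowSums(X^2))          # L2-normalize each row
  tcrossprod(X)                        # n x n similarity matrix
}

find_duplicates <- function(X, is_test, sim_threshold = 0.995, ...) {
  S <- row_similarity(X, ...)
  # Keep each unordered pair once (i < j) when it meets the threshold.
  idx <- which(S >= sim_threshold & upper.tri(S), arr.ind = TRUE)
  data.frame(
    i = idx[, 1], j = idx[, 2], sim = S[idx],
    # Simplification: one train/test split instead of repeated folds.
    in_train_test = is_test[idx[, 1]] != is_test[idx[, 2]]
  )
}

set.seed(1)
X <- matrix(rnorm(6 * 20), nrow = 6)
X[5, ] <- X[2, ] + rnorm(20, sd = 1e-3)   # row 5 nearly copies row 2
is_test <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE)
dups <- find_duplicates(X, is_test)
dups_pearson <- find_duplicates(X, is_test, sim_method = "pearson")
```

Here the planted pair (rows 2 and 5) straddles the train/test boundary, so it would be retained under `duplicate_scope = "train_test"` as well as under `"all"`.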

Examples

set.seed(1)
df <- data.frame(
  subject = rep(1:6, each = 2),
  outcome = rbinom(12, 1, 0.5),
  x1 = rnorm(12),
  x2 = rnorm(12)
)

splits <- make_split_plan(df, outcome = "outcome",
                          mode = "subject_grouped", group = "subject", v = 3,
                          progress = FALSE)

custom <- list(
  glm = list(
    fit = function(x, y, task, weights, ...) {
      stats::glm(y ~ ., data = as.data.frame(x),
                 family = stats::binomial(), weights = weights)
    },
    predict = function(object, newdata, task, ...) {
      as.numeric(stats::predict(object,
                                newdata = as.data.frame(newdata),
                                type = "response"))
    }
  )
)

fit <- fit_resample(df, outcome = "outcome", splits = splits,
                    learner = "glm", custom_learners = custom,
                    metrics = "auc", refit = FALSE, seed = 1)

audit <- audit_leakage(fit, metric = "auc", B = 10,
                       X_ref = df[, c("x1", "x2")])
