
bioLeak (version 0.2.0)

audit_leakage: Audit leakage and confounding

Description

Computes a post-hoc leakage audit for a resampled model fit. The audit (1) compares observed cross-validated performance to a label-permutation null (by default refitting when data are available; otherwise using fixed predictions), (2) tests whether fold assignments are associated with batch or study metadata (confounding by design), (3) scans features for unusually strong outcome proxies, and (4) flags duplicate or near-duplicate samples in a reference feature matrix.

The returned [LeakAudit] summarizes these diagnostics. It relies on the stored predictions, splits, and optional metadata; it does not refit models unless `perm_refit = TRUE` (or `perm_refit = "auto"` with a valid `perm_refit_spec`). Results are conditional on the chosen metric and supplied metadata/features and should be interpreted as diagnostics, not proof of leakage or its absence.

Usage

audit_leakage(
  fit,
  metric = c("auc", "pr_auc", "accuracy", "macro_f1", "log_loss", "rmse", "cindex"),
  B = 200,
  perm_stratify = FALSE,
  perm_refit = "auto",
  perm_refit_auto_max = 200,
  perm_refit_spec = NULL,
  perm_mode = NULL,
  time_block = c("circular", "stationary"),
  block_len = NULL,
  include_z = TRUE,
  ci_method = c("if", "bootstrap"),
  boot_B = 400,
  parallel = FALSE,
  seed = 1,
  return_perm = TRUE,
  batch_cols = NULL,
  coldata = NULL,
  X_ref = NULL,
  target_scan = TRUE,
  target_scan_multivariate = TRUE,
  target_scan_multivariate_B = 100,
  target_scan_multivariate_components = 10,
  target_scan_multivariate_interactions = TRUE,
  target_threshold = 0.9,
  feature_space = c("raw", "rank"),
  sim_method = c("cosine", "pearson"),
  sim_threshold = 0.995,
  nn_k = 50,
  max_pairs = 5000,
  duplicate_scope = c("train_test", "all"),
  learner = NULL
)

Value

A LeakAudit S4 object containing:

fit

The LeakFit object that was audited.

permutation_gap

One-row data.frame with columns: metric_obs (observed cross-validated metric), perm_mean (mean of the permuted metrics), perm_sd (their standard deviation), gap (observed minus permuted mean for higher-is-better metrics; permuted mean minus observed for loss metrics), z (standardized gap), p_value (permutation p-value), and n_perm (number of permutations). A large positive gap and small p-value suggest the model captures signal beyond random label assignment.

perm_values

Numeric vector of length B containing the metric value from each permutation. Useful for plotting the null distribution. Empty if return_perm = FALSE.

batch_assoc

Data.frame of chi-square association tests between fold assignment and batch/study metadata, with columns: variable, stat (chi-square statistic), df (degrees of freedom), pval, and cramer_v (effect size). Small p-values indicate potential confounding by design.

target_assoc

Data.frame of per-feature outcome associations with columns: feature, type ("numeric" or "categorical"), metric (AUC, correlation, eta_sq, or Cramer's V depending on task), value, score (scaled effect size), p_value, n, and flag (TRUE if score >= target_threshold). Flagged features may indicate target leakage.

duplicates

Data.frame of near-duplicate sample pairs with columns: i, j (row indices in X_ref), sim (similarity value), and in_train_test (whether the two samples are split across train and test). Duplicates split across train and test can inflate performance.

trail

List capturing audit parameters and intermediate results for reproducibility, including metric, B, seed, perm_stratify, perm_refit, and timing info.

info

List with additional metadata including multivariate scan results when target_scan_multivariate = TRUE.

Use summary() to print a human-readable report, or access slots directly with @.

Arguments

fit

A [LeakFit] object from [fit_resample()] containing cross-validated predictions and split metadata. If predictions include learner IDs for multiple models, you must supply `learner` to select one; if learner IDs are absent, the audit uses all predictions and may mix learners.

metric

Character scalar. One of `"auc"`, `"pr_auc"`, `"accuracy"`, `"macro_f1"`, `"log_loss"`, `"rmse"`, or `"cindex"`. Defaults to `"auc"`. This controls the observed performance statistic, the permutation null, and the sign of the reported gap.

B

Integer scalar. Number of permutations used to build the null distribution (default 200). Larger values reduce Monte Carlo error but increase runtime.

perm_stratify

Logical scalar or `"auto"`. If TRUE, permutations are stratified within each fold (by factor level; numeric outcomes are binned into quantiles when enough non-missing values are available). If FALSE (the default shown in Usage), no stratification is used. Stratification only applies when `coldata` supplies the outcome; otherwise labels are shuffled within each fold.

perm_refit

Logical scalar or `"auto"`. If FALSE, permutations keep predictions fixed and shuffle labels (association test). If TRUE, each permutation refits the model on permuted outcomes using `perm_refit_spec`. Refit-based permutations are slower but better approximate a full null distribution. The default is `"auto"`, which refits only when `perm_refit_spec` is provided and `B` is less than or equal to `perm_refit_auto_max`; otherwise it falls back to fixed-prediction permutations.

perm_refit_auto_max

Integer scalar. Maximum `B` allowed for `perm_refit = "auto"` to trigger refitting. Defaults to 200.

perm_refit_spec

List of inputs used when `perm_refit = TRUE`, or when `perm_refit = "auto"` triggers refitting. Required elements: `x` (data used for fitting) and `learner` (parsnip model_spec, workflow, or legacy learner). Optional elements: `outcome` (defaults to `fit@outcome`), `preprocess`, `learner_args`, `custom_learners`, `class_weights`, `positive_class`, and `parallel`. Survival outcomes are not supported for refit-based permutations.

perm_mode

Optional character scalar to override the permutation mode used for restricted shuffles. One of `"subject_grouped"`, `"batch_blocked"`, `"study_loocv"`, or `"time_series"`. Defaults to the split metadata when available (including rsample-derived modes).

time_block

Character scalar, `"circular"` or `"stationary"`. Controls block permutation for `time_series` splits; ignored for other split modes. Default is `"circular"`.

block_len

Integer scalar or NULL. Block length for time-series permutations. NULL selects `max(5, floor(0.1 * fold_size))`. Larger values preserve more temporal structure and yield a more conservative null.

include_z

Logical scalar. If TRUE (default), include the z-score for the permutation gap when a standard error is available; if FALSE, `z` is NA.

ci_method

Character scalar, `"if"` or `"bootstrap"`. Controls how the standard error and confidence interval for the permutation gap are estimated. Default is `"if"`. `"if"` uses an influence-function estimate when available; `"bootstrap"` resamples permutation values `boot_B` times. Failed estimates yield NA.

boot_B

Integer scalar. Number of bootstrap resamples when `ci_method = "bootstrap"` (default 400). Larger values are more stable but slower.

parallel

Logical scalar. If TRUE and `future.apply` is available, permutations run in parallel. Results should match sequential execution. Default is FALSE.

seed

Integer scalar. Random seed used for permutations and bootstrap resampling; changing it changes the randomization but not the observed metric. Default is 1.

return_perm

Logical scalar. If TRUE (default), stores the permutation distribution in `audit@perm_values`. Set FALSE to reduce memory use.

batch_cols

Character vector. Names of `coldata` columns to test for association with fold assignment. If NULL, defaults to any of `"batch"`, `"plate"`, `"center"`, `"site"`, `"study"` found in `coldata`. Changing this controls which batch tests appear in `batch_assoc`.

coldata

Optional data.frame of sample-level metadata. Rows must align to prediction ids via row names, a `row_id` column, or row order. Used to build restricted permutations (when the outcome column is present), compute batch associations, and supply outcomes for target scans. If NULL, uses `fit@splits@info$coldata` when available. If alignment fails, restricted permutations are disabled with a warning.

X_ref

Optional numeric matrix/data.frame (samples x features). Used for duplicate detection and the target leakage scan. If NULL, uses `fit@info$X_ref` when available. Rows must align to sample ids (split order) via row names, a `row_id` column, or row order; misalignment disables these checks.

target_scan

Logical scalar. If TRUE (default), computes per-feature outcome associations on `X_ref` and flags proxy features; if FALSE, or if `X_ref`/outcomes are unavailable, `target_assoc` is empty. Not available for survival outcomes.

target_scan_multivariate

Logical scalar. If TRUE (default), fits a simple multivariate/interaction model on `X_ref` using the stored splits and reports a permutation-based score/p-value. This is slower and only implemented for binomial and gaussian tasks.

target_scan_multivariate_B

Integer scalar. Number of permutations for the multivariate scan (default 100). Larger values stabilize the p-value.

target_scan_multivariate_components

Integer scalar. Maximum number of principal components used in the multivariate scan (default 10).

target_scan_multivariate_interactions

Logical scalar. If TRUE (default), adds pairwise interactions among the top components in the multivariate scan.

target_threshold

Numeric scalar in (0,1). Threshold applied to the association score used to flag proxy features. Higher values are stricter. Default is 0.9.

feature_space

Character scalar, `"raw"` or `"rank"`. If `"rank"`, each row of `X_ref` is rank-transformed before similarity calculations. This affects duplicate detection only. Default is `"raw"`.

sim_method

Character scalar, `"cosine"` or `"pearson"`. Similarity metric for duplicate detection. `"pearson"` row-centers before cosine. Default is `"cosine"`.

sim_threshold

Numeric scalar in (0,1). Similarity cutoff for reporting duplicate pairs (default 0.995). Higher values yield fewer pairs.

nn_k

Integer scalar. For large datasets (`n > 3000`) with `RANN` installed, checks only the nearest `nn_k` neighbors per row. Larger values increase sensitivity but slow the search. Ignored when full comparisons are used. Default is 50.

max_pairs

Integer scalar. Maximum number of duplicate pairs returned. If more pairs are found, only the most similar are kept. This does not affect permutation results. Default is 5000.

duplicate_scope

Character scalar. One of `"train_test"` (default) or `"all"`. `"train_test"` retains only near-duplicate pairs that appear in train vs test in at least one repeat; `"all"` reports all near-duplicate pairs in `X_ref` regardless of fold assignment.

learner

Optional character scalar. When predictions include multiple learner IDs, selects the learner to audit. If NULL and multiple learners are present, the function errors; if predictions lack learner IDs, this argument is ignored with a warning. Default is NULL.
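Two of the defaults described in the arguments above are simple rules: when `perm_refit = "auto"` refits, and which block length `block_len = NULL` selects. The sketch below restates them in base R. The helper names `resolve_perm_refit()` and `default_block_len()` are hypothetical and are not part of bioLeak; the sketch also treats any non-NULL `perm_refit_spec` as valid, which is a simplification.

```r
# Hypothetical helpers that restate the documented defaults.
# They are illustrative only and are not bioLeak functions.

# Returns TRUE when permutations would refit the model.
resolve_perm_refit <- function(perm_refit = "auto", perm_refit_spec = NULL,
                               B = 200, perm_refit_auto_max = 200) {
  if (identical(perm_refit, "auto")) {
    # "auto" refits only if a spec is supplied and B is small enough
    # (assumption: a non-NULL spec counts as valid)
    return(!is.null(perm_refit_spec) && B <= perm_refit_auto_max)
  }
  isTRUE(perm_refit)
}

# Returns the block length used for time-series permutations.
default_block_len <- function(fold_size, block_len = NULL) {
  if (is.null(block_len)) max(5, floor(0.1 * fold_size)) else block_len
}

spec <- list(x = data.frame(x1 = 1:3), learner = "glm")  # placeholder spec
resolve_perm_refit()                                  # no spec supplied
resolve_perm_refit(perm_refit_spec = spec, B = 200)   # spec supplied, B at the cap
resolve_perm_refit(perm_refit_spec = spec, B = 500)   # B above the cap
default_block_len(fold_size = 30)                     # floor of 5 applies
default_block_len(fold_size = 120)                    # 10% of the fold size
```

Under these rules, raising `B` above `perm_refit_auto_max` without setting `perm_refit = TRUE` switches the audit to fixed-prediction permutations.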

Details

The `permutation_gap` slot reports `metric_obs`, `perm_mean`, `perm_sd`, `gap`, `z`, `p_value`, and `n_perm`. The gap is defined as `metric_obs - perm_mean` for metrics where higher is better (AUC, PR-AUC, accuracy, macro-F1, C-index) and `perm_mean - metric_obs` for RMSE/log-loss, so a positive gap always means the model outperforms the null. By default, `perm_refit = "auto"` refits models when `perm_refit_spec` is provided and `B <= perm_refit_auto_max`; otherwise it keeps predictions fixed and shuffles labels. Fixed-prediction permutations quantify prediction-label association rather than a full refit null. Set `perm_refit = FALSE` to force fixed predictions, or `perm_refit = TRUE` (with `perm_refit_spec`) to always refit.
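To make the fixed-prediction null concrete, here is a minimal base-R sketch on simulated data. It is not bioLeak's internal code: the `auc_rank()` helper is hypothetical, the shuffle is unrestricted (no fold or group structure), and the add-one p-value convention is an assumption, since the formula is not stated here.

```r
# Illustrative sketch of a fixed-prediction permutation gap for AUC.
# Rank-based (Mann-Whitney) AUC; `y` is 0/1, `p` is a predicted score.
auc_rank <- function(y, p) {
  r <- rank(p)                       # average ranks handle ties
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
n <- 200
y <- rbinom(n, 1, 0.5)
p <- plogis(1.5 * y + rnorm(n))      # simulated predictions with real signal

B <- 200
metric_obs <- auc_rank(y, p)

# Fixed-prediction null: keep the predictions, shuffle the labels
perm_values <- replicate(B, auc_rank(sample(y), p))

perm_mean <- mean(perm_values)
perm_sd <- sd(perm_values)
gap <- metric_obs - perm_mean        # AUC is a higher-is-better metric

# Assumed convention: add-one correction so the p-value is never exactly 0
p_value <- (1 + sum(perm_values >= metric_obs)) / (B + 1)

data.frame(metric_obs, perm_mean, perm_sd, gap, p_value, n_perm = B)
```

For a loss metric such as RMSE, the sign of the gap and the direction of the comparison in the p-value would both flip.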

`batch_assoc` contains chi-square tests between fold assignment and each `batch_cols` variable (`stat`, `df`, `pval`, `cramer_v`). `target_assoc` reports feature-wise outcome associations on `X_ref`; numeric features use AUC (binomial), `eta_sq` (multiclass), or correlation (gaussian), while categorical features use Cramer's V (binomial/multiclass) or `eta_sq` from a one-way ANOVA (gaussian). The `score` column is the scaled effect size used for flagging (`flag = score >= target_threshold`). The univariate target leakage scan can miss multivariate proxies, interaction leakage, or features not included in `X_ref`. The multivariate scan (enabled by default for supported tasks) adds a model-based proxy check but still only covers features present in `X_ref`.
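The fold-versus-batch test described above can be illustrated with base R's `chisq.test()`. This is a sketch under stated assumptions, not the package's implementation: `fold_batch_assoc()` is a hypothetical helper, the data are simulated, and Cramer's V is computed with the standard formula `sqrt(chi2 / (n * (min(rows, cols) - 1)))`.

```r
# Illustrative sketch: association between fold assignment and batch.
fold_batch_assoc <- function(fold, batch) {
  tab <- table(fold, batch)
  test <- suppressWarnings(stats::chisq.test(tab, correct = FALSE))
  k <- min(nrow(tab), ncol(tab)) - 1
  data.frame(
    stat = unname(test$statistic),
    df = unname(test$parameter),
    pval = test$p.value,
    cramer_v = sqrt(unname(test$statistic) / (sum(tab) * k))
  )
}

batch <- factor(rep(c("A", "B", "C"), each = 40))

# Confounded design: every batch sits entirely in its own fold
fold_confounded <- as.integer(batch)

# Near-balanced design: folds cycle 1, 2, 3 across all samples
fold_balanced <- rep(1:3, times = 40)

res_confounded <- fold_batch_assoc(fold_confounded, batch)
res_balanced <- fold_batch_assoc(fold_balanced, batch)

rbind(confounded = res_confounded, balanced = res_balanced)
```

In the confounded design the contingency table is diagonal, so Cramer's V reaches its maximum of 1; in the balanced design the statistic is close to zero and the p-value is large.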

Duplicate detection compares rows of `X_ref` using the chosen `sim_method` (cosine on L2-normalized rows, or Pearson via row-centering), optionally after rank transformation (`feature_space = "rank"`). By default, `duplicate_scope = "train_test"` filters to pairs that appear in train vs test in at least one repeat; set `duplicate_scope = "all"` to include within-fold duplicates. The `duplicates` slot returns index pairs and similarity values for near-duplicate samples. Only duplicates present in `X_ref` can be detected, and checks are skipped if inputs cannot be aligned to splits.
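The two similarity options can be sketched in a few lines of base R: cosine is the inner product of L2-normalized rows, and Pearson is the same computation after row-centering. The `row_similarity()` helper and the simulated matrix below are hypothetical, and the sketch uses a full pairwise comparison without the nearest-neighbor shortcut or the train/test filter.

```r
# Illustrative sketch of near-duplicate detection between rows of a matrix.
row_similarity <- function(X, sim_method = c("cosine", "pearson")) {
  sim_method <- match.arg(sim_method)
  X <- as.matrix(X)
  if (sim_method == "pearson") X <- X - rowMeans(X)  # row-center first
  X <- X / sqrt(rowSums(X^2))                         # L2-normalize each row
  tcrossprod(X)                                       # all pairwise row similarities
}

set.seed(1)
X <- matrix(rnorm(6 * 20), nrow = 6)
X[6, ] <- X[1, ] + rnorm(20, sd = 0.01)  # row 6 is a noisy copy of row 1

sim_threshold <- 0.995
S <- row_similarity(X, "cosine")

# Keep each unordered pair once (upper triangle) and apply the cutoff
idx <- which(S >= sim_threshold & upper.tri(S), arr.ind = TRUE)
dups <- data.frame(i = idx[, "row"], j = idx[, "col"], sim = S[idx])
dups
```

Setting `feature_space = "rank"` would correspond to replacing each row of `X` with its ranks (for example via `t(apply(X, 1, rank))`) before calling the helper.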

Examples

library(bioLeak)

set.seed(1)
df <- data.frame(
  subject = rep(1:6, each = 2),
  outcome = rbinom(12, 1, 0.5),
  x1 = rnorm(12),
  x2 = rnorm(12)
)

splits <- make_split_plan(df, outcome = "outcome",
                      mode = "subject_grouped", group = "subject", v = 3,
                      progress = FALSE)

custom <- list(
  glm = list(
    fit = function(x, y, task, weights, ...) {
      stats::glm(y ~ ., data = as.data.frame(x),
                 family = stats::binomial(), weights = weights)
    },
    predict = function(object, newdata, task, ...) {
      as.numeric(stats::predict(object,
                                newdata = as.data.frame(newdata),
                                type = "response"))
    }
  )
)

fit <- fit_resample(df, outcome = "outcome", splits = splits,
                    learner = "glm", custom_learners = custom,
                    metrics = "auc", refit = FALSE, seed = 1)

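# No perm_refit_spec is supplied, so perm_refit = "auto" falls back to
# fixed-prediction permutations. B = 10 keeps the example fast.
audit <- audit_leakage(fit, metric = "auc", B = 10,
                       X_ref = df[, c("x1", "x2")])

# Print a human-readable report
summary(audit)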
