fit_resample: Fit and evaluate with leakage guards over predefined splits

Description

Performs cross-validated model training and evaluation using leakage-protected preprocessing (.guard_fit) and user-specified learners.

Usage

fit_resample(
  x,
  outcome,
  splits,
  preprocess = list(
    impute = list(method = "median"),
    normalize = list(method = "zscore"),
    filter = list(var_thresh = 0, iqr_thresh = 0),
    fs = list(method = "none")
  ),
  learner = c("glmnet", "ranger"),
  learner_args = list(),
  custom_learners = list(),
  metrics = c("auc", "pr_auc", "accuracy"),
  class_weights = NULL,
  positive_class = NULL,
  classification_threshold = 0.5,
  parallel = FALSE,
  refit = TRUE,
  seed = 1,
  split_cols = "auto",
  store_refit_data = TRUE
)

Value

A LeakFit S4 object containing:

splits: The LeakSplits object used for resampling.
metrics: Data.frame of per-fold, per-learner performance metrics with columns fold, learner, and one column per requested metric.
metric_summary: Data.frame summarizing metrics across folds for each learner, with a learner column plus <metric>_mean and <metric>_sd columns for each requested metric.
audit: Data.frame with per-fold audit information including fold, n_train, n_test, learner, and features_final (number of features after preprocessing).
predictions: List of data.frames containing out-of-fold predictions with columns id (sample identifier), truth (true outcome), pred (predicted value or probability), fold, and learner. For classification tasks, a pred_class column is also included; for multiclass tasks, per-class probability columns are included as well.
preprocess: List of preprocessing state objects from each fold, storing imputation parameters, normalization statistics, and feature selection results.
learners: List of fitted model objects from each fold.
outcome: Character string naming the outcome variable.
task: Character string indicating the task type ("binomial", "multiclass", "gaussian", or "survival").
feature_names: Character vector of feature names after preprocessing.
info: List of additional metadata including hash, metrics_used, class_weights, positive_class, sample_ids, fold_status, refit, final_model (refitted model if refit = TRUE), final_preprocess, learner_names, and perm_refit_spec (for permutation-based audits).

Use summary() to print a formatted report, or access slots directly with @.
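
As a rough illustration of how metric_summary relates to metrics, the base R sketch below aggregates a made-up per-fold table into <metric>_mean and <metric>_sd columns. The data values are hypothetical, and the use of the sample standard deviation (stats::sd) is an assumption; the package may compute its summary differently.

```r
# Hypothetical per-fold metrics, shaped like the documented `metrics` slot
# (columns: fold, learner, one column per requested metric).
metrics <- data.frame(
  fold    = rep(1:3, times = 2),
  learner = rep(c("glmnet", "ranger"), each = 3),
  auc     = c(0.70, 0.80, 0.90, 0.60, 0.65, 0.70)
)

# Mean and (sample) standard deviation of the metric across folds, per learner.
auc_mean <- tapply(metrics$auc, metrics$learner, mean)
auc_sd   <- tapply(metrics$auc, metrics$learner, stats::sd)

# Assemble a table with the documented column naming scheme.
metric_summary <- data.frame(
  learner  = names(auc_mean),
  auc_mean = as.numeric(auc_mean),
  auc_sd   = as.numeric(auc_sd)
)
print(metric_summary)
```

On a real fit, the corresponding tables would be read from the object with @, for example fit@metrics and fit@metric_summary.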

Arguments

x: SummarizedExperiment or matrix/data.frame
outcome: outcome column name (if x is SE or data.frame), or a length-2 character vector of time/event column names for survival outcomes.
splits: LeakSplits object from make_split_plan(), or an `rsample` rset/rsplit.
preprocess: list(impute, normalize, filter=list(...), fs) or a `recipes::recipe` object. When a recipe is supplied, the guarded preprocessing pipeline is bypassed and the recipe is prepped on training data only.
learner: parsnip model_spec (or list of model_spec objects) describing the model(s) to fit, or a `workflows::workflow`. For legacy use, a character vector of learner names (e.g., "glmnet", "ranger") or custom learner IDs is still supported.
learner_args: list of additional arguments passed to legacy learners (ignored when `learner` is a parsnip model_spec).
custom_learners: named list of custom learner definitions used only with legacy character learners. Each entry must contain fit and predict functions. The fit function should accept x, y, task, and weights, and return a model object. The predict function should accept object, newdata, and task. For binomial/regression/survival tasks it should return a numeric vector; for multiclass tasks it should return either class labels or a matrix/data.frame of class probabilities.
metrics: named list of metric functions, vector of metric names, or a `yardstick::metric_set`. When a yardstick metric set (or list of yardstick metric functions) is supplied, metrics are computed using yardstick with the positive class set to the second factor level.
class_weights: optional named numeric vector of weights for binomial or multiclass outcomes
positive_class: optional value indicating the positive class for binomial outcomes. When set, the outcome levels are reordered so that positive_class is treated as the positive class (level 2). If NULL, the second factor level is used.
classification_threshold: Numeric threshold in [0, 1] used to convert binomial probabilities into class predictions for pred_class and accuracy metrics. Ignored for non-binomial tasks.
parallel: logical; if TRUE, use future.apply for multicore execution.
refit: logical; if TRUE, retrain the final model on the full data.
seed: integer seed used for reproducibility.
split_cols: Optional named list or character vector, or `"auto"` (default), giving the group/batch/study/time column names to use when `splits` is an rsample object that lacks these attributes. Supported names are `group`, `batch`, `study`, and `time`. `"auto"` falls back to common metadata column names (e.g., `group`, `subject`, `batch`, `study`, `time`).
store_refit_data: Logical; when TRUE (default), stores the original data and learner configuration inside the fit to enable refit-based permutation tests without manual `perm_refit_spec` setup.
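
The interaction between positive_class and classification_threshold can be sketched in base R as follows. This is only an illustration of the documented convention (positive class as the second factor level), not the package's internal code; the labels and probabilities are made up, and treating a probability exactly equal to the threshold as positive (>=) is an assumption.

```r
# Hypothetical binary outcome; alphabetical ordering makes "case" level 1.
y <- factor(c("case", "control", "control", "case"))

# Reorder levels so the chosen positive class becomes the second level,
# mirroring what the positive_class argument is documented to do.
positive_class <- "case"
y <- factor(y, levels = c(setdiff(levels(y), positive_class), positive_class))

# Hypothetical predicted probabilities for the positive class.
prob <- c(0.10, 0.45, 0.50, 0.80)

# Convert probabilities to class labels at the chosen cutoff.
# Assumption: ties at the threshold go to the positive class.
threshold <- 0.5
pred_class <- factor(
  ifelse(prob >= threshold, levels(y)[2], levels(y)[1]),
  levels = levels(y)
)
print(data.frame(prob = prob, pred_class = pred_class))
```

Lowering the threshold trades specificity for sensitivity; it affects pred_class and accuracy, while threshold-free metrics such as AUC are unchanged.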

Details

Preprocessing is fit on the training fold and applied to the test fold, preventing leakage from global imputation, scaling, or feature selection. When a `recipes::recipe` or `workflows::workflow` is supplied, the recipe is prepped on the training fold and baked on the test fold. For data.frame or matrix inputs, columns used to define splits (outcome, group, batch, study, time) are excluded from the predictor matrix.

Use learner_args to pass model-specific arguments, either as a named list keyed by learner or a single list applied to all learners. For custom learners, learner_args[[name]] may be a list with fit and predict sublists to pass distinct arguments to each stage.

For binomial tasks, predictions and metrics assume the positive class is the second factor level; use positive_class to control this. Use classification_threshold to change the probability cutoff used for class labels and accuracy.

Parsnip learners must support probability predictions for binomial metrics (AUC/PR-AUC/accuracy) and multiclass log-loss when requested.
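
The idea of fitting preprocessing on the training fold only can be sketched in base R as below. This is a simplified stand-in using median imputation and z-score scaling (the defaults named in Usage), not the actual .guard_fit implementation; the data, the fold indices, and the helper impute_with() are hypothetical.

```r
set.seed(1)
x <- matrix(rnorm(40), nrow = 10, dimnames = list(NULL, paste0("x", 1:4)))
x[2, 1] <- NA  # missing value that falls in the training fold
x[9, 2] <- NA  # missing value that falls in the test fold

# Hypothetical single train/test split.
x_train <- x[1:7, , drop = FALSE]
x_test  <- x[8:10, , drop = FALSE]

# Hypothetical helper: fill NAs column by column with precomputed values.
impute_with <- function(m, fill) {
  for (j in seq_len(ncol(m))) {
    m[is.na(m[, j]), j] <- fill[j]
  }
  m
}

# Learn every preprocessing parameter from the training fold only.
med <- apply(x_train, 2, stats::median, na.rm = TRUE)
x_train_imp <- impute_with(x_train, med)
mu  <- colMeans(x_train_imp)
sdv <- apply(x_train_imp, 2, stats::sd)

# Apply the training-fold parameters to both folds.
x_train_z <- sweep(sweep(x_train_imp, 2, mu, "-"), 2, sdv, "/")
x_test_z  <- sweep(sweep(impute_with(x_test, med), 2, mu, "-"), 2, sdv, "/")
```

The test fold never contributes to med, mu, or sdv, so its scaled columns will generally not have mean 0 and standard deviation 1. Computing those statistics on the full matrix before splitting is the kind of leakage this function is designed to prevent.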

Examples

set.seed(1)
df <- data.frame(
  subject = rep(1:10, each = 2),
  outcome = rbinom(20, 1, 0.5),
  x1 = rnorm(20),
  x2 = rnorm(20)
)
splits <- make_split_plan(df, outcome = "outcome",
                          mode = "subject_grouped", group = "subject", v = 5)

# glmnet learner (requires glmnet package)
fit <- fit_resample(df, outcome = "outcome", splits = splits,
                    learner = "glmnet", metrics = "auc")
summary(fit)

# Custom learner (logistic regression) - no extra packages needed
custom <- list(
  glm = list(
    fit = function(x, y, task, weights, ...) {
      stats::glm(y ~ ., data = as.data.frame(x),
                 family = stats::binomial(), weights = weights)
    },
    predict = function(object, newdata, task, ...) {
      as.numeric(stats::predict(object, newdata = as.data.frame(newdata), type = "response"))
    }
  )
)
fit2 <- fit_resample(df, outcome = "outcome", splits = splits,
                     learner = "glm", custom_learners = custom,
                     metrics = "accuracy")

summary(fit2)
