bioLeak (version 0.2.0)

fit_resample: Fit and evaluate with leakage guards over predefined splits

Description

Performs cross-validated model training and evaluation using leakage-protected preprocessing (.guard_fit) and user-specified learners.

Usage

fit_resample(
  x,
  outcome,
  splits,
  preprocess = list(
    impute = list(method = "median"),
    normalize = list(method = "zscore"),
    filter = list(var_thresh = 0, iqr_thresh = 0),
    fs = list(method = "none")
  ),
  learner = c("glmnet", "ranger"),
  learner_args = list(),
  custom_learners = list(),
  metrics = c("auc", "pr_auc", "accuracy"),
  class_weights = NULL,
  positive_class = NULL,
  classification_threshold = 0.5,
  parallel = FALSE,
  refit = TRUE,
  seed = 1,
  split_cols = "auto",
  store_refit_data = TRUE
)

Value

A LeakFit S4 object containing:

splits

The LeakSplits object used for resampling.

metrics

Data.frame of per-fold, per-learner performance metrics with columns fold, learner, and one column per requested metric.

metric_summary

Data.frame summarizing metrics across folds for each learner with columns learner, and <metric>_mean and <metric>_sd for each requested metric.

audit

Data.frame with per-fold audit information including fold, n_train, n_test, learner, and features_final (number of features after preprocessing).

predictions

List of data.frames containing out-of-fold predictions with columns id (sample identifier), truth (true outcome), pred (predicted value or probability), fold, and learner. For classification tasks, each data.frame also includes pred_class; for multiclass tasks, it also includes per-class probability columns.

preprocess

List of preprocessing state objects from each fold, storing imputation parameters, normalization statistics, and feature selection results.

learners

List of fitted model objects from each fold.

outcome

Character string naming the outcome variable.

task

Character string indicating the task type ("binomial", "multiclass", "gaussian", or "survival").

feature_names

Character vector of feature names after preprocessing.

info

List of additional metadata including hash, metrics_used, class_weights, positive_class, sample_ids, fold_status, refit, final_model (refitted model if refit = TRUE), final_preprocess, learner_names, and perm_refit_spec (for permutation-based audits).

Use summary() to print a formatted report, or access slots directly with @.
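
As a rough illustration of how the metrics and metric_summary tables described above relate, the base R sketch below builds a small per-fold table by hand and collapses it to per-learner means and standard deviations. The numbers are made up, and the use of the sample standard deviation (stats::sd) is an assumption rather than something this page confirms.

```r
# Hypothetical per-fold metrics table with the documented columns
# (fold, learner, one column per metric). Values are invented.
metrics <- data.frame(
  fold    = rep(1:3, times = 2),
  learner = rep(c("glmnet", "ranger"), each = 3),
  auc     = c(0.70, 0.80, 0.90, 0.60, 0.65, 0.70)
)

# Collapse across folds: one row per learner, <metric>_mean and <metric>_sd
auc_mean <- tapply(metrics$auc, metrics$learner, mean)
auc_sd   <- tapply(metrics$auc, metrics$learner, sd)  # assumption: sample SD

metric_summary <- data.frame(
  learner  = names(auc_mean),
  auc_mean = as.numeric(auc_mean),
  auc_sd   = as.numeric(auc_sd)
)
print(metric_summary)
```

With a real fit, the corresponding tables would be read from the object with fit@metrics and fit@metric_summary.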

Arguments

x

A SummarizedExperiment, matrix, or data.frame containing the data.

outcome

Name of the outcome column (if x is a SummarizedExperiment or data.frame), or a length-2 character vector of time/event column names for survival outcomes.

splits

LeakSplits object from make_split_plan(), or an `rsample` rset/rsplit.

preprocess

list(impute, normalize, filter=list(...), fs) or a `recipes::recipe` object. When a recipe is supplied, the guarded preprocessing pipeline is bypassed and the recipe is prepped on training data only.

learner

parsnip model_spec (or list of model_spec objects) describing the model(s) to fit, or a `workflows::workflow`. For legacy use, a character vector of learner names (e.g., "glmnet", "ranger") or custom learner IDs is still supported.

learner_args

list of additional arguments passed to legacy learners (ignored when `learner` is a parsnip model_spec).

custom_learners

named list of custom learner definitions used only with legacy character learners. Each entry must contain fit and predict functions. The fit function should accept x, y, task, and weights, and return a model object. The predict function should accept object, newdata, and task. For binomial/regression/survival tasks it should return a numeric vector; for multiclass tasks it should return either class labels or a matrix/data.frame of class probabilities.

metrics

named list of metric functions, vector of metric names, or a `yardstick::metric_set`. When a yardstick metric set (or list of yardstick metric functions) is supplied, metrics are computed using yardstick with the positive class set to the second factor level.

class_weights

Optional named numeric vector of weights for binomial or multiclass outcomes.

positive_class

optional value indicating the positive class for binomial outcomes. When set, the outcome levels are reordered so that positive_class is treated as the positive class (level 2). If NULL, the second factor level is used.

classification_threshold

Numeric threshold in [0, 1] used to convert binomial probabilities into class predictions for pred_class and accuracy metrics. Ignored for non-binomial tasks.

parallel

Logical; if TRUE, use future.apply for multicore execution.

refit

Logical; if TRUE, retrain the final model on the full data.

seed

Integer random seed, for reproducibility.

split_cols

Optional named list/character vector or `"auto"` (default) overriding group/batch/study/time column names when `splits` is an rsample object and its attributes are missing. `"auto"` falls back to common metadata column names (e.g., `group`, `subject`, `batch`, `study`, `time`). Supported names are `group`, `batch`, `study`, and `time`.

store_refit_data

Logical; when TRUE (default), stores the original data and learner configuration inside the fit to enable refit-based permutation tests without manual `perm_refit_spec` setup.
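
The interaction between positive_class and classification_threshold can be sketched in base R as follows. This is not the package's internal code: the probabilities are invented, and the choice of >= (rather than >) at the cutoff is an assumption made for the illustration.

```r
# Outcome with alphabetical levels: "case" is level 1, "control" is level 2
truth <- factor(c("case", "control", "case", "control"))
positive_class <- "case"

# Reorder levels so that the positive class becomes level 2
truth <- factor(
  truth,
  levels = c(setdiff(levels(truth), positive_class), positive_class)
)

# Hypothetical predicted probabilities of the positive class
prob <- c(0.9, 0.2, 0.4, 0.6)
classification_threshold <- 0.5

# Convert probabilities to class labels (assumption: >= at the cutoff)
pred_class <- factor(
  ifelse(prob >= classification_threshold, levels(truth)[2], levels(truth)[1]),
  levels = levels(truth)
)

accuracy <- mean(pred_class == truth)
```

Raising the threshold makes positive calls rarer, so accuracy and pred_class change while threshold-free metrics such as AUC do not.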

Details

Preprocessing is fit on the training fold and applied to the test fold, preventing leakage from global imputation, scaling, or feature selection. When a `recipes::recipe` or `workflows::workflow` is supplied, the recipe is prepped on the training fold and baked on the test fold. For data.frame or matrix inputs, columns used to define splits (outcome, group, batch, study, time) are excluded from the predictor matrix.

Use learner_args to pass model-specific arguments, either as a named list keyed by learner or a single list applied to all learners. For custom learners, learner_args[[name]] may be a list with fit and predict sublists to pass distinct arguments to each stage.

For binomial tasks, predictions and metrics assume the positive class is the second factor level; use positive_class to control this. Use classification_threshold to change the probability cutoff used for class labels and accuracy. Parsnip learners must support probability predictions for binomial metrics (AUC/PR-AUC/accuracy) and multiclass log-loss when requested.
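
The fold-wise preprocessing idea in this section (fit on the training fold, apply to the test fold) can be sketched in base R as below. This is a minimal illustration using median imputation and z-score scaling, not the package's .guard_fit implementation; fit_preprocess and apply_preprocess are hypothetical helper names, and the data are simulated.

```r
set.seed(1)
# Toy predictor matrix with two missing values (simulated data)
x <- matrix(rnorm(40), nrow = 20, dimnames = list(NULL, c("x1", "x2")))
x[c(3, 17), "x1"] <- NA
train_idx <- 1:15
test_idx  <- 16:20

# Hypothetical helper: learn imputation medians and z-score parameters
# from the training rows only
fit_preprocess <- function(x_train) {
  med <- apply(x_train, 2, median, na.rm = TRUE)
  for (j in seq_len(ncol(x_train))) {
    x_train[is.na(x_train[, j]), j] <- med[j]
  }
  list(
    median = med,
    center = colMeans(x_train),
    scale  = apply(x_train, 2, sd)
  )
}

# Hypothetical helper: apply previously learned parameters to any rows
apply_preprocess <- function(x_new, state) {
  for (j in seq_len(ncol(x_new))) {
    x_new[is.na(x_new[, j]), j] <- state$median[j]
  }
  scale(x_new, center = state$center, scale = state$scale)
}

state   <- fit_preprocess(x[train_idx, , drop = FALSE])
train_z <- apply_preprocess(x[train_idx, , drop = FALSE], state)
test_z  <- apply_preprocess(x[test_idx, , drop = FALSE], state)
```

The test rows never contribute to the medians, means, or standard deviations, so the missing value in row 17 is filled with a training-fold median and the test fold is scaled with training-fold statistics.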

Examples

set.seed(1)
df <- data.frame(
  subject = rep(1:10, each = 2),
  outcome = rbinom(20, 1, 0.5),
  x1 = rnorm(20),
  x2 = rnorm(20)
)
splits <- make_split_plan(df, outcome = "outcome",
                          mode = "subject_grouped", group = "subject", v = 5)

# glmnet learner (requires glmnet package)
fit <- fit_resample(df, outcome = "outcome", splits = splits,
                    learner = "glmnet", metrics = "auc")
summary(fit)

# Custom learner (logistic regression) - no extra packages needed
custom <- list(
  glm = list(
    fit = function(x, y, task, weights, ...) {
      stats::glm(y ~ ., data = as.data.frame(x),
                 family = stats::binomial(), weights = weights)
    },
    predict = function(object, newdata, task, ...) {
      as.numeric(stats::predict(object, newdata = as.data.frame(newdata), type = "response"))
    }
  )
)
fit2 <- fit_resample(df, outcome = "outcome", splits = splits,
                     learner = "glm", custom_learners = custom,
                     metrics = "accuracy")

summary(fit2)
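
Before passing a custom learner to fit_resample(), it can help to check that its fit and predict functions honour the documented contract on their own. The sketch below does this with base R only, on simulated data. How fit_resample() invokes these functions internally is an assumption here, inferred from the documented signatures (x, y, task, weights for fit; object, newdata, task for predict).

```r
set.seed(1)
# Same custom learner definition as in the example above
custom <- list(
  glm = list(
    fit = function(x, y, task, weights, ...) {
      stats::glm(y ~ ., data = as.data.frame(x),
                 family = stats::binomial(), weights = weights)
    },
    predict = function(object, newdata, task, ...) {
      as.numeric(stats::predict(object, newdata = as.data.frame(newdata),
                                type = "response"))
    }
  )
)

# Simulated predictors and binary outcome, with a manual train/test split
x <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
y <- rbinom(20, 1, 0.5)
train <- 1:15
test  <- 16:20

# Call the two stages directly, mimicking the documented arguments
model <- custom$glm$fit(x[train, ], y[train], task = "binomial", weights = NULL)
pred  <- custom$glm$predict(model, x[test, ], task = "binomial")

# A binomial learner should return one numeric probability per test row
stopifnot(is.numeric(pred), length(pred) == length(test))
```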
