SMMAL (version 0.0.5)

cf: Cross-Fitting with Model Selection and Log Loss Evaluation

Description

Trains and evaluates predictive models using cross-fitting across nfold folds, supporting multiple learner types. Outputs out-of-fold predictions and computes log_loss for each hyperparameter tuning round to select the best-performing model.

Usage

cf(
  Y,
  X,
  nfold,
  R,
  foldid,
  cf_model,
  sub_set = rep(TRUE, length(Y)),
  custom_model_fun = NULL
)

Value

A list containing:

models

(Currently a placeholder) List of trained models per fold and tuning round.

predictions

List of out-of-fold predictions for each of the 5 tuning rounds.

log_losses

Numeric vector of log loss values for each tuning round.

best_rounds_index

Integer index (1–5) of the round achieving the lowest log_loss.

best_rounds_log_losses

Minimum log_loss value achieved across rounds.

best_rounds_prediction

Vector of out-of-fold predictions from the best tuning round.

Arguments

Y

Numeric or factor vector. The response variable, either binary (0/1) or continuous. Only labelled observations (where R = 1) are used.

X

Matrix or data frame. Predictor variables used for model training.

nfold

Integer. Number of cross-fitting folds.

R

Binary vector. Indicator of labelled data: 1 = labelled, 0 = unlabelled.

foldid

Integer vector. Fold assignments for cross-fitting (length equal to the full dataset).

cf_model

Character string. Specifies the model type. Must be one of "xgboost", "bspline", or "randomforest".

sub_set

Logical vector. Indicates which labelled samples to include in training. Defaults to all TRUE (every observation is eligible).

custom_model_fun

Logical or function. If NULL or FALSE, adaptive-LASSO feature selection is bypassed; otherwise, two-stage tuning is enabled inside compute_parameter(). Defaults to NULL.

Details

The function supports three learner types (an illustrative sketch of the per-round tuning grids follows the list):

  • xgboost: Gradient-boosted trees, tuning gamma across rounds.

  • bspline: Logistic regression using B-spline basis expansions, tuning the number of knots.

  • randomforest: Random forests, tuning nodesize.
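
The five tuning rounds correspond to five candidate values of the learner-specific parameter named above. As an illustration only (these grids are assumptions, not the values hard-coded in SMMAL), the candidates could be laid out like this:

# Hypothetical per-round tuning grids (illustrative assumptions only;
# the values cf actually uses internally are not documented here)
tuning_grids <- list(
  xgboost      = data.frame(round = 1:5, gamma    = c(0, 0.5, 1, 2, 4)),
  bspline      = data.frame(round = 1:5, knots    = c(2, 4, 6, 8, 10)),
  randomforest = data.frame(round = 1:5, nodesize = c(1, 5, 10, 25, 50))
)
tuning_grids$xgboost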

Cross-fitting ensures that model evaluation is based on out-of-fold predictions, avoiding the optimistic bias of evaluating a model on its own training data. log_loss is used as the evaluation metric to identify the best hyperparameter setting.
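
For reference, the log loss used to compare rounds is the standard binary cross-entropy. A minimal sketch, assuming binary Y coded 0/1, predicted probabilities p, and an epsilon clip to keep log() finite (SMMAL's internal computation may differ in detail):

log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)          # clip so log() stays finite
  -mean(y * log(p) + (1 - y) * log(1 - p))  # binary cross-entropy
}

Lower values indicate better-calibrated predictions, so the best round is the one with the smallest log_loss.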

Examples

set.seed(123)
N <- 200
X <- matrix(rnorm(N * 5), nrow = N, ncol = 5)

# Simulate treatment assignment (not used by cf; shown only for context)
A <- rbinom(N, 1, plogis(X[, 1] - 0.5 * X[, 2]))

# Simulate outcome
Y_full <- rbinom(N, 1, plogis(0.5 * X[, 1] - 0.25 * X[, 3]))

# Introduce some missingness to simulate semi-supervised data
Y <- Y_full
Y[sample(1:N, size = N/4)] <- NA  # 25% missing

# Create R vector (labelled = 1, unlabelled = 0)
R <- ifelse(!is.na(Y), 1, 0)

# Cross-validation fold assignment
foldid <- sample(rep(1:5, length.out = N))

# Run cf with the xgboost learner (cf_model must be "xgboost",
# "bspline", or "randomforest")
result <- cf(Y = Y, X = X, nfold = 5, R = R, foldid = foldid, cf_model = "xgboost")

# Examine output
print(result$log_losses)
print(result$best_rounds_index)
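
# Out-of-fold predictions and log loss from the best-performing round
head(result$best_rounds_prediction)
print(result$best_rounds_log_losses)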
