SMMAL (version 0.0.5)

cf: Cross-Fitting with Model Selection and Log Loss Evaluation

Description

Trains and evaluates predictive models using cross-fitting across nfold folds, supporting multiple learner types. Outputs out-of-fold predictions and computes log_loss for each hyperparameter tuning round to select the best-performing model.

Usage

cf(
  Y,
  X,
  nfold,
  R,
  foldid,
  cf_model,
  sub_set = rep(TRUE, length(Y)),
  custom_model_fun = NULL
)

Value

A list containing:

models

(Currently a placeholder) List of trained models per fold and tuning round.

predictions

List of out-of-fold predictions for each of the 5 tuning rounds.

log_losses

Numeric vector of log loss values for each tuning round.

best_rounds_index

Integer index (1–5) of the round achieving the lowest log_loss.

best_rounds_log_losses

Minimum log_loss value achieved across rounds.

best_rounds_prediction

Vector of out-of-fold predictions from the best tuning round.

Arguments

Y

Numeric or factor vector. The response variable, either binary (0/1) or continuous. Only labelled observations (where R = 1) are used.

X

Matrix or data frame. Predictor variables used for model training.

nfold

Integer. Number of cross-fitting folds.

R

Binary vector. Indicator of labelled data: 1 = labelled, 0 = unlabelled.

foldid

Integer vector. Fold assignments for cross-fitting (length equal to the full dataset).

cf_model

Character string. Specifies the model type. Must be one of "xgboost", "bspline", or "randomforest".

sub_set

Logical vector. Indicates which labelled samples to include in training. Defaults to all TRUE (every observation is eligible).

custom_model_fun

Logical or function. If NULL or FALSE, adaptive-LASSO feature selection is bypassed; otherwise, two-stage tuning is enabled inside compute_parameter(). Defaults to NULL.

Details

The function supports three learner types (an illustrative sketch of the per-round tuning grids follows the list):

  • xgboost: Gradient-boosted trees, tuning gamma across rounds.

  • bspline: Logistic regression using B-spline basis expansions, tuning the number of knots.

  • randomforest: Random forests, tuning nodesize.
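
The five tuning rounds correspond to five candidate values of the learner-specific parameter named above. As an illustration only (these grids are assumptions, not the values hard-coded in SMMAL), the candidates could be laid out like this:

# Hypothetical per-round tuning grids (illustrative assumptions only;
# the values cf actually uses internally are not documented here)
tuning_grids <- list(
  xgboost      = data.frame(round = 1:5, gamma    = c(0, 0.5, 1, 2, 4)),
  bspline      = data.frame(round = 1:5, knots    = c(2, 4, 6, 8, 10)),
  randomforest = data.frame(round = 1:5, nodesize = c(1, 5, 10, 25, 50))
)
tuning_grids$xgboost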

Cross-fitting ensures that model evaluation is based on out-of-fold predictions, avoiding the optimistic bias of evaluating a model on its own training data. log_loss is used as the evaluation metric to identify the best hyperparameter setting.
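
For reference, the log loss used to compare rounds is the standard binary cross-entropy. A minimal sketch, assuming binary Y coded 0/1, predicted probabilities p, and an epsilon clip to keep log() finite (SMMAL's internal computation may differ in detail):

log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)          # clip so log() stays finite
  -mean(y * log(p) + (1 - y) * log(1 - p))  # binary cross-entropy
}

Lower values indicate better-calibrated predictions, so the best round is the one with the smallest log_loss.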

Examples

set.seed(123)
N <- 200
X <- matrix(rnorm(N * 5), nrow = N, ncol = 5)

# Simulate treatment assignment (not used by cf; shown only for context)
A <- rbinom(N, 1, plogis(X[, 1] - 0.5 * X[, 2]))

# Simulate outcome
Y_full <- rbinom(N, 1, plogis(0.5 * X[, 1] - 0.25 * X[, 3]))

# Introduce some missingness to simulate semi-supervised data
Y <- Y_full
Y[sample(1:N, size = N/4)] <- NA  # 25% missing

# Create R vector (labelled = 1, unlabelled = 0)
R <- ifelse(!is.na(Y), 1, 0)

# Cross-validation fold assignment
foldid <- sample(rep(1:5, length.out = N))

# Run cf with the xgboost learner (cf_model must be "xgboost",
# "bspline", or "randomforest")
result <- cf(Y = Y, X = X, nfold = 5, R = R, foldid = foldid, cf_model = "xgboost")

# Examine output
print(result$log_losses)
print(result$best_rounds_index)
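
# Out-of-fold predictions and log loss from the best-performing round
head(result$best_rounds_prediction)
print(result$best_rounds_log_losses)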
