Learn R Programming

SMMAL (version 0.0.5)

SMMAL_ada_lasso: Adaptive LASSO with Cross-Validation

Description

Performs adaptive LASSO for binary outcomes: it first fits a ridge regression to compute per-coefficient penalty factors, then runs cross-validated lasso fits over a grid of lambda values.

Usage

SMMAL_ada_lasso(
  X,
  Y,
  X_full,
  foldid,
  foldid_labelled,
  sub_set,
  labeled_indices,
  nfold,
  log_loss
)

Value

A list of length equal to the number of ridge penalty values provided by param_fun(). Each element is a numeric vector of length n containing cross-validated predicted probabilities for the best lambda under that ridge penalty.

Arguments

X

A numeric matrix of predictors (n observations × p features).

Y

A numeric or integer vector of binary outcomes (length n).

X_full

The full matrix of predictors for all observations.

foldid

A vector assigning each observation (labeled or unlabeled) to a fold.

foldid_labelled

An integer vector (length n) of fold assignments for labeled observations. Values should run from 1 to nfold; other values (e.g., NA) indicate unlabeled or held-out rows.

sub_set

A logical or integer vector indicating which rows of X and Y are used in the supervised cross-validation step.

labeled_indices

An integer or logical vector indicating which rows have non-missing outcomes.

nfold

A single integer specifying the number of CV folds (e.g., 5 or 10).

log_loss

A function of the form function(true_labels, pred_probs) that returns the log loss as a single numeric value.

Details

This function expects that a parameter-generating function param_fun() is available in the package, returning a list with elements $ridge (a vector of ridge penalty values) and $lambda (a vector of lasso penalty values). Internally, it:

  1. Fits a ridge-penalized logistic regression on all data to obtain coefficients.

  2. Computes penalty factors as 1 / (abs(coef) + 1e-4).

  3. For each ridge value, runs nfold-fold CV over the lambda grid with glmnet(..., alpha = 1).

  4. Records predictions on held-out folds, computes log-loss for each lambda, and selects the lambda with minimum log-loss.

  5. Returns a list of CV-predicted probability vectors (one vector per ridge value).
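The penalty-factor computation (step 2) and the log-loss-based lambda selection (step 4) can be sketched in base R. This is an illustrative sketch, not SMMAL's actual internals: ridge_coef, lambdas, and cv_preds below are made-up stand-ins for the ridge coefficients, the lambda grid from param_fun(), and the held-out glmnet predictions.

```r
# Step 2 (sketch): small ridge coefficients get large penalty factors,
# so the adaptive lasso shrinks them more aggressively.
ridge_coef <- c(1.2, -0.8, 0.05, 0.001)      # hypothetical ridge fit
penalty_factor <- 1 / (abs(ridge_coef) + 1e-4)

# Step 4 (sketch): pick the lambda whose held-out predictions
# minimize log-loss.
log_loss <- function(true, pred) {
  eps <- 1e-15
  pred <- pmin(pmax(pred, eps), 1 - eps)    # guard against log(0)
  -mean(true * log(pred) + (1 - true) * log(1 - pred))
}
lambdas <- c(0.01, 0.1, 1)                   # hypothetical lambda grid
y <- c(1, 0, 1, 1, 0)
# one column of held-out predicted probabilities per lambda:
cv_preds <- cbind(c(0.9, 0.2, 0.8, 0.7, 0.3),
                  c(0.8, 0.3, 0.7, 0.6, 0.4),
                  c(0.6, 0.5, 0.6, 0.5, 0.5))
losses <- apply(cv_preds, 2, function(p) log_loss(y, p))
best_lambda <- lambdas[which.min(losses)]
```

In the real function, cv_preds would come from glmnet(..., alpha = 1, penalty.factor = penalty_factor) fits on the training folds, and the column of predictions for best_lambda is what each list element of the return value contains.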

Examples

# \donttest{
# Assume param_fun() is defined elsewhere and returns:
#   list(ridge = c(0.01, 0.1, 1), lambda = exp(seq(log(0.001), log(1), length = 50)))

# Simulate small data:
set.seed(123)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), nrow = n)
true_beta <- c(rep(1.5, 3), rep(0, p - 3))
lin <- X %*% true_beta
probs <- 1 / (1 + exp(-lin))
Y <- rbinom(n, 1, probs)

# Create fold assignments for labeled observations:
labeled <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.8, 0.2))
foldid_labelled <- rep(NA_integer_, n)
foldid_labelled[labeled] <- sample(1:5, sum(labeled), replace = TRUE)
sub_set         <- labeled
labeled_indices <- which(labeled)

# For simplicity, assign foldid to all observations (labeled & unlabeled)
foldid <- sample(1:5, n, replace = TRUE)

# Define a simple log-loss function:
log_loss_fn <- function(true, pred) {
  eps <- 1e-15
  pred_clipped <- pmin(pmax(pred, eps), 1 - eps)
  -mean(true * log(pred_clipped) + (1 - true) * log(1 - pred_clipped))
}

# Call SMMAL_ada_lasso with all required args:
results <- SMMAL_ada_lasso(
  X = X,
  Y = Y,
  X_full = X,  # Here full data same as X for example
  foldid = foldid,
  foldid_labelled = foldid_labelled,
  sub_set = sub_set,
  labeled_indices = labeled_indices,
  nfold = 5,
  log_loss = log_loss_fn
)

# 'results' is a list (one element per ridge value), each a numeric vector of CV predictions.
# }