roclearn: Fit a linear model

Description

Fit a regularized linear model for binary classification using a surrogate loss.

Usage

roclearn(
  X,
  y,
  lambda,
  penalty = "ridge",
  param.penalty = NULL,
  loss = "hinge",
  approx = NULL,
  intercept = TRUE,
  target.perf = list(),
  param.convergence = list()
)

Value

An object of class "roclearn", a list containing:

beta.hat — estimated coefficient vector.
intercept — fitted intercept (if applicable).
lambda, penalty, param.penalty, loss (the settings used for the fit).
approx, B (number of sampled pairs if approximation used).
time — training time (seconds).
nobs, p — number of observations and predictors.
converged, n.iter — convergence information.
preprocessing — details on categorical variables, removed columns, and column names.
call — the function call.

Arguments

X

Predictor matrix or data.frame (categorical variables are automatically one-hot encoded).

y

Response vector with class labels in {-1, 1}. Labels given as {0, 1} or as a two-level factor/character are automatically converted to this format.
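The label conversion described above can be pictured with a small base R sketch. The helper to_pm1 is hypothetical and not part of roclearn. Which factor level becomes the positive class is an assumption here (the second level), since this page does not state it.

```r
# Hypothetical helper (not part of roclearn): map common binary label
# encodings to {-1, 1}.
to_pm1 <- function(y) {
  if (is.factor(y) || is.character(y)) {
    y <- factor(y)                      # drops unused levels, sorts character labels
    stopifnot(nlevels(y) == 2)
    # Assumption: the second level is treated as the positive class.
    return(ifelse(as.integer(y) == 2L, 1, -1))
  }
  vals <- sort(unique(y))
  if (identical(as.numeric(vals), c(0, 1))) {
    return(ifelse(y == 1, 1, -1))       # {0, 1} coding
  }
  stopifnot(all(y %in% c(-1, 1)))       # already in {-1, 1}
  as.numeric(y)
}

to_pm1(c(0, 1, 1, 0))
to_pm1(c("case", "control", "case"))
```

When the choice of positive class matters, recoding y to {-1, 1} explicitly before calling roclearn removes any ambiguity.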

lambda

Positive scalar regularization parameter.

penalty

Regularization penalty type: "ridge" (default), "lasso", "elastic", "alasso", "scad", or "mcp".

param.penalty

Penalty-specific parameter:

Ignored for "ridge" and "lasso".
Mixing parameter \(\alpha \in (0,1)\) for "elastic". Default is 0.5.
Adaptive weight exponent \(\gamma > 0\) for "alasso". Default is 1.
Tuning parameter (default 3.7) for "scad" and "mcp".

loss

Surrogate loss function type. One of: "hinge" (default), "hinge2" (squared hinge), "logistic", or "exponential".
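For orientation, the four loss names correspond to well-known surrogate losses of a margin m. The sketch below uses the standard textbook forms; surrogate_loss is a hypothetical helper, and the exact scaling used inside roclearn is an assumption, not something this page confirms.

```r
# Textbook forms of the four surrogate losses as functions of a margin m.
# Hypothetical helper; roclearn's internal definitions may differ in scaling.
surrogate_loss <- function(m, loss = c("hinge", "hinge2", "logistic", "exponential")) {
  loss <- match.arg(loss)
  switch(loss,
    hinge       = pmax(0, 1 - m),       # max(0, 1 - m)
    hinge2      = pmax(0, 1 - m)^2,     # squared hinge
    logistic    = log1p(exp(-m)),       # log(1 + exp(-m))
    exponential = exp(-m)
  )
}

# Compare the losses on a few margins (one column per loss).
m <- c(-1, 0, 1, 2)
sapply(c("hinge", "hinge2", "logistic", "exponential"),
       function(l) surrogate_loss(m, l))
```

All four decrease as the margin grows, so each penalizes small or negative margins most heavily.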

approx

Logical; enables a scalable approximation to accelerate training. When NULL (the default), it is set to TRUE if nrow(X) >= 1000 and to FALSE otherwise. See Details for how the approximation is applied.

intercept

Logical; include an intercept in the model (default TRUE).

target.perf

List with target sensitivity and specificity used when estimating the intercept (defaults to 0.9 each).

param.convergence

List of convergence controls (e.g., maxiter, eps). Default is list(maxiter = 5e4, eps = 1e-4).

Details

For large datasets, fitting the model is computationally expensive because its loss is a U-statistic involving a double summation over pairs of observations. To reduce this burden, the package uses an algorithm based on an incomplete U-statistic, which approximates the loss with a single summation over B sampled pairs. This approximation substantially reduces computational cost and accelerates training while maintaining accuracy, making the model feasible for large datasets. It is enabled by setting approx = TRUE.
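The difference between the two forms of the loss can be illustrated with a minimal base R sketch. It assumes a pairwise hinge loss on score differences between positive and negative observations, and uniform sampling of pairs with replacement. Both are assumptions made for illustration; the package's actual pairing and sampling scheme is not described on this page.

```r
set.seed(1)

# Toy data: 30 positives and 70 negatives with 2 predictors each.
X_pos <- matrix(rnorm(60, mean = 1), ncol = 2)
X_neg <- matrix(rnorm(140, mean = -1), ncol = 2)
beta  <- c(0.5, 0.5)                    # arbitrary coefficient vector

hinge <- function(m) pmax(0, 1 - m)

s_pos <- drop(X_pos %*% beta)           # linear scores of positives
s_neg <- drop(X_neg %*% beta)           # linear scores of negatives

# Complete U-statistic: average over all n_pos * n_neg pairs (double sum).
full_loss <- mean(hinge(outer(s_pos, s_neg, "-")))

# Incomplete U-statistic: average over B randomly sampled pairs (single sum).
B <- 500
i <- sample(length(s_pos), B, replace = TRUE)
j <- sample(length(s_neg), B, replace = TRUE)
approx_loss <- mean(hinge(s_pos[i] - s_neg[j]))

c(full = full_loss, approx = approx_loss)
```

The complete version touches every positive-negative pair, so its cost grows with the product of the class sizes, while the sampled version costs only B evaluations regardless of n.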

Examples

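# Simulate an imbalanced two-class problem: 20% positives, 80% negatives
set.seed(123)
n <- 100
n_pos <- round(0.2 * n)
n_neg <- n - n_pos

# Two predictors; negatives are centered at -1, positives at 1
X <- rbind(
  matrix(rnorm(2 * n_neg, mean = -1), ncol = 2),
  matrix(rnorm(2 * n_pos, mean =  1), ncol = 2)
)
y <- c(rep(-1, n_neg), rep(1, n_pos))

# Fit a ridge-penalized model; approx = TRUE turns the approximation on even though n < 1000
fit <- roclearn(X, y, lambda = 0.1, penalty = "ridge", approx = TRUE)

# Inspect the estimated coefficients and the convergence flag
fit$beta.hat
fit$converged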