### Simulate a design matrix with i.i.d. standard-normal entries
set.seed(2023)
n <- 50  # number of observations
p <- 30  # number of predictors
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
### Example 1, LASSO penalty
### Sparse true coefficients: soft-threshold N(0, 1) draws at level 1.5.
### vapply over the vector replaces the original contrived p x 1 matrix +
### apply; rnorm() draws the identical values, so beta is unchanged.
### (assumes fdaSP::softhresh returns a single numeric per input — TODO confirm)
beta <- vapply(rnorm(p, sd = 1), fdaSP::softhresh, numeric(1), 1.5)
### Response: signal plus noise with sd = ||X %*% beta|| / 20
y <- X %*% beta + rnorm(n, sd = sqrt(crossprod(X %*% beta)) / 20)
### ADMM hyper-parameters (forwarded to the solver via the control list)
maxit      <- 1000   # maximum number of iterations
adaptation <- TRUE   # adaptation flag passed to the ADMM control list
rho        <- 1      # augmented-Lagrangian parameter
reltol     <- 1e-5   # relative convergence tolerance
abstol     <- 1e-5   # absolute convergence tolerance
### 5-fold cross-validation for the LASSO over a 30-point lambda grid
mod_cv <- lmSP_cv(
  X = X, y = y, penalty = "LASSO",
  standardize.data = FALSE, intercept = FALSE,
  cv.fold = 5, nlambda = 30,
  control = list(adaptation = adaptation, rho = rho,
                 maxit = maxit, reltol = reltol,
                 abstol = abstol, print.out = FALSE)
)
### graphical presentation
### CV prediction-error curve against log(lambda), with +/- 1 SE band
plot(log(mod_cv$lambda), mod_cv$mse, type = "l", col = "blue", lwd = 2, bty = "n",
     xlab = latex2exp::TeX("$\\log(\\lambda)$"), ylab = "Prediction Error",
     ylim = range(mod_cv$mse - mod_cv$mse.sd, mod_cv$mse + mod_cv$mse.sd),
     main = "Cross-validated Prediction Error")
fdaSP::confband(xV = log(mod_cv$lambda), yVmin = mod_cv$mse - mod_cv$mse.sd,
                yVmax = mod_cv$mse + mod_cv$mse.sd)
### Mark lambda.min directly; the original
### lambda[which(lambda == lambda.min)] relied on floating-point ==
abline(v = log(mod_cv$lambda.min), col = "red", lwd = 1.0)
### comparison with oracle error: refit on the full data, same settings
mod <- lmSP(
  X = X, y = y, penalty = "LASSO",
  standardize.data = FALSE, intercept = FALSE,
  nlambda = 30,
  control = list(adaptation = adaptation, rho = rho,
                 maxit = maxit, reltol = reltol,
                 abstol = abstol, print.out = FALSE)
)
### Squared estimation error ||beta_hat(lambda) - beta||^2 along the path,
### vectorized: colSums over t(path) replaces the per-row apply()
### (beta of length p recycles down each p-long column of the transpose)
err_mod <- colSums((t(mod$sp.coef.path) - beta)^2)
plot(log(mod$lambda), err_mod, type = "l", col = "blue", lwd = 2,
     xlab = latex2exp::TeX("$\\log(\\lambda)$"),
     ylab = "Estimation Error", main = "True Estimation Error", bty = "n")
### Oracle-optimal lambda: which.min() instead of which(x == min(x)),
### which compares floats with == and can return multiple indices on ties
abline(v = log(mod$lambda[which.min(err_mod)]), col = "red", lwd = 1.0)
### Cross-validated lambda.min for comparison (dashed); taken directly
### rather than via a floating-point == self-lookup
abline(v = log(mod_cv$lambda.min), col = "red", lwd = 1.0, lty = 2)
### Example 2, sparse group-LASSO penalty
### True coefficients: a block of 12 fours, zeros, and a lone -2 at position p
beta <- c(rep(4, 12), rep(0, p - 13), -2)
y <- X %*% beta + rnorm(n, sd = sqrt(crossprod(X %*% beta)) / 20)
### 10 groups of 3 consecutive predictors each
group1 <- rep(seq_len(10), each = 3)
### Regularization grid: 30 log-spaced values from 10 down to 0.01
lam <- 10^seq(1, -2, length.out = 30)
### Mixing parameter of the sparse group-LASSO penalty
alpha <- 0.5
### ADMM hyper-parameters
maxit      <- 1000   # maximum number of iterations
adaptation <- TRUE   # adaptation flag passed to the ADMM control list
rho        <- 1      # augmented-Lagrangian parameter
reltol     <- 1e-5   # relative convergence tolerance
abstol     <- 1e-5   # absolute convergence tolerance
### run cross-validation for the sparse group-LASSO
mod_cv <- lmSP_cv(X = X, y = y, penalty = "spGLASSO",
                  groups = group1, cv.fold = 5,
                  standardize.data = FALSE, intercept = FALSE,
                  ### use the alpha variable set above instead of a
                  ### hard-coded 0.5, so the two stay in sync
                  lambda = lam, alpha = alpha,
                  control = list(adaptation = adaptation,
                                 rho = rho,
                                 maxit = maxit, reltol = reltol,
                                 abstol = abstol,
                                 print.out = FALSE))
### graphical presentation
### CV prediction-error curve against log(lambda), with +/- 1 SE band
plot(log(mod_cv$lambda), mod_cv$mse, type = "l", col = "blue", lwd = 2, bty = "n",
     xlab = latex2exp::TeX("$\\log(\\lambda)$"), ylab = "Prediction Error",
     ylim = range(mod_cv$mse - mod_cv$mse.sd, mod_cv$mse + mod_cv$mse.sd),
     main = "Cross-validated Prediction Error")
fdaSP::confband(xV = log(mod_cv$lambda), yVmin = mod_cv$mse - mod_cv$mse.sd,
                yVmax = mod_cv$mse + mod_cv$mse.sd)
### Mark lambda.min directly; the original
### lambda[which(lambda == lambda.min)] relied on floating-point ==
abline(v = log(mod_cv$lambda.min), col = "red", lwd = 1.0)
### comparison with oracle error: refit on the full data, same settings
mod <- lmSP(X = X, y = y,
            penalty = "spGLASSO",
            groups = group1,
            standardize.data = FALSE,
            intercept = FALSE,
            lambda = lam,
            ### reuse the alpha variable defined above instead of
            ### hard-coding 0.5, so the two stay in sync
            alpha = alpha,
            control = list(adaptation = adaptation, rho = rho,
                           maxit = maxit, reltol = reltol, abstol = abstol,
                           print.out = FALSE))
### Squared estimation error ||beta_hat(lambda) - beta||^2 along the path,
### vectorized: colSums over t(path) replaces the per-row apply()
### (beta of length p recycles down each p-long column of the transpose)
err_mod <- colSums((t(mod$sp.coef.path) - beta)^2)
plot(log(mod$lambda), err_mod, type = "l", col = "blue", lwd = 2,
     xlab = latex2exp::TeX("$\\log(\\lambda)$"),
     ylab = "Estimation Error", main = "True Estimation Error", bty = "n")
### Oracle-optimal lambda: which.min() instead of which(x == min(x)),
### which compares floats with == and can return multiple indices on ties
abline(v = log(mod$lambda[which.min(err_mod)]), col = "red", lwd = 1.0)
### Cross-validated lambda.min for comparison (dashed); taken directly
### rather than via a floating-point == self-lookup
abline(v = log(mod_cv$lambda.min), col = "red", lwd = 1.0, lty = 2)
### Run the code above in your browser using DataLab