cv.TSLA: Cross validation for TSLA

Description

Conduct cross validation to select the optimal tuning parameters in TSLA.

Usage

cv.TSLA(
  y,
  X_1 = NULL,
  X_2,
  treemat,
  family = c("ls", "logit"),
  penalty = c("CL2", "RFS-Sum"),
  pred.loss = c("MSE", "AUC", "deviance"),
  gamma.init = NULL,
  weight = NULL,
  nfolds = 5,
  group.weight = NULL,
  feature.weight = NULL,
  control = list(),
  modstr = list()
)

Value

A list of cross validation results.

lambda.min: \(\lambda\) value with best prediction performance.
alpha.min: \(\alpha\) value with best prediction performance.
cvm: A (number-of-lambda x number-of-alpha) matrix storing the mean cross validation loss across folds.
cvsd: A (number-of-lambda x number-of-alpha) matrix storing the standard deviation of the cross validation loss across folds.
TSLA.fit: Outputs from TSLA.fit().
Intercept.min: Intercept corresponding to (lambda.min, alpha.min).
cov.min: Coefficients of unpenalized features corresponding to (lambda.min, alpha.min).
beta.min: Coefficients of binary features corresponding to (lambda.min, alpha.min).
gamma.min: Node coefficients corresponding to (lambda.min, alpha.min).
groupnorm.min: Group norms of node coefficients corresponding to (lambda.min, alpha.min).
lambda.min.index: Index of the best \(\lambda\) in the sequence.
alpha.min.index: Index of the best \(\alpha\) in the sequence.
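
The relationship between cvm and the two index fields can be illustrated with a small sketch. It is not the package's internal code: the matrix values and the lambda.seq and alpha.seq vectors are made up, and it assumes a loss where smaller is better (such as MSE or deviance) with lambda along the rows and alpha along the columns.

```r
# Toy stand-in for cvm: 3 lambda values (rows) by 2 alpha values (columns).
# These numbers are invented for illustration only.
cvm <- matrix(c(0.90, 0.40, 0.70,
                0.80, 0.55, 0.60),
              nrow = 3, ncol = 2)
lambda.seq <- c(1, 0.1, 0.01)   # hypothetical lambda sequence
alpha.seq  <- c(0, 1)           # hypothetical alpha sequence

# Assumption: smaller values are better. Locate the smallest entry and
# read off its row (lambda) and column (alpha) position.
best <- which(cvm == min(cvm), arr.ind = TRUE)[1, ]
lambda.min.index <- unname(best["row"])
alpha.min.index  <- unname(best["col"])

# Map the indices back to the tuning parameter values.
lambda.min <- lambda.seq[lambda.min.index]
alpha.min  <- alpha.seq[alpha.min.index]
```

If the stored metric were one where larger is better, max() would take the place of min() in this sketch.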

Arguments

y: Response in matrix form, continuous for family = "ls" and binary (0/1) for family = "logit".
X_1: Design matrix for unpenalized features (excluding the intercept). Must be a matrix.
X_2: Expanded design matrix for penalty = "CL2"; original design matrix for penalty = "RFS-Sum". Must be a matrix.
treemat: Expanded tree structure in matrix form for penalty = "CL2"; Original structure for penalty = "RFS-Sum".
family: Two options. Use "ls" for least squares problems and "logit" for logistic regression problems.
penalty: Two options for group penalty on \(\gamma\), "CL2" or "RFS-Sum".
pred.loss: Model performance metric. If family = "ls", the default is "MSE" (mean squared error). If family = "logit", the default is "AUC"; "deviance" is also available for the logistic model.
gamma.init: Initial value for the optimization. Default is a zero vector. Its length should equal 1 + ncol(X_1) + ncol(A). See get_tree_obj() for details of A.
weight: A vector of length two, used for logistic regression only. The first element is the weight for y = 1 and the second element is the weight for y = 0.
nfolds: Number of cross validation folds. Default is 5.
group.weight: User-defined weights for the group penalty. Must be a vector whose length equals the number of groups.
feature.weight: User-defined weights for each predictor after expansion.
control: A list of parameters controlling algorithm convergence. Default values: tol = 1e-5, the convergence tolerance; maxit = 10000, the maximum number of iterations; mu = 1e-3, the smoothness parameter in SPG.
modstr: A list of settings for the tuning parameters. Default values: lambda = NULL, in which case the package generates a default lambda sequence; lambda.min.ratio = 1e-04, the smallest lambda value as a fraction of lambda.max (determined by default when lambda is NULL); nlambda = 50, the number of lambda values, equally spaced on the log scale, used when lambda is NULL; alpha = seq(0, 1, length.out = 10), the sequence of alpha values. Here alpha is the tuning parameter for the generalized lasso penalty and 1 - alpha is the tuning parameter for the group lasso penalty.
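
As a rough illustration of the modstr defaults, the sketch below builds a log-spaced lambda sequence and pairs it with the default alpha sequence. The value of lambda.max is hypothetical (the package derives its own), and the decreasing order of the sequence is an assumption rather than documented behavior.

```r
lambda.max       <- 2       # hypothetical; the package computes its own value
lambda.min.ratio <- 1e-04   # default from modstr
nlambda          <- 50      # default from modstr

# Equal spacing on the log scale between lambda.max and
# lambda.max * lambda.min.ratio (ordering here is an assumption).
lambda.seq <- exp(seq(log(lambda.max),
                      log(lambda.max * lambda.min.ratio),
                      length.out = nlambda))

# Default alpha sequence from modstr.
alpha.seq <- seq(0, 1, length.out = 10)

# Each (lambda, alpha) pair is one candidate evaluated by cross validation.
grid <- expand.grid(lambda = lambda.seq, alpha = alpha.seq)
nrow(grid)
```

With the defaults this grid has 50 x 10 candidates per fold, which is why the example on this page shrinks nlambda and the alpha sequence to keep the run short.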

Examples


# Load the package and the synthetic data
library(TSLA)
data(ClassificationExample)

tree.org <- ClassificationExample$tree.org   # original tree structure
x2.org <- ClassificationExample$x.org      # original design matrix
x1 <- ClassificationExample$x1
y <- ClassificationExample$y            # response

# Do the tree-guided expansion
expand.data <- getetmat(tree.org, x2.org)
x2 <- expand.data$x.expand              # expanded design matrix
tree.expand <- expand.data$tree.expand  # expanded tree structure

# Do train-test split
idtrain <- 1:200
x1.train <- as.matrix(x1[idtrain, ])
x2.train <- x2[idtrain, ]
y.train <- y[idtrain, ]
x1.test <- as.matrix(x1[-idtrain, ])
x2.test <- x2[-idtrain, ]
y.test <- y[-idtrain, ]

# specify some model parameters
set.seed(100)
control <- list(maxit = 100, mu = 1e-3, tol = 1e-5, verbose = FALSE)
modstr <- list(nlambda = 5,  alpha = seq(0, 1, length.out = 5))
simu.cv <- cv.TSLA(y = y.train, X_1 = x1.train,
                   X_2 = x2.train,
                   treemat = tree.expand, family = 'logit',
                   penalty = 'CL2', pred.loss = 'AUC',
                   gamma.init = NULL, weight = c(1, 1), nfolds = 5,
                   group.weight = NULL, feature.weight = NULL,
                   control = control, modstr =  modstr)
# Predict on the test set with the selected tuning parameters and report the test AUC.
rmid <- simu.cv$TSLA.fit$rmid  # all-zero columns to remove
if (length(rmid) > 0) {
  x2.test <- x2.test[, -rmid]
}
y.new <- predict_cvTSLA(simu.cv, x1.test, x2.test)
library(pROC)
auc(as.vector(y.test), as.vector(y.new))
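
For readers without pROC installed, the AUC reported above can be approximated in base R with the rank-based (Mann-Whitney) formula. This is a minimal sketch, not part of TSLA: auc_rank is a hypothetical helper, and the labels and scores are toy values. Note that pROC chooses the comparison direction automatically by default, so its result can differ when scores are inversely related to the labels.

```r
# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) identity.
# labels must be 0/1 and scores are predicted probabilities or scores.
auc_rank <- function(labels, scores) {
  stopifnot(length(labels) == length(scores))
  r  <- rank(scores)            # average ranks handle tied scores
  n1 <- sum(labels == 1)        # number of positives
  n0 <- sum(labels == 0)        # number of negatives
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy data for illustration only.
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
auc_rank(labels, scores)   # 0.75
```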
