BMsel: Biomarker selection in a Cox regression model

Description

This function fits a Cox regression model, in either a prognostic or a biomarker-by-treatment interaction setting, and applies a selection procedure to perform variable selection.

Usage

BMsel(data, x, y, z, tt, inter, std.x = TRUE, std.i = FALSE, std.tt = TRUE, 
  method = c('alassoL', 'alassoR', 'alassoU', 'enet', 'gboost', 
    'glasso', 'lasso', 'lasso-1se', 'lasso-AIC', 'lasso-BIC', 
    'lasso-HQIC', 'lasso-pct', 'lasso-pcvl','lasso-RIC', 'modCov',
    'PCAlasso', 'PLSlasso', 'ridge', 'ridgelasso', 'stabSel', 'uniFDR'), 
  folds = 5, uni.fdr = 0.05, uni.test = 1, ss.rando = F, ss.nsub = 100,
  ss.fsub = 0.5, ss.fwer = 1, ss.thr = 0.6, dfmax = ncol(data) + 1, 
  pct.rep = 1, pct.qtl = 0.95, showWarn = TRUE, trace = TRUE)
# S3 method for resBMsel
summary(object, show = TRUE, keep = c('tt', 'z', 'x', 'xt'), 
  add.ridge = FALSE, ...)

Arguments

data

input data.frame. Each row is an observation.

x

column names or positions of the biomarkers in data.

y

column names or positions of the survival outcome in data. The first column must be the time and the second must be the event indicator (0/1).

z

column names or positions of the clinical covariates in data, if any.

tt

column name or position of the treatment in data, if any.

inter

logical parameter indicating if biomarker-by-treatment interactions should be computed.

std.x

logical parameter indicating if the biomarkers should be standardized (i.e. subtracting the mean and dividing by the standard deviation of each biomarker).

std.i

logical parameter indicating if the biomarker-by-treatment interactions should be standardized (i.e. subtracting the mean and dividing by the standard deviation of each interaction).

std.tt

logical parameter indicating if the treatment should be recoded as +/-0.5.
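
The three standardization options can be illustrated with base R alone. The sketch below is not the package's internal code: the toy data, the object names and the order of operations (standardize biomarkers, recode treatment, then multiply) are assumptions made for illustration.

```r
# Toy data: 6 patients, 2 biomarkers, treatment coded 0/1 (all values hypothetical)
x  <- cbind(bm1 = c(1.2, 0.4, 2.1, 1.7, 0.9, 1.5),
            bm2 = c(10, 12, 9, 14, 11, 13))
tt <- c(0, 1, 0, 1, 1, 0)

# std.x-style standardization: subtract the column mean, divide by the column SD
x_std <- scale(x, center = TRUE, scale = TRUE)

# std.tt-style recoding: 0/1 becomes -0.5/+0.5
tt_rec <- tt - 0.5

# Interaction columns: each biomarker column multiplied by the recoded treatment
# (assumed construction; std.i = TRUE would additionally standardize these columns)
xt <- x_std * tt_rec
colnames(xt) <- paste0(colnames(x), ":tt")
```

With std.i = FALSE (the default in the usage above), the interaction columns would be left on the scale produced by this product.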

method

methods computed to perform variable selection and to estimate the regression coefficients. See the Details section to understand all the implemented methods.

folds

number of folds for cross-validation. folds must be either a single value between 3 and the sample size (the latter corresponds to leave-one-out CV, which is not recommended for large datasets), or a vector (same length as the sample size) indicating the fold assignment of each observation.
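
When a fixed fold assignment is preferred over a fold count, a vector of the kind described above can be built in base R. This is a minimal sketch; the sample size and the name fold_id are hypothetical.

```r
set.seed(1)
n <- 23   # hypothetical sample size
k <- 5    # number of folds

# Repeat the labels 1..k until n values are available, then shuffle them
# so that fold sizes differ by at most one observation
fold_id <- sample(rep(seq_len(k), length.out = n))

table(fold_id)
```

A vector such as fold_id has one entry per observation, which is the form the folds argument accepts in place of a single number.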

uni.fdr, uni.test

specific parameters for the univariate procedure. uni.fdr: threshold false discovery rate (FDR) to control for multiple testing (Benjamini and Hochberg, 1995), uni.test: model comparison approach. 1: p-value of the biomarker effect (i.e. main effect for the prognostic setting, or main effect + interaction for the interaction setting), 2: p-value of the interaction (only available for the interaction setting).
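
The FDR thresholding step can be sketched with p.adjust() from base R. The p-values below are made up, and the sketch covers only the Benjamini-Hochberg adjustment, not the univariate Cox models that would produce the p-values.

```r
# Hypothetical univariate p-values, one per biomarker
p_values <- c(bm1 = 0.0004, bm2 = 0.012, bm3 = 0.027, bm4 = 0.20, bm5 = 0.65)
uni_fdr  <- 0.05   # plays the role of the uni.fdr threshold

# Benjamini-Hochberg adjusted p-values
p_adj <- p.adjust(p_values, method = "BH")

# Keep the biomarkers whose adjusted p-value does not exceed the threshold
selected <- names(p_adj)[p_adj <= uni_fdr]
selected
```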

ss.fsub, ss.fwer, ss.nsub, ss.rando, ss.thr

specific parameters for the stability selection. ss.fsub: fraction of samples to use in the sampling process, ss.fwer: parameter to control the family-wise error rate (FWER, i.e. number of noise variables), ss.nsub: number of subsamples, ss.rando: logical parameter indicating if random weights should be added in the lasso penalty, ss.thr: threshold on the stability probability used to filter variables.
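
The role of ss.nsub and ss.thr can be sketched as a frequency threshold over subsamples. In the sketch below the per-subsample selections are simulated at random rather than obtained from lasso fits, so the selection probabilities and all object names are assumptions for illustration only.

```r
set.seed(2024)
ss_nsub <- 100   # number of subsamples
ss_thr  <- 0.6   # threshold on the stability probability

# Stand-in for lasso selections: one row per subsample, one column per biomarker.
# Each entry is 1 if the biomarker was selected in that subsample.
prob <- c(bm1 = 0.95, bm2 = 0.90, bm3 = 0.10, bm4 = 0.05, bm5 = 0.10)
sel  <- matrix(rbinom(ss_nsub * length(prob), size = 1,
                      prob = rep(prob, each = ss_nsub)),
               nrow = ss_nsub, dimnames = list(NULL, names(prob)))

# Stability probability: share of subsamples in which each biomarker was selected
freq <- colMeans(sel)

# Biomarkers passing the threshold
stable <- names(freq)[freq >= ss_thr]
```

In an actual stability selection run, each row of sel would come from a penalized fit on a random fraction (ss.fsub) of the observations.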

dfmax

maximum number of variables allowed in the model. Useful with a very large number of covariates to limit the computation time.

pct.rep, pct.qtl

specific parameters for the percentile lasso. pct.rep: number of replicates, pct.qtl: percentile used to estimate the lambda among its empirical distribution.
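
The percentile step can be sketched as taking a quantile over the tuning parameters obtained from repeated cross-validation. The lambda values below are drawn at random as a stand-in for those repeated runs, and the number of replicates is an arbitrary choice for illustration.

```r
set.seed(42)
pct_rep <- 100    # number of replicates (hypothetical value)
pct_qtl <- 0.95   # percentile of the empirical distribution

# Stand-in for the lambda chosen by each cross-validation replicate
lambda_rep <- rlnorm(pct_rep, meanlog = log(0.05), sdlog = 0.3)

# Percentile-lasso style choice: the pct_qtl quantile of the replicated lambdas
lambda_pct <- quantile(lambda_rep, probs = pct_qtl, names = FALSE)
lambda_pct
```

A high percentile gives a larger penalty than the typical replicate, and therefore a sparser model.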

showWarn

logical parameter indicating if warnings should be printed.

trace

logical parameter indicating if messages should be printed.

object

object of class 'resBMsel' returned by BMsel.

show

parameter for the summary() indicating if the result should be printed.

keep

parameter for the summary() indicating the type of covariates that should be kept for the summary (tt: treatment covariate, z: clinical covariates, x: biomarker main effects and xt: biomarker-by-treatment interactions).

add.ridge

parameter for the summary() indicating if the results of the ridge penalty should be kept in the summary, even though no selection is performed with this method.

...

other parameters for plot or summary.

Value

An object of class 'resBMsel' containing the list of the selected biomarkers and their estimated regression coefficients for the chosen methods.

Details

The arguments x, y, z (if any) and tt (if any) are mandatory for non-simulated data sets. The method parameter specifies the approaches used for model selection. Most of these selection methods are based on the lasso penalty (Tibshirani, 1996). The tuning parameter is usually chosen through the cross-validated log-likelihood criterion (cvl), except for the empirical extensions of the lasso, in which additional penalties to the cvl (indicated by a suffix, e.g. lasso-pcvl) are used to estimate the tuning parameter. Other methods based on the lasso are also implemented, such as the adaptive lasso (alassoL, alassoR and alassoU, where the last letter indicates the procedure used to estimate the preliminary weights: "L" for lasso, "R" for ridge and "U" for univariate), the elastic net (enet) and stability selection (stabSel).

For the interaction setting, specific methods are implemented: to reduce or control the main effects matrix (i.e. ridge (ridgelasso) or dimension reduction (PCAlasso or PLSlasso)), to select or discard main effects and interactions simultaneously (i.e. group lasso (glasso)), or to include only the interaction part in the model (i.e. modCov).

Some selection methods not based on penalized regression are also available: univariate selection (uniFDR) and gradient boosting (gboost). Finally, although it performs no selection, the ridge penalty can also be computed. In all cases, clinical covariates are left unpenalized in the models.

References

Ternes N, Rotolo F and Michiels S. Empirical extensions of the lasso penalty to reduce the false discovery rate in high-dimensional Cox regression models. Statistics in Medicine 2016;35(15):2561-2573. doi:10.1002/sim.6927

Ternes N, Rotolo F, Heinze G and Michiels S. Identification of biomarker-by-treatment interactions in randomized clinical trials with survival outcomes and high-dimensional spaces. Biometrical Journal. In press. doi:10.1002/bimj.201500234

Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Ser B 1996;58:267-288.

Examples

# NOT RUN {
########################################
# Simulated data set
########################################

## Low calculation time
  set.seed(654321)
  sdata <- simdata(
    n = 500, p = 20, q.main = 3, q.inter = 0,
    prob.tt = 0.5, alpha.tt = 0,
    beta.main = -0.8,
    b.corr = 0.6, b.corr.by = 4,
    m0 = 5, wei.shape = 1, recr = 4, fu = 2,
    timefactor = 1)

  resBM <- BMsel(
    data = sdata, 
    method = c("lasso", "lasso-pcvl"), 
    inter = FALSE, 
    folds = 5)
  
  summary(resBM)

# }
# NOT RUN {
## Moderate calculation time
  set.seed(123456)
  sdata <- simdata(
    n = 500, p = 100, q.main = 5, q.inter = 5,
    prob.tt = 0.5, alpha.tt = -0.5,
    beta.main = c(-0.5, -0.2), beta.inter = c(-0.7, -0.4),
    b.corr = 0.6, b.corr.by = 10,
    m0 = 5, wei.shape = 1, recr = 4, fu = 2,
    timefactor = 1,
    active.inter = c("bm003", "bm021", "bm044", "bm049", "bm097"))

  resBM <- BMsel(
    data = sdata, 
    method = c("lasso", "lasso-pcvl"), 
    inter = TRUE, 
    folds = 5)
  
  summary(resBM)
  summary(resBM, keep = "xt")
# }
# NOT RUN {
########################################
# Breast cancer data set
########################################

  data(Breast)
  dim(Breast)

  set.seed(123456)
  resBM <-  BMsel(
    data = Breast,
    x = 4:ncol(Breast),
    y = 2:1,
    tt = 3,
    inter = FALSE,
    std.x = TRUE,
    folds = 5,
    method = c("lasso", "lasso-pcvl"))

  summary(resBM)
# }
