iCoxBoost: Interface for cross-validation and model fitting using a formula description

Description

Formula interface for fitting a Cox proportional hazards model by componentwise likelihood based boosting (via a call to CoxBoost), where cross-validation can be performed automatically for determining the number of boosting steps (via a call to cv.CoxBoost).

Usage

iCoxBoost(formula,data=NULL,weights=NULL,subset=NULL,mandatory=NULL,
		  cause=1,standardize=TRUE,stepno=200,
		  criterion=c("pscore","score","hpscore","hscore"),
		  nu=0.1,stepsize.factor=1,varlink=NULL,
		  cv=cvcb.control(),trace=FALSE,...)

Arguments

formula

A formula describing the model to be fitted, similar to a call to coxph. The response must be a survival object, either as returned by Surv or Hist (in a competing risks application).

data

data frame containing the variables described in the formula.

weights

optional vector, for specifying weights for the individual observations.

subset

a vector specifying a subset of observations to be used in the fitting process.

mandatory

vector containing the names of the covariates whose effect is to be estimated un-regularized.

cause

cause of interest in a competing risks setting, when the response is specified by Hist (see e.g. Fine and Gray, 1999; Binder et al. 2009a).

standardize

logical value indicating whether covariates should be standardized for estimation. This does not apply for mandatory covariates, i.e., these are not standardized.

stepno

maximum number of boosting steps to be evaluated when determining the number of boosting steps by cross-validation; otherwise the number of boosting steps itself.

criterion

indicates the criterion to be used for selection in each boosting step. "pscore" corresponds to the penalized score statistics, "score" to the un-penalized score statistics. Different results will only be seen for un-standardized covariates ("pscore" will result in preferential selection of covariates with larger covariance), or if different penalties are used for different covariates. "hpscore" and "hscore" correspond to "pscore" and "score", respectively, but use a heuristic that evaluates only a subset of covariates in each boosting step, as described in Binder et al. (2013). This can considerably speed up computation, but may lead to different results.

nu

(roughly) the fraction of the partial maximum likelihood estimate used for the update in each boosting step. This is converted into a penalty for the call to CoxBoost. Use smaller values, e.g., 0.01, when there is little information in the data, and larger values, such as 0.1, when there is much information or when the number of events is larger than the number of covariates. Note that the default for direct calls to CoxBoost corresponds to nu=0.1.

stepsize.factor

determines the step-size modification factor by which the natural step size of boosting steps should be changed after a covariate has been selected in a boosting step. The default (value 1) implies constant nu. For a value < 1, the value of nu for a covariate is decreased after it has been selected in a boosting step; for a value > 1, it is increased. If pendistmat is given, updates of nu are only performed for covariates that have at least one connection to another covariate.

varlink

list for specifying links between covariates, used to re-distribute step sizes when stepsize.factor != 1. The list needs to contain at least two vectors, the first containing the names of the source covariates, the second containing the names of the corresponding target covariates, and a third (optional) vector containing weights between 0 and 1 (defaulting to 1). If nu is increased/decreased for one of the source covariates according to stepsize.factor, the nu for the corresponding target covariate is decreased/increased accordingly (multiplied by the weight). If formula contains interaction terms, rules for these can also be set up, using variable names such as V1:V2 for the interaction term between covariates V1 and V2.

cv

TRUE for performing cross-validation with default parameters, FALSE for not performing cross-validation, or a list containing the parameters for cross-validation, as obtained from a call to cvcb.control.

trace

logical value indicating whether progress in estimation should be indicated by printing the name of the covariate updated.

...

miscellaneous arguments, passed to the call to cv.CoxBoost.
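The relation between nu and the penalty handed to CoxBoost can be illustrated with a small sketch in base R. It assumes that the penalty equals the number of events times (1/nu - 1). This assumption agrees with the note above that nu=0.1 corresponds to the CoxBoost default (a penalty of 9 times the number of events), but the exact conversion used internally is not stated in this page. The helper name nu.to.penalty is hypothetical and not part of the package.

```r
# Hypothetical helper (not part of CoxBoost): converts nu into a penalty,
# ASSUMING penalty = (number of events) * (1/nu - 1).
nu.to.penalty <- function(nu, status) {
  stopifnot(nu > 0, nu <= 1)
  sum(status == 1) * (1 / nu - 1)
}

status <- c(1, 0, 1, 1, 0, 1)     # toy event indicators: 4 events
pen.default <- nu.to.penalty(0.1, status)   # 9 times the number of events
pen.small   <- nu.to.penalty(0.01, status)  # smaller nu, larger penalty
print(c(pen.default, pen.small))
```

Under this assumption, reducing nu from 0.1 to 0.01 increases the penalty roughly elevenfold, which matches the advice to use small values of nu when the data carry little information.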

Value

call, formula, terms: call, formula and terms from the formula interface.
cause: cause of interest.
cv.res: result from cv.CoxBoost, if cross-validation has been performed.

Details

In contrast to gradient boosting (implemented e.g. in the glmboost routine in the R package mboost, using the CoxPH loss function), CoxBoost is not based on gradients of loss functions, but adapts the offset-based boosting approach from Tutz and Binder (2007) for estimating Cox proportional hazards models. In each boosting step the previous boosting steps are incorporated as an offset in penalized partial likelihood estimation, which is employed to obtain an update for one single parameter, i.e., one covariate, in every boosting step. This results in sparse fits similar to Lasso-like approaches, with many estimated coefficients being zero. The main model complexity parameter, the number of boosting steps, is automatically selected by cross-validation (using a call to cv.CoxBoost). Note that this will introduce random variation when repeatedly calling iCoxBoost, i.e., it is advised to set/save the random number generator state for reproducible results.
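The advice on reproducibility can be sketched as follows. This is a minimal illustration, assuming the CoxBoost package is installed and that the cross-validation result stored in cv.res contains an element optimal.step, as returned by cv.CoxBoost. The data set toy.data and the seed values are made up for this example.

```r
library(CoxBoost)

# Simulated toy data (same scheme as in the Examples section, but smaller)
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 2), rep(0, p - 2))
real.time <- -log(runif(n)) / (10 * exp(drop(x %*% beta)))
cens.time <- rexp(n, rate = 1 / 10)
toy.data <- as.data.frame(x)
toy.data$status <- ifelse(real.time <= cens.time, 1, 0)
toy.data$time <- pmin(real.time, cens.time)

# Resetting the seed before each call reproduces the random fold assignment
set.seed(42)
fit1 <- iCoxBoost(Surv(time, status) ~ ., data = toy.data, stepno = 50)
set.seed(42)
fit2 <- iCoxBoost(Surv(time, status) ~ ., data = toy.data, stepno = 50)

# Both runs should select the same number of boosting steps
same.steps <- identical(fit1$cv.res$optimal.step, fit2$cv.res$optimal.step)
print(same.steps)
```

Without the second call to set.seed, the folds would be drawn anew and the selected number of boosting steps could differ between the two fits.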

The advantage of the offset-based approach compared to gradient boosting is that the penalty structure is very flexible. In the present implementation this is used for allowing for unpenalized mandatory covariates, which receive a very fast coefficient build-up in the course of the boosting steps, while the other (optional) covariates are subjected to penalization. For example in a microarray setting, the (many) microarray features would be taken to be optional covariates, and the (few) potential clinical covariates would be taken to be mandatory, by including their names in mandatory.

If a group of correlated covariates has influence on the response, e.g. genes from the same pathway, componentwise boosting will often result in a non-zero estimate for only one member of this group. To avoid this, information on the connection between covariates can be provided in varlink. If then, in addition, a penalty updating scheme with stepsize.factor < 1 is chosen, connected covariates are more likely to be chosen in future boosting steps, if a directly connected covariate has been chosen in an earlier boosting step (see Binder and Schumacher, 2009b).
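A linked fit of this kind might be set up as in the following sketch. It assumes the CoxBoost package is installed and uses simulated data in which V1 and V2 are simply declared to be connected. The link structure, the value stepsize.factor=0.5, and the fixed number of 20 boosting steps are arbitrary choices for illustration, not recommendations.

```r
library(CoxBoost)

# Simulated toy data with two informative covariates, V1 and V2
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 2), rep(0, p - 2))
real.time <- -log(runif(n)) / (10 * exp(drop(x %*% beta)))
cens.time <- rexp(n, rate = 1 / 10)
toy.data <- as.data.frame(x)
toy.data$status <- ifelse(real.time <= cens.time, 1, 0)
toy.data$time <- pmin(real.time, cens.time)

# V1 and V2 are linked in both directions; the optional third vector of
# weights is omitted, so the weights default to 1
links <- list(c("V1", "V2"),   # source covariates
              c("V2", "V1"))   # corresponding target covariates

# Fixed number of boosting steps, no cross-validation, to keep the sketch short
fit.link <- iCoxBoost(Surv(time, status) ~ ., data = toy.data,
                      stepsize.factor = 0.5, varlink = links,
                      cv = FALSE, stepno = 20)
summary(fit.link)
```

With stepsize.factor < 1, selecting V1 in a boosting step reduces its own nu and raises the nu of the linked covariate V2, so V2 becomes more likely to be picked up in later steps.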

References

Binder, H., Benner, A., Bullinger, L., and Schumacher, M. (2013). Tailoring sparse multivariable regression techniques for prognostic single-nucleotide polymorphism signatures. Statistics in Medicine, doi: 10.1002/sim.5490.

Binder, H., Allignol, A., Schumacher, M., and Beyersmann, J. (2009a). Boosting for high-dimensional time-to-event data with competing risks. Bioinformatics, 25:890-896.

Binder, H. and Schumacher, M. (2009b). Incorporating pathway information into boosting estimation of high-dimensional risk prediction models. BMC Bioinformatics, 10:18.

Binder, H. and Schumacher, M. (2008). Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinformatics, 9:14.

Tutz, G. and Binder, H. (2007). Boosting ridge regression. Computational Statistics & Data Analysis, 51(12):6044-6059.

Fine, J. P. and Gray, R. J. (1999). A proportional hazards model for the subdistribution of a competing risk. Journal of the American Statistical Association, 94:496-509.

Examples

library(CoxBoost)

#   Generate some survival data with 2 informative covariates
n <- 200; p <- 100
beta <- c(rep(1,2),rep(0,p-2))
x <- matrix(rnorm(n*p),n,p)
actual.data <- as.data.frame(x)
real.time <- -(log(runif(n)))/(10*exp(drop(x %*% beta)))
cens.time <- rexp(n,rate=1/10)
actual.data$status <- ifelse(real.time <= cens.time,1,0)
actual.data$time <- ifelse(real.time <= cens.time,real.time,cens.time)

#   Fit a Cox proportional hazards model by iCoxBoost

cbfit <- iCoxBoost(Surv(time,status) ~ .,data=actual.data) 
summary(cbfit)
plot(cbfit)

#   ... with covariate 1 (V1) being mandatory

cbfit.mand <- iCoxBoost(Surv(time,status) ~ .,data=actual.data,mandatory=c("V1")) 
summary(cbfit.mand)
plot(cbfit.mand)
