coxph
Fits a Cox proportional hazards regression model. Time dependent variables, time dependent strata, multiple events per subject, and other extensions are incorporated using the counting process formulation of Andersen and Gill.
 Keywords
 survival
Usage
coxph(formula, data=, weights, subset,
na.action, init, control,
ties=c("efron","breslow","exact"),
singular.ok=TRUE, robust=FALSE,
model=FALSE, x=FALSE, y=TRUE, tt, method, ...)
Arguments
 formula

a formula object, with the response on the left of a
~
operator, and the terms on the right. The response must be a survival object as returned by theSurv
function.  data

a data.frame in which to interpret the variables named in
the
formula
, or in thesubset
and theweights
argument.  weights
 vector of case weights. For a thorough discussion of these see the book by Therneau and Grambsch.
 subset
 expression indicating which subset of the rows of data should be used in the fit. All observations are included by default.
 na.action

a missingdata filter function. This is applied to the model.frame
after any
subset argument has been used. Default is
options()\$na.action
.  init
 vector of initial values of the iteration. Default initial value is zero for all variables.
 control

Object of class
coxph.control
specifying iteration limit and other control options. Default iscoxph.control(...)
.  ties
 a character string specifying the method for tie handling. If there are no tied death times all the methods are equivalent. Nearly all Cox regression programs use the Breslow method by default, but not this one. The Efron approximation is used as the default here, it is more accurate when dealing with tied death times, and is as efficient computationally. The ``exact partial likelihood'' is equivalent to a conditional logistic model, and is appropriate when the times are a small set of discrete values. See further below.
 singular.ok

logical value indicating how to handle collinearity in the model matrix.
If
TRUE
, the program will automatically skip over columns of the X matrix that are linear combinations of earlier columns. In this case the coefficients for such columns will be NA, and the variance matrix will contain zeros. For ancillary calculations, such as the linear predictor, the missing coefficients are treated as zeros.  robust
 this argument has been deprecated, use a cluster term in the model instead. (The two options accomplish the same goal  creation of a robust variance  but the second is more flexible).
 model

logical value: if
TRUE
, the model frame is returned in componentmodel
.  x

logical value: if
TRUE
, the x matrix is returned in componentx
.  y

logical value: if
TRUE
, the response vector is returned in componenty
.  tt
 optional list of timetransform functions.
 method
 alternate name for the
ties
argument.  ...
 Other arguments will be passed to
coxph.control
Details
The proportional hazards model is usually expressed in terms of a single survival time value for each person, with possible censoring. Andersen and Gill reformulated the same problem as a counting process; as time marches onward we observe the events for a subject, rather like watching a Geiger counter. The data for a subject is presented as multiple rows or "observations", each of which applies to an interval of observation (start, stop]. The routine internally scales and centers data to avoid overflow in the argument to the exponential function. These actions do not change the result, but lead to more numerical stability. However, arguments to offset are not scaled since there are situations where a large offset value is a purposefully used. Users should not use normally allow large numeric offset values.
Value
an object of class coxph
representing the fit.
See coxph.object
for details.
Side Effects
Depending on the call, the predict
, residuals
,
and survfit
routines may
need to reconstruct the x matrix created by coxph
.
It is possible for this to fail, as in the example below in
which the predict function is unable to find tform
.
tfun < function(tform) coxph(tform, data=lung) fit < tfun(Surv(time, status) ~ age) predict(fit)In such a case add the
model=TRUE
option to the
coxph
call to obviate the
need for reconstruction, at the expense of a larger fit
object.
Special terms
There are three special terms that may be used in the model equation.
A strata
term identifies a stratified Cox model; separate baseline
hazard functions are fit for each strata.
The cluster
term is used to compute a robust variance for the model.
The term + cluster(id)
where each value of id
is unique is
equivalent to
specifying the robust=T
argument.
If the id
variable is not
unique, it is assumed that it identifies clusters of correlated
observations.
The robust estimate arises from many different arguments and thus has
had many labels. It is variously known as the
Huber sandwich estimator, White's estimate (linear models/econometrics),
the HorvitzThompson estimate (survey sampling), the working
independence variance (generalized estimating equations), the
infinitesimal jackknife, and the Wei, Lin, Weissfeld (WLW) estimate. A timetransform term allows variables to vary dynamically in time. In
this case the tt
argument will be a function or a list of
functions (if there are more than one tt() term in the model) giving the
appropriate transform. See the examples below.
Convergence
In certain data cases the actual MLE estimate of a coefficient is infinity, e.g., a dichotomous variable where one of the groups has no events. When this happens the associated coefficient grows at a steady pace and a race condition will exist in the fitting routine: either the log likelihood converges, the information matrix becomes effectively singular, an argument to exp becomes too large for the computer hardware, or the maximum number of interactions is exceeded. (Nearly always the first occurs.) The routine attempts to detect when this has happened, not always successfully. The primary consequence for he user is that the Wald statistic = coefficient/se(coefficient) is not valid in this case and should be ignored; the likelihood ratio and score tests remain valid however.
Ties
There are three possible choices for handling tied event times.
The Breslow approximation is the easiest to program and hence became the
first option coded for almost all computer routines. It then ended up
as the default option when other options were added in order to "maintain
backwards compatability". The Efron option is more accurate if there are
a large number of ties, and it is the default option here.
In practice the number of ties is usually small, in which case all the
methods are statistically indistinguishable. Using the "exact partial likelihood" approach the Cox partial likelihood
is equivalent to that for matched logistic regression. (The
clogit
function uses the coxph
code to do the fit.)
It is technically appropriate when the time scale is discrete and has
only a few unique values, and some packages refer to this as the
"discrete" option. There is also an "exact marginal likelihood" due to
Prentice which is not implemented here. The calculation of the exact partial likelihood is numerically intense.
Say for instance 15 of 180 subjects at risk had an event on day 7; then
the code needs to compute sums over all 180choose15 > 10^43 different
possible subsets of size 15. There is an efficient recursive algorithm
for this task, but even with this the computation can be insufferably
long. With (start, stop) data it is much worse since the recursion
needs to start anew for each unique start time.
Penalized regression
coxph
can now maximise a penalised partial likelihood with
arbitrary userdefined penalty. Supplied penalty functions include
ridge regression (ridge), smoothing splines
(pspline), and frailty models (frailty).
References
Andersen, P. and Gill, R. (1982). Cox's regression model for counting processes, a large sample study. Annals of Statistics 10, 11001120. Therneau, T., Grambsch, P., Modeling Survival Data: Extending the Cox Model. SpringerVerlag, 2000.
See Also
Examples
# Create the simplest test data set
test1 < list(time=c(4,3,1,1,2,2,3),
status=c(1,1,1,0,1,1,0),
x=c(0,2,1,1,1,0,0),
sex=c(0,0,0,0,1,1,1))
# Fit a stratified model
coxph(Surv(time, status) ~ x + strata(sex), test1)
# Create a simple data set for a timedependent model
test2 < list(start=c(1,2,5,2,1,7,3,4,8,8),
stop=c(2,3,6,7,8,9,9,9,14,17),
event=c(1,1,1,1,1,1,1,0,0,0),
x=c(1,0,0,1,0,1,1,1,0,0))
summary(coxph(Surv(start, stop, event) ~ x, test2))
#
# Create a simple data set for a timedependent model
#
test2 < list(start=c(1, 2, 5, 2, 1, 7, 3, 4, 8, 8),
stop =c(2, 3, 6, 7, 8, 9, 9, 9,14,17),
event=c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0),
x =c(1, 0, 0, 1, 0, 1, 1, 1, 0, 0) )
summary( coxph( Surv(start, stop, event) ~ x, test2))
# Fit a stratified model, clustered on patients
bladder1 < bladder[bladder$enum < 5, ]
coxph(Surv(stop, event) ~ (rx + size + number) * strata(enum) +
cluster(id), bladder1)
# Fit a time transform model using current age
coxph(Surv(time, status) ~ ph.ecog + tt(age), data=lung,
tt=function(x,t,...) pspline(x + t/365.25))