Comparing methods for time varying logistic models

knitr::knit_hooks$set( mySettings = function(before, options, envir){ if (before && options$mySettings){ par( mar = c(5, 5, 1, 1), bty = "n", xaxs = "i", pch=16, cex= (cex <- .4), cex.axis = .8 / cex, cex.lab = .8 / cex, lwd= 1) } else if(before && options$plot_2x1){ par( mfcol = c(2, 1), mar = c(4, 4, 2, 2), cex = .75) }}) options(digits = 3, width = 80, warn = -1) knitr::opts_chunk$set( echo = TRUE, mySettings=TRUE, fig.height=3.5, fig.width = 6, warning = F, message = F, plot_2x1 = FALSE) knitr::opts_knit$set(warning = F, message = F)

\renewcommand{\vec}[1]{\bm{#1}} \newcommand{\mat}[1]{\mathbf{#1}}

Introduction

In this vignette, we compare the dynamic logistic model in dynamichazard with other methods within the package and methods from the timereg and mgcv packages. Further this note will serve as an illustration of how to use the ddhazard function for the logistic model. We will use the pbc2 dataset from the survival package. The motivation is that the pbc2 data set is commonly used in survival analysis for illustrations. It is suggested to first read or skim vignette("ddhazard", "dynamichazard") to get an introduction to the models and estimation methods in this package.

The note is structured as follows: First, we cover the pbc2 data set. Then we estimate two static (non-dynamic) logistic regression models using glm. This is followed by a fit using a Generalized Additive model with the gam function in the mgcv package. Next, we will estimate a cox-model with time varying parameters using the timecox function in the timereg package. Finally, we will end by illustrating the methods in this package for time varying parameters in a logistic regression.

You can install the version of the library used to make this vignettes from github with the devtools library as follows:

tryCatch({ current_sha <- paste0("@", httr::content( httr::GET("https://api.github.com/repos/boennecd/dynamichazard/git/refs/heads/master") )$object$sha) }, error = function(...){ current_sha <<- "" }) stopifnot(length(current_sha) > 0 && class(current_sha) == "character") current_version <- paste0("boennecd/dynamichazard@", current_sha)
current_version # The string you need to pass devtools::install_github
devtools::install_github(current_version)

You can also get the latest version on CRAN by calling:

install.packages("dynamichazard")

The pbc data set

The pbc data set contains data from the Mayo Clinic trial on primary biliary cirrhosis. We use this dataset to compare with results previously found analyzing the data set. We will focus on a the subset of co-variates used in [@martinussen07]. The dataset can be created by:

# PBC data set from survival with time variying covariates # Details of tmerge are not important in this scope. The code is included # to make you able to reproduce the results # See: https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf library(survival) temp <- subset(pbc, id <= 312, select=c(id, sex, time, status, edema, age)) pbc2 <- tmerge(temp, temp, id=id, death = event(time, status)) pbc2 <- tmerge(pbc2, pbcseq, id=id, albumin = tdc(day, albumin), protime = tdc(day, protime), bili = tdc(day, bili)) pbc2 <- pbc2[, c("id", "tstart", "tstop", "death", "sex", "edema", "age", "albumin", "protime", "bili")]

as described in the vignette Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model in the survival package. The resulting data frame is structured as follows:

head(pbc2)

The data set is in the usual start and stop time format used in survival analysis. Each individual in the trial has one or more observations. The id column identifies the individual. The tstart column indicates when the row is valid from and the tstop column indicates when the row is valid to. The death column is the outcome which is 2 when the individual dies at tstop (1 indicates that the individual gets a transplant). The sex, edema and age are baseline variables while albumin, protime and bili are laboratory values updated at follow ups with the patient. As an example, we can look individual 282:

(ex <- pbc2[pbc2$id == 282, ])

She (sex is f) has had four laboratory values measured at time r ex$tstart[1], r ex$tstart[2], r ex$tstart[3] and r ex$tstart[4]. Further, she does not die as the death column is zero in the final row.

Logistic model

We can start of with a simple logistic model where we ignore tstart and tstop variable using glm:

glm_simple <- glm(death == 2 ~ age + edema + log(albumin) + log(protime) + log(bili), binomial, pbc2) glm_simple$coefficients

Though, we want to account for the fact that say the second the row of individual 282 has a length of r (ex$tstop - ex$tstart)[2] days (r ex$tstop[2] minus r ex$tstart[2]) while the fourth row has a length r (ex$tstop - ex$tstart)[4] days. A way to incorporate this information is to bin the observations into periods of a given length. Example of such binning is found in [@tutz16] and [@shumway01].

Say that we set the bin interval lengths to 100 days. Then the first row for id = 282 will yield three observation: one from 0 to 100, one from 100 to 200 and one from 200 to 300 (since we know that she survives beyond time 300). That is, she survives from time 0 to time 100, survives from time 100 to time 200 etc. We can make such a data frame with the get_survival_case_weights_and_data function in this package:

library(dynamichazard) pbc2_big_frame <- get_survival_case_weights_and_data( Surv(tstart, tstop, death == 2) ~ age + edema + log(albumin) + log(protime) + log(bili), data = pbc2, id = pbc2$id, by = 100, max_T = 3600, use_weights = F) pbc2_big_frame <- pbc2_big_frame$X

The code above uses the Surv function on the left hand side of the formula. The Surv function needs a start time, a stop time and an outcome. The right hand side is as before. The by argument specifies the interval length (here 100 days) and the max_T specify the last time we want to include. We will comment on use_weights argument shortly. As an example, we can look at individual 282 in the new data frame:

pbc2_big_frame[pbc2_big_frame$id == 282, ]

Notice that three new columns have been added: Y which is the outcome, t which is the stop time in the bin and weights which is the weight to be used in a regression. We see that the first row in the initial data frame for individual 282 has three rows now since the row is in three bins (the bins at times $(0,100],(100,200]$ and $(200,300]$). We could just use weights instead. This is what we get if we set use_weights = T:

pbc2_small_frame <- get_survival_case_weights_and_data( Surv(tstart, tstop, death == 2) ~ age + edema + log(albumin) + log(protime) + log(bili), data = pbc2, id = pbc2$id, by = 100, max_T = 3600, use_weights = T) pbc2_small_frame <- pbc2_small_frame$X

The new data rows for individual 282 looks as follows:

pbc2_small_frame[pbc2_small_frame$id == 282, ]

Further, individuals who do die are treated a bit differently. For instance, take individual 268:

pbc2[pbc2$id == 268, ] # the orginal data pbc2_small_frame[pbc2_small_frame$id == 268, ] # new data

Notice, that we have to add an additional row with weight 1 where Y = 1 as it would be wrong to give a weight of 10 to a the row with Y = 1. She survives for 11 bins and dies in the 12th bin. Finally, we can fit the model with the glm function using either of the two data frames as follows:

glm_fit_big <- glm(Y ~ age + edema + log(albumin) + log(protime) + log(bili), binomial, pbc2_big_frame) glm_fit_small <- glm(Y ~ age + edema + log(albumin) + log(protime) + log(bili), binomial, pbc2_small_frame, weights = pbc2_small_frame$weights) # <-- we use weights

We can confirm that the two models give the same estimate:

all.equal(glm_fit_big$coefficients, glm_fit_small$coefficients)

Further, the binning affects the estimates as shown below. In particular, it affects the estimates for edema and log(albumin). The standard errors from the simple fit are also printed. However, these standard errors do not account for the dependence as we use multiple observations from the each individual.

knitr::kable( rbind("glm with bins" = glm_fit_big$coefficients, "glm without bins" = glm_simple$coefficients, "Sds with bins" = summary(glm_fit_big)[["coefficients"]][, "Std. Error"], "Sds without bins" = summary(glm_simple)[["coefficients"]][, "Std. Error"]), row.names = T)

To end this section, you can skip making data frame with get_survival_case_weights_and_data by calling the static_glm function from dynamichazard package. The call below yields the same results as shown:

static_glm_fit <- static_glm( Surv(tstart, tstop, death == 2) ~ age + edema + log(albumin) + log(protime) + log(bili), data = pbc2, id = pbc2$id, by = 100, max_T = 3600) all.equal(static_glm_fit$coefficients, glm_fit_big$coefficients)

For details, see the help file by typing ?static_glm.

Generalized Additive Models using mgvc

The first method we will compare with is Generalized Additive Models (GAM) by using the gam function in the mgcv package. The model we fit is of the form:

$$\begin{aligned} \text{logit}(\pii) &= \vec{\gamma}{\text{time}}\vec{f}_{\text{time}}(t_i)

+ \vec{\gamma}_{\text{age}}\vec{f}_{\text{time}}(t_i)a_i
+ \vec{\gamma}_{\text{ede}}\vec{f}_{\text{ede}}(t_i)e_i
+ \vec{\gamma}_{\text{alb}}\vec{f}_{\text{alb}}(t_i)\log al_{it} \\
&\quad + \vec{\gamma}_{\text{pro}}\vec{f}_{\text{alb}}(t_i)\log p_{it}
+ \vec{\gamma}_{\text{biil}}\vec{f}_{\text{bili}}(t_i)\log b_{it}

\end{aligned}$$

where $\pi_{it}$ is the probability that the $i$'th individual dies of cancer, $t$ is the stop time of the bin and $a_i$, $ei$, $al{it}$, $p{it}$ and $b{it}$ are respectively the $i$'th individual's age, edema, albumin, protime and bilirunbin. The extra subscript $t$ is added to refer to the level of the covariate in the bin at time $t$. It is important to notice that we will use the same bins as shown previously. In particular, we will use the pbc2_big_frame data frame from before. $\vec{f}_{\cdot}$ are a smooth function. We will use cubic regression splines with knots spread evenly through the covariate values. We fit the model with the following call:

library(mgcv, quietly = T) spline_fit <- gam( formula = Y ~ # cr is cubic spline with k knots s(t, bs = "cr", k = 3, by = age) + s(t, bs = "cr", k = 3, by = edema) + s(t, bs = "cr", k = 5, by = log(albumin)) + s(t, bs = "cr", k = 5, by = log(protime)) + s(t, bs = "cr", k = 5, by = log(bili)), family = binomial, data = pbc2_big_frame)

The above estimates the GAM model where the likelihood is penalized by a L2 norm for each of the spline functions. The tuning parameters are chosen by generalized cross validation. Below are plots of the estimates:

plot(spline_fit, scale = 0, page = 1, rug = F)

Further, we compare the result with static model. Recall that our static model had the following estimates:

glm_fit_big$coefficients

These do seem to correspond with the plots. Further, the intercept in the spline model is:

spline_fit$coefficients["(Intercept)"]

which again seems to match. The plot suggest that there may be time varying effects for bili particularly

Time varying cox model from timereg

Another method we can try is a time varying effects Cox model. We will use the Cox model from the package timereg based on [@martinussen07]. The model differs from a regular Cox model (e.g. by using the coxph function from the survival package) by replacing the coefficient $\vec{\beta}$ by $\vec{\beta}(t)$ such that the instantaneous hazard is:

$$\lambda(t) = \lambda_0(t) \exp \left( \vec{x}\vec{\beta}(t) \right)$$

where each element of $\vec{\beta}(t)$ is estimated recursively with an update equation. The update equation is simplified through a first order Taylor expansion of the score function and by adding a smoothness by using weighting depending on time changes with a uniform continuous kernel. For details see [@martinussen07] in chapter 6. The estimation method differs from other alternatives in R that use splines for the time varying effects. The baseline is $\lambda_0(t)=\exp(\alpha_0(t))$ where $\alpha_0(t)$ is estimated in a similar to way to $\vec{\beta}(t)$. Below we estimate a model similar to the previously fitted models.

We set the last observation time (max.time) lower than in the previous model as there are issues with convergence if we do not. For the same reason we specify the effect of log(protime) to be constant (non-time varying). The cumulated coefficients $B(t) = \int_0^t \beta(t) dt$ are plotted to inspect the fitted model. Thus, a constant effect should be roughly linear.

library(timereg) cox_fit<- timecox(Surv(tstart / 365, tstop / 365, death == 2) ~ age + edema + log(albumin) + const(log(protime)) + log(bili), pbc2, start.time=0, max.time = 3000 / 365, # <-- decreased id = pbc2$id, bandwidth = 0.35) par(mfcol = c(3, 2)) plot(cox_fit)

The timecox packages provides two test for whether the coefficient is time varying or not:

summary(cox_fit)

The above test suggest that only edema might be "border line" time-varying.

Dynamic hazard model

In this section, we will cover the dynamic hazard model with the logistic link function that is implemented in this package. The model is estimated with an EM-algorithm which are from [@fahrmeir94] and [@fahrmeir92] when an Extended Kalman Filter is used in the E-step are. Firstly, we will briefly cover the model. See vignette("ddhazard", "dynamichazard") for a more detailed explanation of the models. Secondly, we will turn to different ways of designing the model and fitting the model. The idea is that we discretize the outcomes into $1,2,\dots,d$ bins. In each bin, we observe whether the individual dies or survives. The state space model we are applying is of the form: $$\begin{array}{ll} \vec{y}_t = \vec{z}_t(\vec{\alpha}_t) + \vec{\epsilon}_t \qquad & \vec{\epsilon}_t \sim (\vec{0}, \text{Var}(\vec{y}_t | \vec{\alpha}t)) \ \vec{\alpha}{t + 1} = \mathbf{F}\vec{\alpha}_t + \mathbf{r}_t \qquad & \vec{\eta}_t \sim N(\vec{0}, \psi_t \mathbf{Q}) \ \end{array} , \qquad t = 1,\dots, d$$

where $y{it} \in {0,1}$ is an indicator for whether the $i$'th individual dies in interval $t$. $\cdots \sim (a, b)$ denotes a random variable with mean (or mean vector) $a$ and variance (or co-variance matrix) $b$. It needs not to be normally distributed. $z{it}(\vec{\alpha}_t) = h(\vec{\alpha}t\vec{x}{it})$ is the non-linear map from state space variables to mean outcomes where $h$ is the inverse link function. We use the logit model in this note. Thus, $h(x) = \text{logit}^{-1}(x)$. The current implementation supports first and second order random walk for the state equation. Further, we define the conditional covariance matrix in the observational equation as $\mathbf{H}_t(\vec{\alpha}_t) = \text{Var}(\vec{y}_t | \vec{\alpha}_t)$. $\psi_t$ is the length of the bin and is assumed equal for all values of $t$.

The unknown parameters are the starting value $\vec{\alpha}_0$ and co-variance matrix $\mathbf{Q}$. These are estimated with an EM-algorithm. The E-step is calculated using a Extended Kalman filter (EKF), Unscented Kalman filter (UKF) or sequential approximation of the modes. All are followed by a smoothing step. The result is smoothed predictions of $\vec{\alpha}_1,\dots, \vec{\alpha}_d$, smoothed co-variance matrix and smoothed correlation matrices that we need for the M-step to update $\vec{\alpha}_0$ and $\mathbf{Q}$.

Estimation with Extended Kalman Filter

We start by estimating the model using the EKF where we let all coefficients follow a first order random walk:

library(dynamichazard) dd_fit <- ddhazard(Surv(tstart, tstop, death == 2) ~ age + edema + log(albumin) + log(protime) + log(bili), pbc2, id = pbc2$id, by = 100, max_T = 3600, Q_0 = diag(100, 6), Q = diag(0.01, 6), control = ddhazard_control(eps = .001)) plot(dd_fit)

The arguments Q_0 and Q corresponds to the co-variance matrix at time zero and the co-variance matrix in the state equation. Q_0 will remain fixed while Q is the starting value in the first iteration of the EM algorithm after which we update $\mathbf{Q}$.

Next, we plot the coefficient. That is, we plot the predicted latent variables $\vec{\alpha}_0,\dots,\vec{\alpha}_d$. Notice that the predicted coefficient are close to the estimates we saw previously for the GAM model.

Extra iteration in the correction step

Another idea is to take extra iterations in the correction step of the EKF. The motivation is that this step has the form of a Newton Rapshon algorithm as pointed out in [@fahrmeir92] and [@fahrmeir94]. Below, we estimate the model with potentially extra steps in the correction step.

# Pre-computed sds of covariates sds <- c(1, 10, .2, .2, .1, 1) dd_fit <- ddhazard( Surv(tstart, tstop, death == 2) ~ age + edema + log(albumin) + log(protime) + log(bili), pbc2, id = pbc2$id, by = 100, max_T = 3600, Q_0 = diag(10 / sds), Q = diag(0.01 / sds, 6), control = ddhazard_control( eps = .001, NR_eps = 0.0001, # Tolerance in correction step LR = .33 )) # Plot result plot(dd_fit)

First, we run the code with the NR_eps element of the list passed to the control argument set to something that is finite. The value is the threshold for the relative change of in the state vector in correction step of the EKF. We end the code above by creating plots of the new estimates.

Estimation with the Unscented Kalman Filter

Another option is to perform the E-step using an unscented Kalman filter. This is done below. We start by setting the initial co-variance matrix $\mat{Q}_0$ to have large values in the diagonal elements:

dd_fit_UKF <- ddhazard( Surv(tstart, tstop, death == 2) ~ age + edema + log(albumin) + log(protime) + log(bili), pbc2, id = pbc2$id, by = 100, max_T = 3600, Q_0 = diag(rep(1, 6)), Q = diag(rep(0.01, 6)), control = ddhazard_control(method = "UKF", beta = 0, alpha = 1, eps = 0.1, n_max = 1e4)) plot(dd_fit_UKF)

Clearly, the plots of the estimates are not what we expected. The reason is that $\mat{Q}_0$'s diagonal entries are quite large. The square root of the diagonal entries are used to form the sigma points in the first iteration. Hence, we mainly get estimates that are either zero or one when $\mat{Q}_0$ is a diagonal matrix with large entries. You can run the code below to see how the algorithm progress:

# Not run tmp_file <- file("pick_some_file_name.txt") sink(tmp_file) dd_fit_UKF <- ddhazard( Surv(tstart, tstop, death == 2) ~ age + edema + log(albumin) + log(protime) + log(bili), pbc2, id = pbc2$id, by = 100, max_T = 3600, Q_0 = diag(rep(1, 6)), Q = diag(rep(0.01, 6)), control = ddhazard_control(method = "UKF", beta = 0, alpha = 1, debug = T)) # <-- prints information in each iteration sink() close(tmp_file)

It will print quite a lot of information and hence it is recommended to use sink to write the output to a file. The main take away is that the conditional co-variance matrices accumulate in each iteration while the state vectors does not move. This motivates us to pick $\mat{Q}$ and $\mat{Q}_0$ more carefully. Our estimates from the EKF was:

diag(dd_fit$Q)

which could motivate us to make the following choices:

dd_fit_UKF <- ddhazard( Surv(tstart, tstop, death == 2) ~ age + edema + log(albumin) + log(protime) + log(bili), pbc2, id = pbc2$id, by = 100, max_T = 3600, Q_0 = diag(c(0.001, 0.00001, rep(0.001, 4))) * 100, # <-- decreased Q = diag(0.0001, 6), # <-- decreased control = ddhazard_control(method = "UKF", beta = 0, alpha = 1, eps = 0.001)) plot(dd_fit_UKF)

This is comparable to the fits from the EKF and GAM model. The main points here are:

  1. The UKF may work for certain data set. It may require tuning.
  2. The algorithm is sensitive to the choice of $\mat{Q}$ and $\mat{Q}_0$. Further, there is dependence on hyperparameters $\alpha$, $\kappa$ and $\beta$ which we have not explored.
  3. The output you get by setting debug in the list passed to the control argument can be useful. Combine this with sink because the output may be long.
  4. The UKF has shown better performs than the EKF previously. Examples includes [@romanenko04], [@kandepu08], [@julier04], [@wan00] and chapter 11 of [@durbin12].

Estimation with rank-one posterior modes approximation

Another method is to use the sequential rank-one approximation of the posterior posterior modes. This is done below. First, we estimate the model with the Extended Kalman filter and then we estimate the model with the rank-one posterior modes approximation method. We estimate both for comparison.

set.seed(7686280) # <-- Data is permuated so we set a seed
dd_fit_EKF <- ddhazard( Surv(tstart, tstop, death == 2) ~ age + edema + log(albumin) + log(protime) + log(bili), pbc2, id = pbc2$id, by = 100, max_T = 3600, Q_0 = diag(100, 6), Q = diag(0.01, 6), control = ddhazard_control(eps = .001)) dd_fit_SMA <- ddhazard( Surv(tstart, tstop, death == 2) ~ age + edema + log(albumin) + log(protime) + log(bili), pbc2, id = pbc2$id, by = 100, max_T = 3600, Q_0 = diag(100, 6), Q = diag(0.01, 6), control = ddhazard_control( method = "SMA", # change estimation method eps = 0.001))

Next, we plot the two sets of predicted coefficients:

par(mfcol = c(2, 3)) for(i in 1:6){ plot(dd_fit_EKF, cov_index = i, col = "black") plot(dd_fit_SMA, cov_index = i, col = "red", add = TRUE) }
par_old <- par(no.readonly = T) par(mfcol = c(2, 3)) par(cex = par_old$cex * .85, mar = par_old$mar * 1.33) for(i in 1:6){ plot(dd_fit_EKF, cov_index = i, col = "black") plot(dd_fit_SMA, cov_index = i, col = "red", add = TRUE) }

The red curves are from the sequential rank-one approximation. The difference seems minor for this data set.

Estimation with fixed effects

We may want to fit a model where we assume that some of the co-variates are fixed. For instance, we may want to fit a model where age, the intercept, log(protime) and log(albumin) are fixed. We fix these based on the previous plots where the effects seems no to be time-varying. The model can be fitted as by the following call:

dd_fit <- ddhazard( Surv(tstart, tstop, death == 2) ~ ddFixed_intercept() + ddFixed(age) + ddFixed(log(albumin)) + edema + ddFixed(log(protime)) + log(bili), pbc2, id = pbc2$id, by = 100, max_T = 3600, Q_0 = diag(100, 2), Q = diag(0.01, 2), control = ddhazard_control(eps = .001))
plot(dd_fit)

The predicted curves seems similar. Moreover, the fixed effects are in agreement with the previous fits (they are printed below):

dd_fit$fixed_effects

Second order random walk

We will end by fitting a second order random walk to model where only the log(bili) effect is time-varying. The motivation is that the second order random walk tend to diverge more easily especially for small data sets. Further, the previous models seems to suggest that it is the only covariate where we may have a time-varying coefficient. First, we fit the model:

# Define formula form <- Surv(tstart, tstop, death == 2) ~ ddFixed_intercept() + ddFixed(age) + ddFixed(edema) + ddFixed(log(albumin)) + ddFixed(log(protime)) + log(bili) # Fit models dd_fit_EKF <- ddhazard(form, pbc2, id = pbc2$id, by = 100, max_T = 3600, order = 2, # <-- second order Q_0 = diag(5, 2), # <-- needs more elements Q = .0001, # <-- decreased control = ddhazard_control( eps = .0001, est_a_0 = FALSE))

We have to decrease the starting value of Q in the above to get the methods to converge. The fixed effects estimates are:

dd_fit_EKF$fixed_effects

The predicted curve is:

plot(dd_fit_EKF)

We see that the predicted curve is more smooth as expected with a second order random walk. Further, we can confirm that the fixed effects are comparable with our previous fits.

Summary

We have estimated a model using Generalized additive models with the mgcv package and a time-varying Cox model with the timereg package. Further, we have illustrated how the ddhazard function in the dynamichazard package can be used. All the fits have been comparable with the Generalized Additive model.

References