
She (sex is f) has had four laboratory values measured at time r ex$tstart[1], r ex$tstart[2], r ex$tstart[3] and r ex$tstart[4]. Further, she does not die as the death column is zero in the final row.

# Logistic model

We can start of with a simple logistic model where we ignore tstart and tstop variable using glm:

glm_simple <- glm(death == 2 ~ age + edema + log(albumin) + log(protime) + log(bili), binomial, pbc2) glm_simple$coefficients Though, we want to account for the fact that say the second the row of individual 282 has a length of r (ex$tstop - ex$tstart)[2] days (r ex$tstop[2] minus r ex$tstart[2]) while the fourth row has a length r (ex$tstop - ex$tstart)[4] days. A way to incorporate this information is to bin the observations into periods of a given length. Example of such binning is found in [@tutz16] and [@shumway01]. Say that we set the bin interval lengths to 100 days. Then the first row for id = 282 will yield three observation: one from 0 to 100, one from 100 to 200 and one from 200 to 300 (since we know that she survives beyond time 300). That is, she survives from time 0 to time 100, survives from time 100 to time 200 etc. We can make such a data frame with the get_survival_case_weights_and_data function in this package: library(dynamichazard) pbc2_big_frame <- get_survival_case_weights_and_data( Surv(tstart, tstop, death == 2) ~ age + edema + log(albumin) + log(protime) + log(bili), data = pbc2, id = pbc2$id, by = 100, max_T = 3600, use_weights = F) pbc2_big_frame <- pbc2_big_frame$X The code above uses the Surv function on the left hand side of the formula. The Surv function needs a start time, a stop time and an outcome. The right hand side is as before. The by argument specifies the interval length (here 100 days) and the max_T specify the last time we want to include. We will comment on use_weights argument shortly. As an example, we can look at individual 282 in the new data frame: pbc2_big_frame[pbc2_big_frame$id == 282, ]

Notice that three new columns have been added: Y which is the outcome, t which is the stop time in the bin and weights which is the weight to be used in a regression. We see that the first row in the initial data frame for individual 282 has three rows now since the row is in three bins (the bins at times $(0,100],(100,200]$ and $(200,300]$). We could just use weights instead. This is what we get if we set use_weights = T:

pbc2_small_frame <- get_survival_case_weights_and_data( Surv(tstart, tstop, death == 2) ~ age + edema + log(albumin) + log(protime) + log(bili), data = pbc2, id = pbc2$id, by = 100, max_T = 3600, use_weights = T) pbc2_small_frame <- pbc2_small_frame$X

The new data rows for individual 282 looks as follows:

pbc2_small_frame[pbc2_small_frame$id == 282, ] Further, individuals who do die are treated a bit differently. For instance, take individual 268: pbc2[pbc2$id == 268, ] # the orginal data pbc2_small_frame[pbc2_small_frame$id == 268, ] # new data Notice, that we have to add an additional row with weight 1 where Y = 1 as it would be wrong to give a weight of 10 to a the row with Y = 1. She survives for 11 bins and dies in the 12th bin. Finally, we can fit the model with the glm function using either of the two data frames as follows: glm_fit_big <- glm(Y ~ age + edema + log(albumin) + log(protime) + log(bili), binomial, pbc2_big_frame) glm_fit_small <- glm(Y ~ age + edema + log(albumin) + log(protime) + log(bili), binomial, pbc2_small_frame, weights = pbc2_small_frame$weights) # <-- we use weights

We can confirm that the two models give the same estimate:

all.equal(glm_fit_big$coefficients, glm_fit_small$coefficients)

Further, the binning affects the estimates as shown below. In particular, it affects the estimates for edema and log(albumin). The standard errors from the simple fit are also printed. However, these standard errors do not account for the dependence as we use multiple observations from the each individual.

knitr::kable( rbind("glm with bins" = glm_fit_big$coefficients, "glm without bins" = glm_simple$coefficients, "Sds with bins" = summary(glm_fit_big)[["coefficients"]][, "Std. Error"], "Sds without bins" = summary(glm_simple)[["coefficients"]][, "Std. Error"]), row.names = T)

To end this section, you can skip making data frame with get_survival_case_weights_and_data by calling the static_glm function from dynamichazard package. The call below yields the same results as shown:

static_glm_fit <- static_glm( Surv(tstart, tstop, death == 2) ~ age + edema + log(albumin) + log(protime) + log(bili), data = pbc2, id = pbc2$id, by = 100, max_T = 3600) all.equal(static_glm_fit$coefficients, glm_fit_bigcoefficients) For details, see the help file by typing ?static_glm. # Generalized Additive Models using mgvc The first method we will compare with is Generalized Additive Models (GAM) by using the gam function in the mgcv package. The model we fit is of the form: \begin{aligned} \text{logit}(\pii) &= \vec{\gamma}{\text{time}}\vec{f}_{\text{time}}(t_i) + \vec{\gamma}_{\text{age}}\vec{f}_{\text{time}}(t_i)a_i + \vec{\gamma}_{\text{ede}}\vec{f}_{\text{ede}}(t_i)e_i + \vec{\gamma}_{\text{alb}}\vec{f}_{\text{alb}}(t_i)\log al_{it} \\ &\quad + \vec{\gamma}_{\text{pro}}\vec{f}_{\text{alb}}(t_i)\log p_{it} + \vec{\gamma}_{\text{biil}}\vec{f}_{\text{bili}}(t_i)\log b_{it}  \end{aligned} where\pi_{it}$is the probability that the$i$'th individual dies of cancer,$t$is the stop time of the bin and$a_i$,$ei$,$al{it}$,$p{it}$and$b{it}$are respectively the$i$'th individual's age, edema, albumin, protime and bilirunbin. The extra subscript$t$is added to refer to the level of the covariate in the bin at time$t$. It is important to notice that we will use the same bins as shown previously. In particular, we will use the pbc2_big_frame data frame from before.$\vec{f}_{\cdot}$are a smooth function. We will use cubic regression splines with knots spread evenly through the covariate values. We fit the model with the following call: library(mgcv, quietly = T) spline_fit <- gam( formula = Y ~ # cr is cubic spline with k knots s(t, bs = "cr", k = 3, by = age) + s(t, bs = "cr", k = 3, by = edema) + s(t, bs = "cr", k = 5, by = log(albumin)) + s(t, bs = "cr", k = 5, by = log(protime)) + s(t, bs = "cr", k = 5, by = log(bili)), family = binomial, data = pbc2_big_frame) The above estimates the GAM model where the likelihood is penalized by a L2 norm for each of the spline functions. The tuning parameters are chosen by generalized cross validation. Below are plots of the estimates: plot(spline_fit, scale = 0, page = 1, rug = F) Further, we compare the result with static model. Recall that our static model had the following estimates: glm_fit_big$coefficients

These do seem to correspond with the plots. Further, the intercept in the spline model is:

spline_fit$coefficients["(Intercept)"] which again seems to match. The plot suggest that there may be time varying effects for bili particularly # Time varying cox model from timereg Another method we can try is a time varying effects Cox model. We will use the Cox model from the package timereg based on [@martinussen07]. The model differs from a regular Cox model (e.g. by using the coxph function from the survival package) by replacing the coefficient$\vec{\beta}$by$\vec{\beta}(t)$such that the instantaneous hazard is: $$\lambda(t) = \lambda_0(t) \exp \left( \vec{x}\vec{\beta}(t) \right)$$ where each element of$\vec{\beta}(t)$is estimated recursively with an update equation. The update equation is simplified through a first order Taylor expansion of the score function and by adding a smoothness by using weighting depending on time changes with a uniform continuous kernel. For details see [@martinussen07] in chapter 6. The estimation method differs from other alternatives in R that use splines for the time varying effects. The baseline is$\lambda_0(t)=\exp(\alpha_0(t))$where$\alpha_0(t)$is estimated in a similar to way to$\vec{\beta}(t)$. Below we estimate a model similar to the previously fitted models. We set the last observation time (max.time) lower than in the previous model as there are issues with convergence if we do not. For the same reason we specify the effect of log(protime) to be constant (non-time varying). The cumulated coefficients$B(t) = \int_0^t \beta(t) dt$are plotted to inspect the fitted model. Thus, a constant effect should be roughly linear. library(timereg) cox_fit<- timecox(Surv(tstart / 365, tstop / 365, death == 2) ~ age + edema + log(albumin) + const(log(protime)) + log(bili), pbc2, start.time=0, max.time = 3000 / 365, # <-- decreased id = pbc2$id, bandwidth = 0.35) par(mfcol = c(3, 2)) plot(cox_fit)

The timecox packages provides two test for whether the coefficient is time varying or not:

summary(cox_fit)

The above test suggest that only edema might be "border line" time-varying.

# Dynamic hazard model

In this section, we will cover the dynamic hazard model with the logistic link function that is implemented in this package. The model is estimated with an EM-algorithm which are from [@fahrmeir94] and [@fahrmeir92] when an Extended Kalman Filter is used in the E-step are. Firstly, we will briefly cover the model. See vignette("ddhazard", "dynamichazard") for a more detailed explanation of the models. Secondly, we will turn to different ways of designing the model and fitting the model. The idea is that we discretize the outcomes into $1,2,\dots,d$ bins. In each bin, we observe whether the individual dies or survives. The state space model we are applying is of the form: $$\begin{array}{ll} \vec{y}_t = \vec{z}_t(\vec{\alpha}_t) + \vec{\epsilon}_t \qquad & \vec{\epsilon}_t \sim (\vec{0}, \text{Var}(\vec{y}_t | \vec{\alpha}t)) \ \vec{\alpha}{t + 1} = \mathbf{F}\vec{\alpha}_t + \mathbf{r}_t \qquad & \vec{\eta}_t \sim N(\vec{0}, \psi_t \mathbf{Q}) \ \end{array} , \qquad t = 1,\dots, d$$

where $y{it} \in {0,1}$ is an indicator for whether the $i$'th individual dies in interval $t$. $\cdots \sim (a, b)$ denotes a random variable with mean (or mean vector) $a$ and variance (or co-variance matrix) $b$. It needs not to be normally distributed. $z{it}(\vec{\alpha}_t) = h(\vec{\alpha}t\vec{x}{it})$ is the non-linear map from state space variables to mean outcomes where $h$ is the inverse link function. We use the logit model in this note. Thus, $h(x) = \text{logit}^{-1}(x)$. The current implementation supports first and second order random walk for the state equation. Further, we define the conditional covariance matrix in the observational equation as $\mathbf{H}_t(\vec{\alpha}_t) = \text{Var}(\vec{y}_t | \vec{\alpha}_t)$. $\psi_t$ is the length of the bin and is assumed equal for all values of $t$.

The unknown parameters are the starting value $\vec{\alpha}_0$ and co-variance matrix $\mathbf{Q}$. These are estimated with an EM-algorithm. The E-step is calculated using a Extended Kalman filter (EKF), Unscented Kalman filter (UKF) or sequential approximation of the modes. All are followed by a smoothing step. The result is smoothed predictions of $\vec{\alpha}_1,\dots, \vec{\alpha}_d$, smoothed co-variance matrix and smoothed correlation matrices that we need for the M-step to update $\vec{\alpha}_0$ and $\mathbf{Q}$.

## Estimation with Extended Kalman Filter

We start by estimating the model using the EKF where we let all coefficients follow a first order random walk:

library(dynamichazard) dd_fit <- ddhazard(Surv(tstart, tstop, death == 2) ~ age + edema + log(albumin) + log(protime) + log(bili), pbc2, id = pbc2$id, by = 100, max_T = 3600, Q_0 = diag(100, 6), Q = diag(0.01, 6), control = ddhazard_control(eps = .001)) plot(dd_fit) The arguments Q_0 and Q corresponds to the co-variance matrix at time zero and the co-variance matrix in the state equation. Q_0 will remain fixed while Q is the starting value in the first iteration of the EM algorithm after which we update$\mathbf{Q}$. Next, we plot the coefficient. That is, we plot the predicted latent variables$\vec{\alpha}_0,\dots,\vec{\alpha}_d$. Notice that the predicted coefficient are close to the estimates we saw previously for the GAM model. ## Extra iteration in the correction step Another idea is to take extra iterations in the correction step of the EKF. The motivation is that this step has the form of a Newton Rapshon algorithm as pointed out in [@fahrmeir92] and [@fahrmeir94]. Below, we estimate the model with potentially extra steps in the correction step. # Pre-computed sds of covariates sds <- c(1, 10, .2, .2, .1, 1) dd_fit <- ddhazard( Surv(tstart, tstop, death == 2) ~ age + edema + log(albumin) + log(protime) + log(bili), pbc2, id = pbc2$id, by = 100, max_T = 3600, Q_0 = diag(10 / sds), Q = diag(0.01 / sds, 6), control = ddhazard_control( eps = .001, NR_eps = 0.0001, # Tolerance in correction step LR = .33 )) # Plot result plot(dd_fit)

First, we run the code with the NR_eps element of the list passed to the control argument set to something that is finite. The value is the threshold for the relative change of in the state vector in correction step of the EKF. We end the code above by creating plots of the new estimates.

## Estimation with the Unscented Kalman Filter

Another option is to perform the E-step using an unscented Kalman filter. This is done below. We start by setting the initial co-variance matrix $\mat{Q}_0$ to have large values in the diagonal elements:

dd_fit_UKF <- ddhazard( Surv(tstart, tstop, death == 2) ~ age + edema + log(albumin) + log(protime) + log(bili), pbc2, id = pbc2$id, by = 100, max_T = 3600, Q_0 = diag(rep(1, 6)), Q = diag(rep(0.01, 6)), control = ddhazard_control(method = "UKF", beta = 0, alpha = 1, eps = 0.1, n_max = 1e4)) plot(dd_fit_UKF) Clearly, the plots of the estimates are not what we expected. The reason is that$\mat{Q}_0$'s diagonal entries are quite large. The square root of the diagonal entries are used to form the sigma points in the first iteration. Hence, we mainly get estimates that are either zero or one when$\mat{Q}_0$is a diagonal matrix with large entries. You can run the code below to see how the algorithm progress: # Not run tmp_file <- file("pick_some_file_name.txt") sink(tmp_file) dd_fit_UKF <- ddhazard( Surv(tstart, tstop, death == 2) ~ age + edema + log(albumin) + log(protime) + log(bili), pbc2, id = pbc2$id, by = 100, max_T = 3600, Q_0 = diag(rep(1, 6)), Q = diag(rep(0.01, 6)), control = ddhazard_control(method = "UKF", beta = 0, alpha = 1, debug = T)) # <-- prints information in each iteration sink() close(tmp_file)

It will print quite a lot of information and hence it is recommended to use sink to write the output to a file. The main take away is that the conditional co-variance matrices accumulate in each iteration while the state vectors does not move. This motivates us to pick $\mat{Q}$ and $\mat{Q}_0$ more carefully. Our estimates from the EKF was:

diag(dd_fit$Q) which could motivate us to make the following choices: dd_fit_UKF <- ddhazard( Surv(tstart, tstop, death == 2) ~ age + edema + log(albumin) + log(protime) + log(bili), pbc2, id = pbc2$id, by = 100, max_T = 3600, Q_0 = diag(c(0.001, 0.00001, rep(0.001, 4))) * 100, # <-- decreased Q = diag(0.0001, 6), # <-- decreased control = ddhazard_control(method = "UKF", beta = 0, alpha = 1, eps = 0.001)) plot(dd_fit_UKF)

This is comparable to the fits from the EKF and GAM model. The main points here are:

1. The UKF may work for certain data set. It may require tuning.
2. The algorithm is sensitive to the choice of $\mat{Q}$ and $\mat{Q}_0$. Further, there is dependence on hyperparameters $\alpha$, $\kappa$ and $\beta$ which we have not explored.
3. The output you get by setting debug in the list passed to the control argument can be useful. Combine this with sink because the output may be long.
4. The UKF has shown better performs than the EKF previously. Examples includes [@romanenko04], [@kandepu08], [@julier04], [@wan00] and chapter 11 of [@durbin12].

## Estimation with rank-one posterior modes approximation

Another method is to use the sequential rank-one approximation of the posterior posterior modes. This is done below. First, we estimate the model with the Extended Kalman filter and then we estimate the model with the rank-one posterior modes approximation method. We estimate both for comparison.

set.seed(7686280) # <-- Data is permuated so we set a seed
dd_fit_EKF <- ddhazard( Surv(tstart, tstop, death == 2) ~ age + edema + log(albumin) + log(protime) + log(bili), pbc2, id = pbc2$id, by = 100, max_T = 3600, Q_0 = diag(100, 6), Q = diag(0.01, 6), control = ddhazard_control(eps = .001)) dd_fit_SMA <- ddhazard( Surv(tstart, tstop, death == 2) ~ age + edema + log(albumin) + log(protime) + log(bili), pbc2, id = pbc2$id, by = 100, max_T = 3600, Q_0 = diag(100, 6), Q = diag(0.01, 6), control = ddhazard_control( method = "SMA", # change estimation method eps = 0.001))

Next, we plot the two sets of predicted coefficients:

par(mfcol = c(2, 3)) for(i in 1:6){ plot(dd_fit_EKF, cov_index = i, col = "black") plot(dd_fit_SMA, cov_index = i, col = "red", add = TRUE) }
par_old <- par(no.readonly = T) par(mfcol = c(2, 3)) par(cex = par_old$cex * .85, mar = par_old$mar * 1.33) for(i in 1:6){ plot(dd_fit_EKF, cov_index = i, col = "black") plot(dd_fit_SMA, cov_index = i, col = "red", add = TRUE) }

The red curves are from the sequential rank-one approximation. The difference seems minor for this data set.

## Estimation with fixed effects

We may want to fit a model where we assume that some of the co-variates are fixed. For instance, we may want to fit a model where age, the intercept, log(protime) and log(albumin) are fixed. We fix these based on the previous plots where the effects seems no to be time-varying. The model can be fitted as by the following call:

dd_fit <- ddhazard( Surv(tstart, tstop, death == 2) ~ ddFixed_intercept() + ddFixed(age) + ddFixed(log(albumin)) + edema + ddFixed(log(protime)) + log(bili), pbc2, id = pbc2$id, by = 100, max_T = 3600, Q_0 = diag(100, 2), Q = diag(0.01, 2), control = ddhazard_control(eps = .001)) plot(dd_fit) The predicted curves seems similar. Moreover, the fixed effects are in agreement with the previous fits (they are printed below): dd_fit$fixed_effects

## Second order random walk

We will end by fitting a second order random walk to model where only the log(bili) effect is time-varying. The motivation is that the second order random walk tend to diverge more easily especially for small data sets. Further, the previous models seems to suggest that it is the only covariate where we may have a time-varying coefficient. First, we fit the model:

# Define formula form <- Surv(tstart, tstop, death == 2) ~ ddFixed_intercept() + ddFixed(age) + ddFixed(edema) + ddFixed(log(albumin)) + ddFixed(log(protime)) + log(bili) # Fit models dd_fit_EKF <- ddhazard(form, pbc2, id = pbc2$id, by = 100, max_T = 3600, order = 2, # <-- second order Q_0 = diag(5, 2), # <-- needs more elements Q = .0001, # <-- decreased control = ddhazard_control( eps = .0001, est_a_0 = FALSE)) We have to decrease the starting value of Q in the above to get the methods to converge. The fixed effects estimates are: dd_fit_EKF$fixed_effects

The predicted curve is:

plot(dd_fit_EKF)

We see that the predicted curve is more smooth as expected with a second order random walk. Further, we can confirm that the fixed effects are comparable with our previous fits.

# Summary

We have estimated a model using Generalized additive models with the mgcv package and a time-varying Cox model with the timereg package. Further, we have illustrated how the ddhazard function in the dynamichazard package can be used. All the fits have been comparable with the Generalized Additive model.