emfrail: Fitting shared frailty models with the EM algorithm

Description

Fitting shared frailty models with the EM algorithm

Usage

emfrail(formula, data, distribution = emfrail_dist(),
  control = emfrail_control(), model = FALSE, model.matrix = FALSE, ...)

Arguments

formula

A formula that contains on the left hand side an object of the type Surv and on the right hand side a +cluster(id) statement. Optionally, also a +terminal() statement may be added, and then a score test for association between the event process and the result in the specified column is performed. See details.

data

A data frame in which the formula argument can be evaluated

distribution

An object as created by emfrail_dist

control

An object as created by emfrail_control

model

Logical. Should the model frame be returned?

model.matrix

Logical. Should the model matrix be returned?

...

Other arguments, currently used to warn about deprecated argument names

Value

An object of class emfrail that contains the following fields:

coefficients

A named vector of the estimated regression coefficients

hazard

The breslow estimate of the baseline hazard at each event time point, in chronological order

var

The variance-covariance matrix corresponding to the coefficients and hazard, assuming $\theta$ constant

var_adj

The variance-covariance matrx corresponding to the coefficients and hazard, adjusted for the estimation of theta

theta

The point estimate of $\theta$. For the gamma and PVF family of distributions, this is the inverse of the estimated frailty variance.

var_theta

The variance of the estimated $\theta$

ci_theta

The likelihood-based 95% confidence interval for $\theta$

frail

The posterior (empirical Bayes) estimates of the frailty for each cluster

residuals

A list with two elements, cluster which is a vector that the sum of the cumulative hazards from each cluster for a frailty value of 1, and individual, which is a vector that contains the cumulative hazard corresponding to each row of the data, multiplied by the corresponding frailty estimate

tev

The time points of the events in the data set, this is the same length as hazard

nevents_id

The number of events for each cluster

loglik

A vector of length two with the log-likelihood of the starting Cox model and the maximized log-likelihood

ca_test

The results of the Commenges-Andersen test for heterogeneity

cens_test

The results of the test for dependence between a recurrent event and a terminal event, if the +terminal() statement is specified and the frailty distribution is gamma

formula

The original formula argument

distribution

The original distribution argument

control

The original control argument

The model.frame, if model = TRUE

The model.matrix, if model.matrix = TRUE

Details

The emfrail function fits shared frailty models for processes which have intensity $$\lambda(t) = z \lambda_0(t) \exp(\beta' \mathbf{x})$$ with a non-parametric (Breslow) baseline intensity $\lambda_0(t)$. The distribution of $z$ is usually described by one parameter $\theta$. The family of supported distributions can be one of gamma, positive stable or PVF (power-variance-family). For details, see the vignette and emfrail_dist .

The algorithm is described in detail in the vignette. The short version is as follows: The objective is to maximize a marginal likelihood with respect to 3 sets of parameters, $\theta > 0$ (the parameter of the frailty distribution), $\beta$ (a vector of regression coefficients) and $\lambda_0$ (a vector of "jumps" of the cumulative hazards at each event time point). The essence of the algorithm relies on the observation that, if $\theta$ would be fixed, then the maximization with respect to $\beta, \theta$ can be done with an EM algorithm. This procedure has been described, for the gamma frailty, in Nielsen (1992). The "inner" problem is, for a fixed of $\theta$, to calculate $$\widehat{L}(\theta) = \mathrm{max}_{\beta, \lambda_0} L(\beta, \lambda_0 | \theta).$$ which is done via an EM algorithm. The "outer" problem is to calculate $$\widehat{L} = \mathrm{max}_{\theta} \widehat{L}(\theta).$$

The "inner" problem is solved relying on agreg.fit from the survival package in the M step and with an E step which is done in R when there are closed form solutions and in C++ when the solutions are calculated with a recursive algorithm. The information matrix (given fixed $\theta$) is calculated using Louis' formula. The "outer" problem is solved relying on the maximizers in the nlm function, which also provides an estimate of the Hessian matrix. The remaining elements of the information matrix are approximated numerically.

Several options are available to the user. In the distribution argument, the frailty distribution and the starting value for $\theta$ can be specified, as well as whether the data consists of left truncated survival times or not (the default). In the control argument, several parameters control the maximization. The convergence criterion for the EM can be specified (or a maximum number of iterations). The "outer" procedure can be avoided altogether and then emfrail calculates only $\widehat{L}(\theta)$ at the given starting value.

The Commenges-Andersen test is a score test for heterogeneity which test the null hypothesis that all frailties are equal to 1. This only uses the information from the initial proportional hazards model without frailty.

The score test for dependent censoring test is detailed in Balan et al (2016). If significant, it may indicate that dependent censoring is present. A common scenario is when recurrent events are stopped by a terminal event.

Examples

Run this code

# NOT RUN {
dat <- survival::rats


m1 <- emfrail(formula = Surv(time, status) ~  rx + sex + cluster(litter),
data =  dat)
m1
summary(m1)
# }
# NOT RUN {
# for the Inverse Gaussian distribution
m2 <- emfrail(formula = Surv(time, status) ~  rx + sex + cluster(litter),
             data =  dat,
             distribution = emfrail_dist(dist = "pvf"))
m2

# for the PVF distribution with m = 0.75
m3 <- emfrail(formula = Surv(time, status) ~  rx + sex + cluster(litter),
              data =  dat,
              distribution = emfrail_dist(dist = "pvf", pvfm = 0.75))
m3

# for the positive stable distribution
m4 <- emfrail(formula = Surv(time, status) ~  rx + sex + cluster(litter),
              data =  dat,
              distribution = emfrail_dist(dist = "stable"))
m4
# }
# NOT RUN {
# Compare marginal log-likelihoods
# }
# NOT RUN {
models <- list(m1, m2, m3, m4)

logliks <- lapply(models,
                  function(x) -x$outer_m$value)

names(logliks) <- lapply(models,
                         function(x) with(x$distribution,
                                          ifelse(dist == "pvf",
                                                 paste(dist, "/", pvfm),
                                                 dist))
)

logliks
# }
# NOT RUN {
# Draw the profile log-likelihood
# }
# NOT RUN {
fr_var <- seq(from = 0.01, to = 1.4, length.out = 20)

# For gamma the variance is 1/theta (see parametrizations)
pll_gamma <- emfrail_pll(formula = Surv(time, status) ~  rx + sex + cluster(litter),
                         data =  dat,
                         values = 1/fr_var )
 plot(fr_var, pll_gamma,
     type = "l",
     xlab = "Frailty variance",
     ylab = "Profile log-likelihood")
# }
# NOT RUN {
# The same can be done with coxph, where variance is refered to as "theta"
pll_cph <- sapply(fr_var, function(fr)
  coxph(data =  dat, formula = Surv(time, status) ~ rx + sex + frailty(litter, theta = fr),
        method = "breslow")$history[[1]][[3]])

# Same for inverse gaussian
pll_if <- emfrail_pll(data =  dat,
                      formula = Surv(time, status) ~  rx + sex + cluster(litter),
                      distribution = emfrail_dist(dist = "pvf"),
                      .values = 1/fr_var )

# Same for pvf with a psoitive pvfm parameter
pll_pvf <- emfrail_pll(data =  dat,
                       formula = Surv(time, status) ~  rx + sex + cluster(litter),
                       distribution = emfrail_dist(dist = "pvf", pvfm = 1.5),
                       .values = 1/fr_var )

miny <- min(c(pll_gamma, pll_cph, pll_if, pll_pvf))
maxy <- max(c(pll_gamma, pll_cph, pll_if, pll_pvf))

plot(fr_var, pll_gamma,
     type = "l",
     xlab = "Frailty variance",
     ylab = "Profile log-likelihood",
     ylim = c(miny, maxy))
points(fr_var, pll_cph, col = 2)
lines(fr_var, pll_if, col = 3)
lines(fr_var, pll_pvf, col = 4)

legend(legend = c("gamma (emfrail)", "gamma (coxph)", "inverse gaussian", "pvf, m=1.5"),
       col = 1:4,
       lty = 1,
       x = 0,
       y = (maxy + miny)/2)
# }
# NOT RUN {
# Recurrent events
mod_rec <- emfrail(Surv(start, stop, status) ~ treatment + cluster(id), bladder1)
# The warnings appear from the Surv object, they also appear in coxph.

summary(mod_rec)

# Create a histogram of the estimated frailties

plot(mod_rec, type = "hist")

# or, with ggplot:
# }
# NOT RUN {
library(ggplot2)
sum_mod_rec <- summary(mod_rec)

ggplot(sum_mod_rec$frail, aes(x = z)) +
  geom_histogram() +
  ggtitle("Estimated frailties")

# Plot the frailty estimates with quantiles of the estimated distribution
ggplot(sum_mod_rec$frail, aes(x = id, y = z)) +
  geom_point() +
  ggtitle("Estimated frailties") +
  geom_errorbar(aes(ymin = lower_q, ymax = upper_q))

# We can do the same plot but with the ordered frailties:
ord <- order(sum_mod_rec$frail$z)
# we need new x coordinates for that:
ordering <- 1:length(ord)

ggplot(sum_mod_rec$frail[ord,], aes(x = ordering, y = z)) +
  geom_point() +
  ggtitle("Estimated frailties") +
  geom_errorbar(aes(ymin = lower_q, ymax = upper_q))

# How do we know which id is which one now?
# We can make an interactive plot with ggplotly
# To add text to elements we add id in aes()

library(plotly)
ggplotly(
  ggplot(sum_mod_rec$frail[ord,], aes(x = ordering, y = z)) +
    geom_point(aes(id = id)) +
    ggtitle("Estimated frailties") +
    geom_errorbar(aes(ymin = lower_q, ymax = upper_q, id = id))
)
# }
# NOT RUN {
# Plot marginal and conditional curves
# For recurrent events, the survival is not very meaningful

plot(mod_rec, type = "pred", lp = 0, quantity = "cumhaz")
#The strong frailty "drags down" the intensity



# Left truncation

# N.B. This code takes a longer time to run

# }
# NOT RUN {
# We simulate some data with truncation times
set.seed(1)
x <- sample(c(0,1), 5 * 300, TRUE)
u <- rep(rgamma(300,1,1), each = 5)
stime <- rexp(5*300, rate = u * exp(0.5 * x))
status <- ifelse(stime > 5, 0, 1)
stime[status] <- 5

ltime <- runif(5 * 300, min = 0, max = 2)
d <- data.frame(id = rep(1:300, each = 5),
                x = x,
                stime = stime,
                u = u,
                ltime = ltime,
                status = 1)
d_left <- d[d$stime > d$ltime,]

# The worst thing that can be done is to
# Ignore the left truncation:
mod_1 <- emfrail(Surv(stime, status)~ x + cluster(id), d_left)

# The so-and-so thing is to consider the delayed entry time,
# But do not "update" the frailty distribution accordingly
mod_2 <- emfrail(Surv(ltime, stime, status)~ x + cluster(id), d_left)

# This is identical with
mod_cox <- coxph(Surv(ltime, stime, status)~ x + frailty(id), data = d_left)


# The correct thing is to update the frailty.
mod_3 <- emfrail(Surv(ltime, stime, status)~ x + cluster(id), d_left,
                 distribution = emfrail_dist(left_truncation = TRUE))

summary(mod_1)
summary(mod_2)
summary(mod_3)
# }

Run the code above in your browser using DataLab