The non-parametric step function (NPSF) approach to calculating expected durations from the Cox
proportional hazards model, described in Kropko and Harden (2018), uses the method proposed by
Cox and Oakes (1984, 107-109) for estimating the cumulative baseline hazard function. This
method is nonparametric and results in a step-function representation of the cumulative
baseline hazard.
Cox and Oakes (1984, 108) show that the cumulative baseline hazard function can be estimated
after fitting a Cox model by
$$\hat{H}_0(t) = \sum_{\tau_j < t}\frac{d_j}{\sum_{l\in\Re(\tau_j)}\hat{\psi}(l)},$$
where \(\tau_j\) represents time points earlier than \(t\), \(d_j\) is a count of the
total number of failures at \(\tau_j\), \(\Re(\tau_j)\) is the remaining risk set at \(\tau_j\),
and \(\hat{\psi}(l)\) represents the ELP from the Cox model for observations still in the
risk set at \(\tau_j\). This equation is used calculate the cumulative baseline hazard at
all time points in the range of observed durations. This estimate is a stepwise function
because time points with no failures do not contribute to the cumulative hazard, so the function
is flat until the next time point with observed failures.
We extend this method to obtain expected durations by first calculating the baseline survivor
function from the cumulative hazard function, using
$$\hat{S}_0(t) = \exp[-\hat{H}_0(t)].$$
Each observation's survivor function is related to the baseline survivor function by
$$\hat{S}_i(t) = \hat{S}_0(t)^{\hat{\psi}(i)},$$
where \(\hat{\psi}(i)\) is the exponentiated linear predictor (ELP) for observation \(i\).
These survivor functions can be used directly to calculate expected durations for each
observation. The expected value of a non-negative random variable can be calculated by
$$E(X) = \int_0^{\infty} \bigg(1 - F(t)\bigg)dt,$$
where \(F(.)\) is the cumulative distribution function for \(X\). In the case of a
duration variable \(t_i\), the expected duration is
$$E(t_i) = \int_0^T S_i(t)\,dt,$$
where \(T\) is the largest possible duration and \(S(t)\) is the individual's survivor
function. We approximate this integral with a right Riemann-sum by calculating the survivor
functions at every discrete time point from the minimum to the maximum observed durations,
and multiplying these values by the length of the interval between time points with observed failures:
$$E(t_i) \approx \sum_{t_j \in [0,T]} (t_j - t_{j-1})S_i(t_j).$$