# ddhazard Diagnostics

```
hook_output <- knitr::knit_hooks$get("output")
knitr::knit_hooks$set(output = function(x, options) {
lines <- options$output.lines
if (is.null(lines)) {
return(hook_output(x, options)) # pass to default hook
}
x <- unlist(strsplit(x, "\n"))
more <- "... output abbreviated ..."
if(is.list(lines)){
x <- lapply(lines, function(z) x[z])
for(i in 1:(length(x) - 1))
x[[i]] <- c(x[[i]], more)
x <- unlist(x)
}
else if (length(lines)==1) { # first n lines
if (length(x) > lines) {
# truncate the output, but add ....
x <- c(head(x, lines), more)
}
} else {
x <- c(more, x[lines], more)
}
# paste these lines together
x <- paste(c(x, ""), collapse = "\n")
hook_output(x, options)
})
knitr::opts_chunk$set(echo = TRUE, warning = F, message = F,
dpi = 100, cache = FALSE, dev = "png")
knitr::opts_knit$set(warning = F, message = F)
options(digits = 3)
knit_print.data.frame <- function(
x, ..., row.names = F, digits = getOption("digits")){
res <- paste(c(
"", "", knitr::kable(x, ..., row.names = row.names, digits = digits)),
collapse = "\n")
knitr::asis_output(res)
}
knit_print.matrix <- function(x, ..., digits = getOption("digits"), row.names = F, escape = T){
col.names <- rep("", ncol(x))
col.names <- if(is.null(colnames(x)))
rep("", ncol(x)) else colnames(x)
# The long table does not do well with the ( or [ symbol
# See http://tex.stackexchange.com/questions/228755/undefined-control-sequence-on-left-parenthesis-after-midrule-in-longtable
col.names[1] <- gsub("^\\(", "\\\\relax(", col.names[1], perl = TRUE)
col.names[1] <- gsub("^\\[", "\\\\relax[", col.names[1], perl = TRUE)
res <- paste(c(
"", "", knitr::kable(x, ..., row.names = row.names, col.names = col.names,
escape = escape, digits = digits)),
collapse = "\n")
knitr::asis_output(res)
}
```

\renewcommand{\vec}[1]{\bm{#1}}
\newcommand{\mat}[1]{\mathbf{#1}}
%
\newcommand{\Lparen}[1]{\left( #1\right)}
\newcommand{\Lbrack}[1]{\left[ #1\right]}
\newcommand{\Lbrace}[1]{\left { #1\right }}
\newcommand{\Lceil}[1]{\left \lceil #1\right \rceil}
\newcommand{\Lfloor}[1]{\left \lfloor #1\right \rfloor}
%
\newcommand{\propp}[1]{P\Lparen{#1}}
\newcommand{\proppCond}[2]{P\Lparen{\left. #1 \right\vert #2}}
%
\newcommand{\expecp}[1]{E\Lparen{#1}}
\newcommand{\expecpCond}[2]{E\Lparen{\left. #1 \right\vert #2}}
%
\newcommand{\varp}[1]{\textrm{Var}\Lparen{#1}}
\newcommand{\varpCond}[2]{\textrm{Var} \Lparen{\left. #1 \right\vert #2}}
%
\newcommand{\corpCond}[2]{\textrm{Cor} \Lparen{\left. #1 \right\vert #2}}
%
\newcommand{\covp}[1]{\textrm{Cov} \Lparen{#1}}
\newcommand{\covpCond}[2]{\textrm{Cov} \Lparen{\left. #1 \right\vert #2}}
%
\newcommand{\emNotee}[3]{#1*{\left. #2 \right\vert #3}}
\newcommand{\emNote}[4]{#1*{\left. #2 \right\vert #3}^{(#4)}}
%
\newcommand{\ukfNote}[2]{\mat{P}*{\vec{#1}, \vec{#2}}}
\newcommand{\ukfNotee}[3]{\mat{P}*{\vec{#1}*{#3}, \vec{#2}*{#3}}}
%
\newcommand{\diag}[1]{\text{diag}{\Lparen{#1}}}
\newcommand{\wvec}[1]{\widehat{\vec{#1}}}
\newcommand{\wmat}[1]{\widehat{\mat{#1}}}
\newcommand{\wtvec}[1]{\widetilde{\vec{#1}}}
\newcommand{\bvec}[1]{\bar{\vec{#1}}}
%
\newcommand{\deter}[1]{\left| #1 \right|}
%
\newcommand{\MyInd}[2]{\Lparen{#1}_{#2}}
%
\newcommand{\xyzp}[2]{#1\Lparen{#2}}
\newcommand{\xyzpCond}[3]{#1\Lparen{\left. #2 \right\vert #3}}

# Introduction

This vignette will show examples of how the `residuals`

and `hatvalues`

functions can be used for an object returned by `ddhazard`

. See `vignette("ddhazard", "dynamichazard")`

for details of the computations. Firstly, we will present all the tools that turned out to be useful for data sets. This is in the section called "First round of checks". Then we return to the data sets using the other methods. The latter part is included to illustrate how the function in this package work. This is in the section called "Second round of checks".

# First round of checks

## Data set 1: Prisoner recidivism

The first data set we will look at is an experimental study of recidivism of 432 male prisoners a year after being released from prison. The details of the data set are in [@rossi80]. The study involved randomly giving financial aid to the prisoners when they where released to see if this affected recidivism. The variables we will look at are:

`fin`

: 1 if the prisoner got aid and zero otherwise.`age`

: age at time of release.`prio`

: number of prior convictions.`employed.cumsum`

: Cumulative number of weeks employed from the date of release. This will vary through time.`event`

: 1 if the prisoner is rearrested.

A `.pdf`

file called `Appendix-Cox-Regression.pdf`

was previously on CRAN where they analyze this data set with the Cox regression model. They found:

- No very influential observation.
- No sign that the proportional-hazards assumption is violated. That is, no sign that the coefficients vary through time.
- Minor sign of non-linear effects.

### Loading the data set

We load the data set with the next line. The details of how the `.RData`

file is made is on the github site in the `vignettes/Diagnostics/`

folder.

```
load("Diagnostics/Rossi.RData")
# We only keep some of the columns
Rossi <- Rossi[
, c("id","start","stop","event", "fin", "age", "prio", "employed.cumsum")]
```

The data is in the typical start-stop form for Survival analysis. We print the number of individuals and show an example of how the data looks for one of the individuals:

```
# Number of unique individauls
length(unique(Rossi$id))
# Show example for a given individual
Rossi[Rossi$id == 2, ]
```

Next, we illustrate which of the variables are and which are not time-varying:

```
# See the varying and non-varying covariates
# The output shows the mean number of unique values for each individual
tmp <-
by(Rossi[, ], Rossi$id, function(dat)
apply(dat, 2, function(x) sum(!duplicated(x))))
colMeans(do.call(rbind, as.list(tmp)))
```

```
# Indivuals are observed for at most one year
stopifnot(setequal(unique(Rossi$stop), 1:52))
```

The events happens more less evenly across time after the first 8 weeks as the next plot shows. Hence, we may see an increasing intercept later since all individuals start at time zero. Thus, the risk sets only get smaller as time progress while roughly the same number of people gets rearrested:

```
# Individuals have event through out the period
plot(xtabs(~ Rossi$stop[Rossi$event == 1]), xlab = "Week number",
ylab = "Number of recidivism")
```

```
# All individuals have gabs of one week
stopifnot(all(
unlist(tapply(Rossi$stop, Rossi$id, diff)) == 1))
# People have at most one event
is_event <- Rossi[Rossi$id %in% Rossi$id[Rossi$event == 1], ]
stopifnot(all(tapply(is_event$event, is_event$id, sum) == 1))
# People alwas get arrested on the last record
stopifnot(all(
tapply(1:nrow(is_event), is_event$id, function(is){
rows <- is_event[is, ]
max(rows$stop) == rows$stop[rows$event == 1]
})))
```

### Estimation

We estimate the model with 4 terms and a intercept as follows:

```
library(dynamichazard)
dd_rossi <- ddhazard(
Surv(start, stop, event) ~ fin + age + prio + employed.cumsum,
data = Rossi, id = Rossi$id, by = 1, max_T = 52,
Q_0 = diag(10000, 5), Q = diag(.01, 5),
control = ddhazard_control(eps = .001, n_max = 250))
```

Then we plot the predicted coefficients:

`plot(dd_rossi)`

The dashed lines are 95% confidence bounds from the smoothed covariance matrices. Both the intercept and the age seems to have time-varying coefficients.

### Hat values

We start by looking at the "hat-like" values which are suggested in the ddhazard vignette. These are computed by calling the `hatvalues`

function as follows:

`hats <- hatvalues(dd_rossi)`

The returned object is list with a matrix for each interval. Each matrix has a column for the hat values, the row number in the original data set and the id of the individual the hat values belongs to:

```
str(hats[1:3]) # str of first three elements
head(hats[[1]], 10) # Print the head of first matrix
```

```
stack_hats <- function(hats){
# Stack
resids_hats <- data.frame(do.call(rbind, hats), row.names = NULL)
# Add the interval number
n_rows <- unlist(lapply(hats, nrow))
interval_n <- unlist(sapply(1:length(n_rows), function(i) rep(i, n_rows[i])))
resids_hats <- cbind(resids_hats, interval_n = interval_n)
# Sort by id and interval number
resids_hats <-
resids_hats[order(resids_hats$id, resids_hats$interval_n), ]
resids_hats
}
```

We have defined a function to stack the hat values for each individual called `stack_hats`

such that we get a single matrix instead of list of matrices. Further, the function also adds an extra column for the interval number. The definition of the function is printed at the end of this vignette. The stacked values looks as follows:

```
hats_stacked <- stack_hats(hats)
head(hats_stacked)
```

The hat values are skewed with a few large values as shown in the histogram below:

```
# Draw histogram of hat values
hist(log10(hats_stacked$hat_value), main = "",
xlab = "Histogram of log10 of Hat values")
```

We find the maximum hat value for each individual and print the largest of them:

```
# Print the largest values
max_hat <- tapply(hats_stacked$hat_value, hats_stacked$id, max)
head(sort(max_hat, decreasing = T), 5)
```

The names of the returned vector is the id of the individual. One seems to stand out. Next, we plot the hat values against time and highlight the individuals with the largest maximum hat value by giving them a red color and adding a number for the rank of their maximum hat value:

```
# We will highlight the individuals with the highest hatvalues
is_large <-
names(head(sort(max_hat, decreasing = T), 5))
# Plot hat values
plot(range(hats_stacked$interval_n), c(0, 0.03), type = "n",
xlab = "Interval number", ylab = "Hat value")
invisible(
by(hats_stacked, hats_stacked$id, function(rows){
has_large <- rows$id[1] %in% is_large
col <- if(has_large) "Red" else "Black"
lines(rows$interval_n, rows$hat_value, lty = 2,
col = col)
if(has_large){
pch <- as.character(which(rows$id[1] == is_large))
points(rows$interval_n, rows$hat_value, pch = pch, col = col)
}
}))
```

We print the last record for each of the above shown in red to get an idea of why their hat values are large:

```
# These are the individuals id
is_large
# We print the last record each of these
Rossi_subset <- Rossi[
unlist(sapply(is_large, function(x) which(x == Rossi$id))), ]
Rossi_subset <- Rossi_subset[nrow(Rossi_subset):1, ]
Rossi_subset[!duplicated(Rossi_subset$id), ]
```

Some have a large number of prior convictions as shown in the next plot. Moreover, we can see by the next histogram that the number of prior convictions is skewed:

```
tmp <- xtabs(~Rossi$prio[!duplicated(Rossi$id)])
plot(as.numeric(names(tmp)), c(tmp), ylab = "frequency", type = "h",
xlab = "Number of prior convictions")
```

This could suggest a transformation of the variables. Thus, we try with the logarithm of the value plus one with one added to avoid $\log(0)$ (with one arbitrarily chosen):

```
tmp <- xtabs(~log(Rossi$prio[!duplicated(Rossi$id)] + 1))
plot(as.numeric(names(tmp)), c(tmp), ylab = "frequency", type = "h",
xlab = "Log(Number of prior convictions + 1)")
```

Below, we make a fit where we use this transformation of prior convictions instead:

```
dd_rossi_trans <- ddhazard(
Surv(start, stop, event) ~ fin + age + log(prio + 1) + employed.cumsum,
data = Rossi, id = Rossi$id, by = 1, max_T = 52,
Q_0 = diag(10000, 5), Q = diag(.01, 5),
control = ddhazard_control(eps = .001, n_max = 250))
plot(dd_rossi_trans)
```

computing the hat values and making a similar plot to the one before shows that the individuals are much less influential.

```
hats <- hatvalues(dd_rossi_trans)
hats_stacked <- stack_hats(hats)
```

A question is whether the new model fits better. Thus, we compare the mean logistic loss of the two models in-sample:

```
# Fit the two models
f1 <- ddhazard(
Surv(start, stop, event) ~ fin + age + prio + employed.cumsum,
data = Rossi, id = Rossi$id, by = 1, max_T = 52,
Q_0 = diag(10000, 5), Q = diag(.01, 5),
control = ddhazard_control(eps = .001, n_max = 250))
f2 <- ddhazard(
Surv(start, stop, event) ~ fin + age + log(prio + 1) + employed.cumsum ,
data = Rossi, id = Rossi$id, by = 1, max_T = 52,
Q_0 = diag(10000, 5), Q = diag(.01, 5),
control = ddhazard_control(eps = .001, n_max = 250))
# Compute residuals
res1 <- residuals(f1, type = "pearson")
res2 <- residuals(f2, type = "pearson")
# Compute logistic loss
log_error1 <- unlist(
lapply(res1$residuals, function(x)
ifelse(x[, "Y"] == 1, log(x[, "p_est"]), log(1 - x[, "p_est"]))))
log_error2 <- unlist(
lapply(res2$residuals, function(x)
ifelse(x[, "Y"] == 1, log(x[, "p_est"]), log(1 - x[, "p_est"]))))
# Compare mean
print(c(res1 = mean(log_error1), res2 = mean(log_error2)),
digits = 8)
```

The difference is very small.

## Data set 2: Worcester Heart Attack Study

Next, We will look at the Worcester Heart Attack Study. The dataset contains individuals who had a heart attack and is then followed up to check if they die within the following days. Below, we load the the data and plot the date of deaths:

```
load("Diagnostics/whas500.RData")
hist(whas500$lenfol[whas500$fstat == 1], breaks = 20,
xlab = "Time of death", main = "")
```

The large peak at the start is due to a lot of individuals who dies doing the hospital stay shortly after their first heart attack. All covariates in the dataset are time-invariant records from the day that the individual had the first heart attack. We will look at gender, age, BMI, binary for whether the individual has a history of cardiovascular disease (`cvd`

) and heart rate (`hr`

). The first entries and summary stats are printed below:

```
# We only keep some of the columns
whas500 <- whas500[
, c("id", "lenfol", "fstat", "gender", "age", "bmi", "hr", "cvd")]
# First rows
head(whas500, 10)
# Summary stats
summary(whas500[, c("age", "bmi", "hr", "gender", "cvd")])
```

### Estimation

We estimate the model as follows:

```
dd_whas <- ddhazard(
Surv(lenfol, fstat) ~ gender + age + bmi + hr + cvd,
data = whas500, by = 100, max_T = 2000,
Q_0 = diag(10000, 6), Q = diag(.1, 6),
control = ddhazard_control(eps = .001))
plot(dd_whas)
```

The intercept drops in the first period which possibly is due to the initial high number of deaths right after the first heart attack. We further simplify the model the model by removing `gender`

variable which seems to be zero:

```
dd_whas <- ddhazard(
Surv(lenfol, fstat) ~ age + bmi + hr + cvd,
data = whas500, by = 100, max_T = 2000,
Q_0 = diag(10000, 5), Q = diag(.1, 5),
control = ddhazard_control(eps = .001))
plot(dd_whas)
```

### Leaving out a covaraite

Suppose that we had not included the age to start with:

```
dd_whas_no_age <- ddhazard(
Surv(lenfol, fstat) ~ bmi + hr + cvd, # No age
data = whas500, by = 100, max_T = 1700,
Q_0 = diag(10000, 4), Q = diag(.1, 4),
control = ddhazard_control(eps = .001))
plot(dd_whas_no_age)
```

Then we could think that we would be able to see that the covariate was left out using the Pearson residuals as in a regular logistic regression. We compute the Pearson residuals to see if this would work:

`obs_res <- residuals(dd_whas_no_age, type = "pearson")`

The returned object is a list with two elements. One element denotes the type of the residuals and another contains the residuals. The latter is a list with a matrix for each interval. Each matrix has four columns for the residuals, the computed likelihood of an event, the outcome and the row number in the initial data matrix for those that were at risk in the interval. This is illustrated in the next lines:

```
# We have matrix for each interval
length(obs_res$residuals)
# Shows the structure of the matrices. We only print take the first 5 matrices
str(obs_res$residuals[1:5])
# Print the first entries of the first interval
head(obs_res$residuals[[1]])
```

The list of matrices is un-handy so we have defined a function called `stack_residuals`

to stack the matrices, add the interval number and the id of that the residuals belong to in. The definition of the function is given at the end of this vignette.

```
stack_residuals <- function(fit, resids){
if(!inherits(resids, "ddhazard_residual"))
stop("Residuals must have class 'ddhazard_residual'")
if(!inherits(fit, "ddhazard"))
stop("fit must have class 'ddhazard'")
# Stack the residuals
resids_stacked <-
data.frame(do.call(rbind, resids[[1]]), row.names = NULL)
# Add the interval number and id
n_rows <- unlist(lapply(resids$residuals, nrow))
interval_n <- unlist(sapply(1:length(n_rows), function(i) rep(i, n_rows[i])))
resids_stacked <- cbind(
resids_stacked,
interval_n = interval_n,
id = fit$id[resids_stacked$row_num])
# Sort by id and interval number
resids_stacked <-
resids_stacked[order(resids_stacked$id, resids_stacked$interval_n), ]
resids_stacked
}
```

```
resids_stacked <- stack_residuals(fit = dd_whas_no_age, resids = obs_res)
# print the first entries
head(resids_stacked, 10)
```

Next, we add the age variable to the stacked residuals, stratify the age variable, compute cumulated mean across each stratum in each interval and plot against time:

```
# Add age variable
resids_stacked$age <-
whas500$age[resids_stacked$row_num]
# Stratify
age_levels <- quantile(whas500$age, seq(0, 1, by = .2))
age_levels
resids_stacked$age_cut <- cut(resids_stacked$age, age_levels)
# Compute the means
cut_means <-
tapply(resids_stacked$residuals,
list(resids_stacked$interval_n, resids_stacked$age_cut),
mean)
head(cut_means)
# Plot against time
colfunc <- colorRampPalette(c("Black", "Blue"))
cols <- colfunc(ncol(cut_means))
matplot(dd_whas_no_age$times[-1], apply(cut_means, 2, cumsum),
type = "l", col = cols, xlab = "Time", lwd = 2,
lty = 1, ylab = "Cumulated mean Pearson residuals")
abline(h = 0, lty = 2)
legend("topleft", bty = "n",
lty = rep(1, ncol(cut_means)),
legend = colnames(cut_means),
col = cols, lwd = 2,
cex = par()$cex * .8)
```

We see that the older and youngest strata stand out and deviates from zero. Hence, suggesting that the age variable should have been in the model.

# Second round of checks

We will illustrate some other uses of the `residuals`

function in this section. This part is included more to show how the function works as they do not "discover" anything new about the data sets.

## Residuals from state space vector

We start by looking at the standardized state space errors as explained in the ddhazard vignette. We may expect these to be standard iid normal distributed. First, we compute the values with the `residuals`

by passing `type = "std_space_error"`

with the first fit we made with the Rossi data set:

```
stat_res <- residuals(dd_rossi, type = "std_space_error")
str(stat_res)
```

The output is a list with the residuals, smoothed covariance matrices for the errors and a binary variable to indicate whether or not the residuals are standardized. Next, we can plot the residuals as follows:

`stopifnot(all(abs(stat_res$residuals) < 1.96))`

`plot(stat_res, mod = dd_rossi, p_cex = .75, ylim = c(-2, 2))`

The variables appears to be nothing like standard normal (we have $52 \cdot 6$ residuals with no one outside $\pm 1.96$). Another idea is only marginally standardize (that is, not rotate the residuals). This can be done as follows:

```
# Get non-standardized residuals
stat_res <- residuals(dd_rossi, type = "space_error")
# Standardize marginally
for(i in 1:nrow(stat_res$residuals))
stat_res$residuals[i, ] <- stat_res$residuals[i, ] /
sqrt(diag(stat_res$Covariances[, , i]))
# Plot
par(mfcol = c(2, 3))
for(i in 1:ncol(stat_res$residuals))
plot(stat_res, mod = dd_rossi, p_cex = .75, cov_index = i,
ylab = colnames(stat_res$residuals)[i],
ylim = c(-2, 2))
```

which I again find hard to draw any conclusion from. It seems like there is structure in the errors for the intercept and `fin`

. However, this might be what we expected. For instance, we may expect the intercept to increase through time (i.e. not be random as assumed by the model). Further, assuming that the covariance estimate are conservative then there might be an error for `prio`

in interval 20-25 and an error in `age`

in interval 10-12 that seem extreme. We can do the same thing for the first fit with the Rossi data set:

```
stat_res <- residuals(dd_whas, type = "std_space_error")
plot(stat_res, mod = dd_whas, ylim = c(-4, 4), p_cex = .8)
```

Again, the errors seems to have to low variance to be standard normal apart from one which is more than three standard deviations away. I am not sure what to conclude from this. A question is whether we would see the same for a simulated data set where the true coefficients follows a random walk. We check this in the next paragraph.

#### Simmulation

```
if(requireNamespace("dichromat", quietly = T, warn.conflicts = F)){
cols <- c("#000000", dichromat::colorschemes$Categorical.12[c(6, 8, 10)])
} else
cols <- c("#000000", "#009E73", "#e79f00", "#9ad0f3")
```

We start by getting a the definition of the `test_sim_func_logit`

function in the `dynamichazard`

package which is not exported. We will use it to simulate the individuals series:

`sim_func <- with(environment(ddhazard), test_sim_func_logit)`

Then we simulate the coefficients and plot them:

```
# Simulate the coefficients
set.seed(556189)
betas <- matrix(rnorm(21 * 4), ncol = 4)
betas[, 1] <- betas[, 1] * 0.25 # reduce the variance of the intercept
betas <- apply(betas, 2, cumsum) # accumulate the innovations
betas[, 1] <- betas[, 1] - 4 # we reduce the intercept
# Plot the simulated coefficients
matplot(betas, col = cols, lty = 1, type = "l")
```

We reduce the variance of the intercept and decrease the intercept to yield a lower baseline risk of an event. The individuals and their outcomes are simulated as follows:

- Each individual start at time $0$.
- The individuals covariates are simulated from $\text{Unif}(-1, 1)$.
- We update the individuals covariate with intervals times drawn from a $\text{Exp}(1/10)$.

This is done in the following call:

```
# Simulate series
sim_dat <- sim_func(
n_series = 500, # number of individuals
t_max = 20, # the last stop time
x_range = 2, # what is the uniform range to draw from
x_mean = 0, # the mean of the uniform covariates
n_vars = 3, # 4 - 1 for the intercept
lambda = 1/10, # lambda in the exponential distribution for time
# between updates of covariate vectors
betas = betas)
```

The first rows of the simulation looks as follows

`head(sim_dat$res, 10)`

Next, we estimate the model and compare the estimated coefficients with the fit:

```
f1 <- ddhazard(
Surv(tstart, tstop, event) ~ x1 + x2 + x3,
sim_dat$res, by = 1, max_T = 20, id = sim_dat$res$id,
Q_0 = diag(10000, 4), Q = diag(.1, 4),
control = ddhazard_control(eps = .001))
matplot(betas, col = cols, lty = 1, type = "l")
matplot(f1$state_vecs, col = cols, lty = 2, type = "l", add = T)
```

The full lines are the true coefficients and the dashed lines are the estimates. The question is then how the standardized state space errors look. We plot these below

```
stat_res <- residuals(f1, type = "std_space_error")
plot(stat_res, mod = f1, p_cex = .8)
```

The errors seems more variable than before. We make a QQ-plot below. Apart from one error it seems quite close to a normal distribution.

```
qqnorm(c(stat_res$residuals), pch = 16, cex = .8)
qqline(c(stat_res$residuals))
```

Re-running the above three times gives the following plots:

```
# CHECK: arguments below match those above
par(mfcol = c(3, 3))
set.seed(189780)
for(i in 1:3){
# Simulate the coefficients
betas <- matrix(rnorm(21 * 4), ncol = 4)
betas[, 1] <- betas[, 1] * 0.25
betas <- apply(betas, 2, cumsum)
betas[, 1] <- betas[, 1] - 4
# Simulate series
sim_dat <- sim_func(
n_series = 500,
t_max = 20,
x_range = 2,
x_mean = 0,
n_vars = 3,
lambda = 1/10,
betas = betas)
# Fit
f1 <- ddhazard(
Surv(tstart, tstop, event) ~ x1 + x2 + x3,
sim_dat$res, by = 1, max_T = 20, id = sim_dat$res$id,
Q_0 = diag(10000, 4), Q = diag(.1, 4),
control = ddhazard_control(eps = .001))
# Plot coefficients
matplot(betas, col = cols, lty = 1, type = "l",
ylim = range(betas, f1$state_vecs))
matplot(f1$state_vecs, col = cols, lty = 2, type = "l", add = T)
# Plot errors
stat_res <- residuals(f1, type = "std_space_error")
plot(stat_res, mod = f1, p_cex = .8)
# QQ-plos
qqnorm(c(stat_res$residuals), pch = 16, cex = .8)
qqline(c(stat_res$residuals))
}
```

It is worth stressing that neither the initial seed nor the parameters have been selected here to yield a particular result. The take away for me is that the previous finding with the Rossi data set and WHAS data set is an artifact of the data set and model specification and not something we would see if we have an actual random walk model it seems.

## Hat values for Worcester Heart Attack Study

In the following paragraphs, we check hat values for Worcester Heart Attack Study data set. First, we compute them:

```
hats <- hatvalues(dd_whas)
hats_stacked <- stack_hats(hats)
```

Then we compute the cumulated hat values for each individuals and plot against time

```
# Compute cumulated hat values
hats_stacked$hats_cum <- unlist(tapply(
hats_stacked$hat_value, hats_stacked$id, cumsum))
# Plot the cumulated residuals for each individual
plot(c(1, 20), range(hats_stacked$hats_cum), type = "n",
xlab = "Interval number", ylab = "Cumulated hat values")
invisible(
tapply(hats_stacked$hats_cum, hats_stacked$id, lines,
col = gray(0, alpha = .2)))
```

Three individuals seems to stand out. We look at these in the next line:

```
max_cum <- tapply(hats_stacked$hats_cum, hats_stacked$id, max)
is_max <- order(max_cum, decreasing = T)[1:3]
is_max
# The records for these
whas500[is_max, ]
```

One of them has a high heart rate while the two others have a low BMI (below 18.5 is underweight). Another idea is to look at the average up to the given time of the hat values. The motivation is the two lines that end around the 5'th interval either because they die or are right censored. Hence, we normalize by the interval number and plot against time

```
# Averages of hat values
hats_stacked$hats_avg <- hats_stacked$hats_cum / hats_stacked$interval_n
# Plot against time
plot(c(1, 20), range(hats_stacked$hats_avg), type = "n",
xlab = "Interval number", ylab = "Avg. hat values")
invisible(
tapply(hats_stacked$hats_avg, hats_stacked$id, lines,
col = gray(0, alpha = .2)))
```

Indeed the two stands. Hence, we look further at the five largest values:

```
max_avg <- tapply(hats_stacked$hats_avg, hats_stacked$id, max)
is_max_avg <- order(max_avg, decreasing = T)[1:5]
is_max_avg
# The records for these
whas500[is_max_avg, ]
```

The two new ones with the highest values are one who is old and another with an extreme heart rate (a typical rule rule of thumb is that the maximum heart rate is $220$ less your age!). In order to show this, we make the following plots:

```
# Setup parameters for the plot
cols <- rep("Black", 500)
cols[1:500 %in% is_max_avg] <- "Blue"
cols[1:500 %in% is_max] <- "Red"
cexs <- ifelse(cols != "Black", par()$cex * 1.25, par()$cex * .75)
pchs <- ifelse(whas500$fstat == 1 & whas500$lenfol <= 2000, 16, 1)
plot(whas500[, c("age", "hr", "bmi")], pch = pchs, cex = cexs, col = cols)
```

Filled circles are cases and non-filled circles are censored. The blue dots are the ones with a high maximum average hat value while the red one have both a high maximum average and a high cumulative hat values. Plotting against time shows the censoring is is clustered in time as shown in the next two plots:

```
plot(whas500$lenfol, whas500$hr, col = cols, pch = pchs,
xlab = "Follow-up time", ylab = "Heart rate")
plot(whas500$lenfol, whas500$age, col = cols, pch = pchs,
xlab = "Follow-up time", ylab = "Age")
```

This could motivate us to stop the slightly before the last cluster of censoring of individuals occurs. Thus, we set the final time (the `max_T`

argument to `ddhazard`

) to 1700 instead of 2000 in the next code block where we re-estimate the model. Further, we make a fit without the "extreme" individuals:

```
dd_whas <- ddhazard(
Surv(lenfol, fstat) ~ age + bmi + hr + cvd,
data = whas500, by = 100, max_T = 1700,
Q_0 = diag(10000, 5), Q = diag(.1, 5),
control = ddhazard_control(eps = .001))
dd_whas_no_extreme <- ddhazard(
Surv(lenfol, fstat) ~ age + bmi + hr + cvd,
data = whas500[-c(is_max, is_max_avg), ], # we exclude the "extreme" persons
by = 100, max_T = 1700,
Q_0 = diag(10000, 5), Q = diag(.1, 5))
```

We plot the two sets of predicted coefficients next:

```
par(mfcol = c(2,3))
for(i in 1:5){
plot(dd_whas, cov_index = i)
plot(dd_whas_no_extreme, cov_index = i, add = T, col = "DarkBlue")
}
```

The blue line is the predicted coefficients without the "extreme" individuals. The difference seems minor

# Function definitions

The functions used in this vignette that are no included in the package are defined below: