sim.survdata: Simulating duration data for the Cox proportional hazards model

Description

sim.survdata() randomly generates data frames containing a user-specified number of observations, time points, and covariates. It generates durations, a variable indicating whether each observation is right-censored, and "true" marginal effects. It can accept user-specified coefficients, covariates, and baseline hazard functions, and it can output data with time-varying covariates or using time-varying coefficients.

Usage

sim.survdata(N = 1000, T = 100, type = "none", hazard.fun = NULL,
  num.data.frames = 1, fixed.hazard = FALSE, knots = 8,
  spline = TRUE, X = NULL, beta = NULL, xvars = 3, mu = 0,
  sd = 0.5, covariate = 1, low = 0, high = 1, compare = median,
  censor = 0.1, censor.cond = FALSE)

Arguments

N

Number of observations in each generated data frame. Ignored if X is not NULL

T

The latest time point during which an observation may fail. Failures can occur as early as 1 and as late as T

type

If "none" (the default) data are generated with no time-varying covariates or coefficients. If "tvc", data are generated with time-varying covariates, and if "tvbeta" data are generated with time-varying coefficients (see details)

hazard.fun

A user-specified R function with one argument, representing time, that outputs the baseline hazard function. If NULL, a baseline hazard function is generated using the flexible-hazard method as described in Harden and Kropko (2018) (see details)

num.data.frames

The number of data frames to be generated

fixed.hazard

If TRUE, the same hazard function is used to generate each data frame. If FALSE (the default), a new hazard function is drawn for each data frame. Ignored if hazard.fun is not NULL or if num.data.frames is 1

knots

The number of points to draw while using the flexible-hazard method to generate hazard functions (default is 8). Ignored if hazard.fun is not NULL

spline

If TRUE (the default), a spline is employed to smooth the generated cumulative baseline hazard, and if FALSE the cumulative baseline hazard is specified as a step function with steps at the knots. Ignored if hazard.fun is not NULL

X

A user-specified data frame containing the covariates that condition duration. If NULL, covariates are generated from normal distributions with means given by the mu argument and standard deviations given by the sd argument

beta

Either a user-specified vector containing the coefficients for the linear part of the duration model, or a user-specified matrix with rows equal to T for pre-specified time-varying coefficients. If NULL, coefficients are generated from normal distributions with means of 0 and standard deviations of 0.1

xvars

The number of covariates to generate. Ignored if X is not NULL

mu

If scalar, all covariates are generated to have means equal to this scalar. If a vector, it specifies the mean of each covariate separately, and it must be equal in length to xvars. Ignored if X is not NULL

sd

If scalar, all covariates are generated to have standard deviations equal to this scalar. If a vector, it specifies the standard deviation of each covariate separately, and it must be equal in length to xvars. Ignored if X is not NULL

covariate

Specification of the column number of the covariate in the X matrix for which to generate a simulated marginal effect (default is 1). The marginal effect is the difference in expected duration when the covariate is fixed at a high value and the expected duration when the covariate is fixed at a low value

low

The low value of the covariate for which to calculate a marginal effect

high

The high value of the covariate for which to calculate a marginal effect

compare

The statistic to employ when examining the two new vectors of expected durations (see details). The default is median

censor

The proportion of observations to designate as being right-censored

censor.cond

Whether to make right-censoring conditional on the covariates (default is FALSE, but see details)
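
As a rough illustration of how several of these arguments combine, the sketch below requests three covariates with separate means and standard deviations, supplies the coefficients directly, and censors 20 percent of observations. It uses only arguments listed above; the object names (sim, cov.means, cov.sds, true.beta) and the specific values are hypothetical, and the call is guarded so that it only runs when the coxed package is installed.

```r
# Illustrative values only; none of these numbers come from the package.
n.covariates <- 3
cov.means    <- c(0, 1, 2)      # one mean per covariate (length must equal xvars)
cov.sds      <- c(0.5, 1, 1.5)  # one standard deviation per covariate
true.beta    <- c(1, -1.5, 2)   # user-specified coefficients

if (requireNamespace("coxed", quietly = TRUE)) {
  library(coxed)
  set.seed(42)
  sim <- sim.survdata(N = 500, T = 100, xvars = n.covariates,
                      mu = cov.means, sd = cov.sds, beta = true.beta,
                      censor = 0.2, num.data.frames = 1)
  head(sim$data)  # durations, censoring indicator, and covariates
  sim$betas       # the coefficients used to generate the data
}
```

When mu and sd are vectors, their lengths need to match xvars, which is why the sketch keeps the three values together.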

Value

Returns an object of class "simSurvdata", which is a list of length num.data.frames with one element for each iteration of data simulation. Each element of this list is itself a list with the following components:

`data`	The simulated data frame, including the simulated durations, the censoring variable, and covariates
`xdata`	The simulated data frame, containing only covariates
`baseline`	A data frame containing every potential failure time and the baseline failure PDF, baseline failure CDF, baseline survivor function, and baseline hazard function at each time point.
`xb`	The linear predictor for each observation
`exp.xb`	The exponentiated linear predictor for each observation
`betas`	The coefficients, varying over time if `type` is "tvbeta"
`ind.survive`	An (`N` x `T`) matrix containing the individual survivor function at time t for the individual represented by row n
`marg.effect`	The simulated marginal change in expected duration comparing the high and low values of the variable specified with `covariate`
`marg.effect.data`	The `X` matrix and vector of durations for the low and high conditions

Details

The sim.survdata function generates simulated duration data. It can accept a user-supplied hazard function, or else it uses the flexible-hazard method described in Harden and Kropko (2018) to generate a hazard that does not necessarily conform to any parametric hazard function. It can generate data with time-varying covariates or coefficients. For time-varying covariates (type="tvc"), it employs the permutational algorithm of Sylvestre and Abrahamowicz (2008). For time-varying coefficients (type="tvbeta"), the first beta coefficient, whether supplied by the user or generated by the function, is multiplied by the natural log of the failure time under consideration.
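
The time-varying coefficient rule can be pictured as a matrix with one row per time point, matching the matrix form that the beta argument accepts. The sketch below is an illustration in base R under that reading, not the package's internal code; beta, T.max, and beta.mat are hypothetical names and values.

```r
# Hypothetical constant coefficients for three covariates.
beta  <- c(0.5, -0.2, 0.1)
T.max <- 100                 # plays the role of the T argument
time  <- seq_len(T.max)

# One row per time point, one column per covariate.
beta.mat <- matrix(beta, nrow = T.max, ncol = length(beta), byrow = TRUE)

# Only the first coefficient varies: it is scaled by log(time).
beta.mat[, 1] <- beta[1] * log(time)

head(beta.mat, 3)
```

Because log(1) is 0, the first coefficient has no effect at time 1 under this construction and grows in magnitude as time increases.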

If fixed.hazard=TRUE, one baseline hazard is generated and the same function is used to generate all of the simulated datasets. If fixed.hazard=FALSE (the default), a new hazard function is generated with each simulation iteration.

The flexible-hazard method employed when hazard.fun is NULL generates a unique baseline hazard by fitting a curve to randomly-drawn points. This produces a wide variety of shapes for the baseline hazard, including those that are unimodal, multimodal, monotonically increasing or decreasing, and many other shapes. The method then generates a density function based on each baseline hazard and draws durations from it in a way that circumvents the need to calculate the inverse cumulative baseline hazard. Because the shape of the baseline hazard can vary considerably, this approach matches the Cox model's inherent flexibility and better corresponds to the assumed data generating process (DGP) of the Cox model. Moreover, repeating this process over many iterations in a simulation produces simulated samples of data that better reflect the considerable heterogeneity in data used by applied researchers. This increases the generalizability of the simulation results. See Harden and Kropko (2018) for more detail.
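
The general idea of fitting a curve to random points and then sampling durations from the implied density can be sketched in base R. The version below is a simplified illustration and not the package's implementation: it assumes the random points describe a cumulative failure curve, uses a monotone Hyman spline from splinefun() to interpolate them, and ignores covariates entirely. All object names are hypothetical.

```r
set.seed(1)
T.max <- 100   # latest failure time (the role of the T argument)
knots <- 8     # number of randomly drawn points

# Random points on a cumulative failure curve, anchored at (0, 0) and (T.max, 1).
knot.time <- c(0, sort(runif(knots, 0, T.max)), T.max)
knot.cdf  <- c(0, sort(runif(knots, 0, 1)), 1)

# Monotone interpolating spline through the points (assumption: Hyman filter).
cdf.fun <- splinefun(knot.time, knot.cdf, method = "hyman")
cdf     <- cdf.fun(0:T.max)

# Discrete failure PDF on 1..T.max, survivor function, and hazard.
pdf      <- pmax(diff(cdf), 0)         # guard against tiny negative rounding error
pdf      <- pdf / sum(pdf)
survivor <- 1 - cumsum(pdf)            # P(duration > t)
at.risk  <- c(1, survivor[-T.max])     # P(duration >= t)
hazard   <- pdf / at.risk              # P(fail at t | survived to t)

# Draw durations directly from the PDF; no inverse cumulative hazard is needed.
durations <- sample(seq_len(T.max), size = 1000, replace = TRUE, prob = pdf)
```

Rerunning with a different seed gives a differently shaped hazard, which is the source of the heterogeneity across simulated data sets described above.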

To generate a marginal effect, the user first specifies a covariate by entering its column number in the X matrix as the covariate argument, then specifies the high and low values at which to fix this covariate. The function calculates the difference in expected duration for each observation when fixing the covariate to the high and low values. If compare is median, the function reports the median of these differences, and if compare is mean, the function reports the mean of these differences. Any function that takes a vector as input and outputs a scalar may be used.
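
The role of compare can be shown with a small base R sketch. The two duration vectors below are made-up stand-ins for the expected durations under the low and high conditions, and marg.effect() is a hypothetical helper written for this illustration, not a function from the package.

```r
set.seed(7)
n <- 200

# Hypothetical expected durations with the covariate fixed at low and high.
dur.low  <- rexp(n, rate = 1 / 40)
dur.high <- dur.low + rnorm(n, mean = 5, sd = 2)

# One difference per observation.
differences <- dur.high - dur.low

# compare can be any function that maps a vector to a scalar.
marg.effect <- function(d, compare = median) compare(d)

marg.effect(differences)                  # median of the differences
marg.effect(differences, compare = mean)  # mean of the differences
marg.effect(differences,
            compare = function(d) quantile(d, 0.9, names = FALSE))
```

The last call shows that a custom summary, here the 90th percentile, works as long as it returns a single number.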

If censor.cond is FALSE then a proportion of the observations specified by censor is randomly and uniformly selected to be right-censored. If censor.cond is TRUE then censoring depends on the covariates as follows: new coefficients are drawn from normal distributions with mean 0 and standard deviation of 0.1, and these new coefficients are used to create a new linear predictor using the X matrix. The observations with the largest (100 x censor) percent of the linear predictors are designated as right-censored.
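
The two censoring rules can be compared side by side in base R. This is a sketch of the logic as described above, not the package's code; the covariate matrix, the object names, and the use of rank() to pick the largest linear predictors are assumptions made for the illustration.

```r
set.seed(11)
n      <- 1000
censor <- 0.1
n.cens <- round(censor * n)

# Hypothetical covariates, drawn the way the defaults describe (mean 0, sd 0.5).
X <- matrix(rnorm(n * 3, mean = 0, sd = 0.5), ncol = 3)

# censor.cond = FALSE: a uniformly random subset is right-censored.
cens.random <- rep(FALSE, n)
cens.random[sample(n, size = n.cens)] <- TRUE

# censor.cond = TRUE: draw new coefficients, build a new linear predictor,
# and censor the observations with the largest values.
gamma     <- rnorm(ncol(X), mean = 0, sd = 0.1)
lin.pred  <- as.vector(X %*% gamma)
cens.cond <- rank(-lin.pred, ties.method = "first") <= n.cens

table(random = cens.random, conditional = cens.cond)
```

Both rules censor the same share of observations, but under the conditional rule the censored cases are systematically related to the covariates.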

References

Harden, J. J. and Kropko, J. (2018). Simulating Duration Data for the Cox Model. Political Science Research and Methods https://doi.org/10.1017/psrm.2018.19

Sylvestre, M.-P. and Abrahamowicz, M. (2008). Comparison of algorithms to generate event times conditional on time-dependent covariates. Statistics in Medicine 27(14):2618–34.

Examples

# NOT RUN {
simdata <- sim.survdata(N=1000, T=100, num.data.frames=2)
require(survival)
data <- simdata[[1]]$data
model <- coxph(Surv(y, failed) ~ X1 + X2 + X3, data=data)
model$coefficients ## model-estimated coefficients
simdata[[1]]$betas ## "true" coefficients

## User-specified baseline hazard
my.hazard <- function(t){ # log-normal hazard: median of 50, log-scale sd of log(10)
     dnorm((log(t) - log(50))/log(10)) /
          (log(10)*t*(1 - pnorm((log(t) - log(50))/log(10))))
}
simdata <- sim.survdata(N=1000, T=100, hazard.fun = my.hazard)

# }
# NOT RUN {
## A simulated data set with time-varying covariates
simdata <- sim.survdata(N=1000, T=100, type="tvc", xvars=5, num.data.frames=1)
summary(simdata$data)
model <- coxph(Surv(start, end, failed) ~ X1 + X2 + X3 + X4 + X5, data=simdata$data)
model$coefficients ## model-estimated coefficients
simdata$betas ## "true" coefficients
# }
# NOT RUN {
## A simulated data set with time-varying coefficients
simdata <- sim.survdata(N=1000, T=100, type="tvbeta", num.data.frames = 1)
simdata$betas
# }
