sim.survdata() randomly generates data frames containing a user-specified number
of observations, time points, and covariates. It generates durations, a variable indicating
whether each observation is right-censored, and "true" marginal effects.
It can accept user-specified coefficients, covariates, and baseline hazard functions, and it can
output data with time-varying covariates or using time-varying coefficients.
sim.survdata(N = 1000, T = 100, type = "none", hazard.fun = NULL,
num.data.frames = 1, fixed.hazard = FALSE, knots = 8,
spline = TRUE, X = NULL, beta = NULL, xvars = 3, mu = 0,
sd = 0.5, covariate = 1, low = 0, high = 1, compare = median,
censor = 0.1, censor.cond = FALSE)
N |
Number of observations in each generated data frame. Ignored if X is not NULL |
T |
The latest time point during which an observation may fail. Failures can occur as early as 1 and as late as T |
type |
If "none" (the default), data are generated with no time-varying covariates or coefficients. If "tvc", data are generated with time-varying covariates, and if "tvbeta", data are generated with time-varying coefficients (see details) |
hazard.fun |
A user-specified R function with one argument, representing time, that outputs the baseline hazard. If NULL, a baseline hazard function is generated using the flexible-hazard method as described in Harden and Kropko (2018) (see details) |
num.data.frames |
The number of data frames to be generated |
fixed.hazard |
If TRUE, the same hazard function is used to generate each data frame. If FALSE (the default), a different hazard function is drawn for each data frame. Ignored if hazard.fun is not NULL or if num.data.frames is 1 |
knots |
The number of points to draw while using the flexible-hazard method to generate hazard functions (default is 8). Ignored if hazard.fun is not NULL |
spline |
If TRUE (the default), a spline is employed to smooth the generated cumulative baseline hazard, and if FALSE the cumulative baseline hazard is specified as a step function with steps at the knots. Ignored if hazard.fun is not NULL |
X |
A user-specified data frame containing the covariates that condition duration. If NULL, covariates are generated from normal distributions with means given by the mu argument and standard deviations given by the sd argument |
beta |
Either a user-specified vector containing the coefficients for the linear part of the duration model, or a user-specified matrix with T rows for pre-specified time-varying coefficients. If NULL, coefficients are generated from normal distributions with means of 0 and standard deviations of 0.1 |
xvars |
The number of covariates to generate. Ignored if X is not NULL |
mu |
If scalar, all covariates are generated to have means equal to this scalar. If a vector, it specifies the mean of each covariate separately, and it must be equal in length to xvars. Ignored if X is not NULL |
sd |
If scalar, all covariates are generated to have standard deviations equal to this scalar. If a vector, it specifies the standard deviation of each covariate separately, and it must be equal in length to xvars. Ignored if X is not NULL |
covariate |
The column number of the covariate in the X matrix for which to generate a simulated marginal effect (default is 1). The marginal effect is the difference between the expected duration when the covariate is fixed at a high value and the expected duration when the covariate is fixed at a low value |
low |
The low value of the covariate for which to calculate a marginal effect |
high |
The high value of the covariate for which to calculate a marginal effect |
compare |
The statistic to employ when examining the two new vectors of expected durations (see details). The default is median |
censor |
The proportion of observations to designate as being right-censored |
censor.cond |
Whether to make right-censoring conditional on the covariates (default is FALSE, but see details) |
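The defaults described for X, beta, mu, and sd can be illustrated with a short base-R sketch. This is a stand-in written for illustration, not the package's internal code: the helper name draw.covariates is hypothetical, and the column names X1, X2, ... are assumed from the examples on this page.

```r
# Hypothetical helper (not part of the package): draw an N x xvars set of
# covariates from normal distributions, recycling scalar mu and sd.
draw.covariates <- function(N, xvars, mu = 0, sd = 0.5) {
  if (length(mu) == 1) mu <- rep(mu, xvars)
  if (length(sd) == 1) sd <- rep(sd, xvars)
  stopifnot(length(mu) == xvars, length(sd) == xvars)
  # Fill column by column: column j uses mean mu[j] and sd sd[j]
  X <- matrix(rnorm(N * xvars,
                    mean = rep(mu, each = N),
                    sd = rep(sd, each = N)),
              nrow = N)
  colnames(X) <- paste0("X", seq_len(xvars))
  as.data.frame(X)
}

set.seed(123)
X <- draw.covariates(N = 1000, xvars = 3, mu = c(0, 1, 2), sd = 0.5)

# Coefficients drawn as described for beta = NULL: mean 0, sd 0.1
beta <- rnorm(ncol(X), mean = 0, sd = 0.1)
```

A vector mu or sd whose length differs from xvars stops with an error here, mirroring the length requirement stated above.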
Returns an object of class "simSurvdata", which is a list of length num.data.frames with one element for each iteration of data simulation.
Each element of this list is itself a list with the following components:
data |
The simulated data frame, including the simulated durations, the censoring variable, and covariates |
xdata |
The simulated data frame, containing only covariates |
baseline |
A data frame containing every potential failure time and the baseline failure PDF, baseline failure CDF, baseline survivor function, and baseline hazard function at each time point. |
xb |
The linear predictor for each observation |
exp.xb |
The exponentiated linear predictor for each observation |
betas |
The coefficients, varying over time if type is "tvbeta" |
ind.survive |
An (N x T) matrix containing the individual survivor function at time t for the individual represented by row n |
marg.effect |
The simulated marginal change in expected duration comparing the high and low values of
the variable specified with covariate |
marg.effect.data |
The X matrix and vector of durations for the low and high conditions |
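To show how xb, exp.xb, and ind.survive relate to one another, here is a minimal sketch under the proportional hazards assumption, where the individual survivor function is the baseline survivor function raised to the power exp(xb). The baseline survivor function below is a toy assumption, and the variable names simply echo the component names above; this is not output from the function.

```r
set.seed(42)
N <- 5
n.time <- 10  # plays the role of T
X <- matrix(rnorm(N * 2, sd = 0.5), nrow = N)
beta <- c(0.2, -0.1)

# Toy baseline survivor function (an assumption for illustration only)
base.survivor <- exp(-0.05 * (1:n.time))

xb <- as.vector(X %*% beta)  # linear predictor
exp.xb <- exp(xb)            # exponentiated linear predictor

# Row n, column t holds S_n(t) = S_0(t)^exp(xb_n)
ind.survive <- outer(exp.xb, base.survivor, function(e, s) s^e)
```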
The sim.survdata function generates simulated duration data. It can accept a user-supplied hazard function, or else it uses the flexible-hazard method described in Harden and Kropko (2018) to generate a hazard that does not necessarily conform to any parametric hazard function. It can generate data with time-varying covariates or coefficients. For time-varying covariates (type="tvc"), it employs the permutational algorithm of Sylvestre and Abrahamowicz (2008). For time-varying coefficients (type="tvbeta"), the first beta coefficient, whether supplied by the user or generated by the function, is multiplied by the natural log of the failure time under consideration.
If fixed.hazard=TRUE, one baseline hazard is generated and the same function is used to generate all of the simulated datasets. If fixed.hazard=FALSE (the default), a new hazard function is generated with each simulation iteration.
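The log-time scaling used for time-varying coefficients can be sketched as a matrix with one row per time point, which is also the shape described for a user-supplied beta matrix. The coefficient values and the names n.time and beta.tv are assumptions chosen for illustration.

```r
n.time <- 100                 # plays the role of T
beta <- c(0.5, -0.2, 0.1)     # illustrative coefficients

# One row per time point, one column per coefficient
beta.tv <- matrix(beta, nrow = n.time, ncol = length(beta), byrow = TRUE)

# Only the first coefficient varies: it is multiplied by log(t)
beta.tv[, 1] <- beta[1] * log(1:n.time)
```

Because log(1) is 0, the first coefficient is zero at the first time point and grows in magnitude as time increases.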
The flexible-hazard method employed when hazard.fun
is NULL
generates a unique baseline hazard by fitting a curve to
randomly-drawn points. This produces a wide variety
of shapes for the baseline hazard, including those that are unimodal, multimodal, monotonically increasing or decreasing, and many other
shapes. The method then generates a density function based on each baseline hazard and draws durations from it in a way that circumvents
the need to calculate the inverse cumulative baseline hazard. Because the shape of the baseline hazard can vary considerably, this approach
matches the Cox model's inherent flexibility and better corresponds to the assumed data generating process (DGP) of the Cox model. Moreover,
repeating this process over many iterations in a simulation produces simulated samples of data that better reflect the considerable
heterogeneity in data used by applied researchers. This increases the generalizability of the simulation results. See Harden and Kropko (2018)
for more detail.
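The idea of fitting a curve to randomly drawn points and then sampling durations from the implied density can be sketched in base R. This is a simplified illustration, not the package's implementation: it assumes the random points are placed in (time, cumulative failure probability) space and uses a monotone spline from stats::splinefun, and all variable names are hypothetical.

```r
set.seed(2018)
n.time <- 100  # plays the role of T
knots <- 8

# Random points, anchored at (0, 0) and (n.time, 1)
t.knots <- c(0, sort(sample(1:(n.time - 1), knots)), n.time)
cdf.knots <- c(0, sort(runif(knots)), 1)

# Monotone spline through the points (the Hyman filter prevents decreases)
cdf.fun <- splinefun(t.knots, cdf.knots, method = "hyman")
failure.cdf <- cdf.fun(1:n.time)

# Discrete failure density, guarded against rounding error and renormalized
failure.pdf <- pmax(diff(c(0, failure.cdf)), 0)
failure.pdf <- failure.pdf / sum(failure.pdf)

survivor <- 1 - cumsum(failure.pdf)   # P(duration > t)
at.risk <- c(1, survivor[-n.time])    # P(duration >= t)
hazard <- failure.pdf / at.risk       # discrete-time baseline hazard

# Draw durations directly from the density; no inverse function is needed
y <- sample(1:n.time, size = 1000, replace = TRUE, prob = failure.pdf)
```

Changing the seed yields a differently shaped hazard each time, which is the behavior described for fixed.hazard=FALSE.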
When generating a marginal effect, the user first specifies a covariate by passing its column number in the X matrix to the covariate argument, then specifies the high and low values at which to fix this covariate. The function calculates the difference in expected duration for each observation when fixing the covariate to the high and low values. If compare is median, the function reports the median of these differences, and if compare is mean, the function reports the mean of these differences. Any function that takes a vector as input and outputs a scalar may be used.
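The mechanics of the marginal effect (fix one column at the low and high values, difference the expected durations, then summarize with compare) can be sketched as follows. The function expected.duration is a hypothetical stand-in with an arbitrary functional form; it is not how sim.survdata obtains expected durations, and marginal.effect is likewise an illustrative name.

```r
set.seed(7)
N <- 500
X <- matrix(rnorm(N * 3, sd = 0.5), nrow = N)
beta <- c(0.3, -0.2, 0.1)

# Hypothetical stand-in: expected duration as a simple function of X and beta
expected.duration <- function(X, beta) as.vector(50 * exp(-(X %*% beta)))

marginal.effect <- function(X, beta, covariate = 1, low = 0, high = 1,
                            compare = median) {
  X.low <- X.high <- X
  X.low[, covariate] <- low     # fix the chosen column at the low value
  X.high[, covariate] <- high   # fix the chosen column at the high value
  diffs <- expected.duration(X.high, beta) - expected.duration(X.low, beta)
  compare(diffs)                # any vector -> scalar function works
}

me.median <- marginal.effect(X, beta, compare = median)
me.mean <- marginal.effect(X, beta, compare = mean)
```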
If censor.cond is FALSE, then a proportion of the observations specified by censor is randomly and uniformly selected to be right-censored. If censor.cond is TRUE, then censoring depends on the covariates as follows: new coefficients are drawn from normal distributions with mean 0 and standard deviation of 0.1, and these new coefficients are used to create a new linear predictor using the X matrix. The observations with the largest (100 x censor) percent of the linear predictors are designated as right-censored.
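The two censoring rules can be compared with a short base-R sketch. It follows the description above rather than the package source, and the variable names (gamma, xg, censored.random, censored.cond) are assumptions made for illustration.

```r
set.seed(99)
N <- 1000
censor <- 0.1
X <- matrix(rnorm(N * 3, sd = 0.5), nrow = N)
n.censored <- round(censor * N)

# censor.cond = FALSE: choose observations uniformly at random
censored.random <- seq_len(N) %in% sample(N, size = n.censored)

# censor.cond = TRUE: draw new coefficients, build a new linear predictor,
# and censor the observations with the largest values
gamma <- rnorm(ncol(X), mean = 0, sd = 0.1)
xg <- as.vector(X %*% gamma)
censored.cond <- rank(-xg, ties.method = "first") <= n.censored
```

Both rules censor the same number of observations. Under the conditional rule, which observations are censored is determined by the covariates.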
Harden, J. J. and Kropko, J. (2018). Simulating Duration Data for the Cox Model. Political Science Research and Methods. https://doi.org/10.1017/psrm.2018.19
Sylvestre, M.-P. and Abrahamowicz, M. (2008). Comparison of algorithms to generate event times conditional on time-dependent covariates. Statistics in Medicine 27(14):2618–34.
# NOT RUN {
library(coxed)     # provides sim.survdata()
library(survival)  # provides coxph() and Surv()
simdata <- sim.survdata(N=1000, T=100, num.data.frames=2)
data <- simdata[[1]]$data
model <- coxph(Surv(y, failed) ~ X1 + X2 + X3, data=data)
model$coefficients ## model-estimated coefficients
simdata[[1]]$betas ## "true" coefficients
## User-specified baseline hazard
my.hazard <- function(t){ #lognormal with mean of 50, sd of 10
dnorm((log(t) - log(50))/log(10)) /
(log(10)*t*(1 - pnorm((log(t) - log(50))/log(10))))
}
simdata <- sim.survdata(N=1000, T=100, hazard.fun = my.hazard)
# }
# NOT RUN {
## A simulated data set with time-varying covariates
simdata <- sim.survdata(N=1000, T=100, type="tvc", xvars=5, num.data.frames=1)
summary(simdata$data)
model <- coxph(Surv(start, end, failed) ~ X1 + X2 + X3 + X4 + X5, data=simdata$data)
model$coefficients ## model-estimated coefficients
simdata$betas ## "true" coefficients
# }
# NOT RUN {
## A simulated data set with time-varying coefficients
simdata <- sim.survdata(N=1000, T=100, type="tvbeta", num.data.frames = 1)
simdata$betas
# }