RealSurvSim
RealSurvSim is an R package that provides several methods for simulating survival (time-to-event) datasets, intended for research and simulation studies in survival analysis. It includes parametric, non-parametric (kernel density estimation), and bootstrap-based approaches for generating realistic time-to-event data.
Features
- Parametric Simulation: Fit a distribution (e.g., exponential, Weibull, log-logistic, mixture distributions) to existing data and generate new samples from the fitted distribution.
- Kernel Density Simulation: Non-parametric simulation via kernel density estimation, using an accept-reject approach.
- Bootstrap Methods:
  - Conditional Bootstrap (cond): Splits event and censoring times, then resamples to preserve the observed event/censoring ratio.
  - Case Resampling (case): Simple random resampling of entire observations with replacement.
- Flexible Group/Strata Handling: Simulate data separately by group while preserving group sizes or allowing user-specified sample sizes.
Installation
1. From Source
If you have downloaded or cloned this repository:
# Install devtools if you don't already have it
install.packages("devtools")
# Then, from the root of the package directory:
devtools::install()
Dependencies
RealSurvSim relies on several R packages for density estimation, distribution fitting, and survival analysis. They are installed automatically (if not already present) when you install RealSurvSim. Key dependencies include:
- kdensity (for kernel density estimation)
- fitdistrplus (for fitting various distributions to data)
- flexsurv (for Gompertz and other survival distributions)
- univariateML (for maximum-likelihood estimation of some distributions, e.g., inverse gamma)
- actuar (for distributions like log-logistic and inverse gamma)
- survival (core survival analysis functionality)
Usage
Below is an overview of the core functions and some example usages. For detailed information on parameters and return values, refer to the function documentation.
Core Functions
data_simul_KDE(orig_vals, n = NULL, kernel = "gaussian")
Simulates data via kernel density estimation from a numeric vector of original values.
- Parameters:
  - orig_vals: Numeric vector of original data values.
  - n: Number of observations to simulate (defaults to the length of orig_vals).
  - kernel: The kernel to use for KDE (currently supports "gaussian").
- Returns: A numeric vector of simulated values.
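As a rough illustration of the accept-reject idea behind KDE-based simulation, the sketch below uses only base R (density() and approxfun()). The helper kde_accept_reject is hypothetical and is not the package's implementation; it assumes a uniform proposal over the estimated support and truncates the density at zero because survival times are non-negative.

```r
# Sketch only: accept-reject sampling from a Gaussian kernel density estimate.
# kde_accept_reject is a hypothetical helper, not part of RealSurvSim.
kde_accept_reject <- function(orig_vals, n = length(orig_vals)) {
  dens <- density(orig_vals, kernel = "gaussian")        # KDE evaluated on a grid
  f <- approxfun(dens$x, dens$y, yleft = 0, yright = 0)  # interpolated density
  lower <- max(min(dens$x), 0)  # assumption: truncate at 0 (times are non-negative)
  upper <- max(dens$x)
  f_max <- max(dens$y)          # envelope height for the uniform proposal
  out <- numeric(0)
  while (length(out) < n) {
    cand <- runif(n, lower, upper)     # propose uniformly over the support
    u <- runif(n, 0, f_max)            # uniform height under the envelope
    out <- c(out, cand[u <= f(cand)])  # keep candidates that fall under the curve
  }
  out[seq_len(n)]
}

set.seed(42)
orig <- rexp(200, rate = 0.1)
sim <- kde_accept_reject(orig, n = 500)
summary(sim)
```

A uniform proposal is simple but can be inefficient for long-tailed data, since many candidates in the tail are rejected.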
data_simul_Estim(orig_vals, n = NULL, distrib = "exp")
Fits a specified parametric distribution to orig_vals and draws new samples from the fitted distribution.
- Supported distributions include: "inverse_gamma", "gompertz", "llogis", "gumbel", "myMix", "exp".
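The fit-then-draw pattern can be sketched in base R for the exponential case, where the maximum-likelihood rate has a closed form. The helper fit_and_draw_exp is hypothetical and only illustrates the idea; it treats every value as an observed time (no censoring adjustment) and may differ from how the package fits its distributions.

```r
# Sketch only: fit an exponential distribution and draw new samples from it.
# fit_and_draw_exp is a hypothetical helper, not part of RealSurvSim.
fit_and_draw_exp <- function(orig_vals, n = length(orig_vals)) {
  rate_hat <- 1 / mean(orig_vals)  # closed-form MLE of the exponential rate
  list(rate = rate_hat, sample = rexp(n, rate = rate_hat))
}

set.seed(7)
orig <- rexp(500, rate = 0.1)
res <- fit_and_draw_exp(orig, n = 1000)
res$rate           # estimated rate
mean(res$sample)   # mean of the simulated times
```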
data_simul_bootstr(dat, n = NULL, type = "cond")
Bootstrap-based simulation of event and censoring times.
- Parameters:
  - dat: Dataframe containing at least V1 (time) and V2 (censor indicator, 0/1).
  - n: Number of observations to sample. Defaults to the same size as dat.
  - type: "cond" for conditional bootstrap or "case" for case-resampling.
- Returns: A resampled or reconstructed dataframe containing simulated times and censor indicators.
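To make the difference between the two bootstrap types concrete, here is a minimal base R sketch. Both helpers (case_resample, cond_resample) are hypothetical. The conditional version is one reading of "preserve the observed event/censoring ratio": it resamples events and censored observations separately in their observed proportions, and assumes both kinds are present. The package's actual algorithm may differ.

```r
# Sketch only: two bootstrap flavours on a data.frame with V1 (time), V2 (0/1).
# Both helpers are hypothetical, not part of RealSurvSim.

# Case resampling: draw whole rows with replacement
case_resample <- function(dat, n = nrow(dat)) {
  dat[sample.int(nrow(dat), n, replace = TRUE), , drop = FALSE]
}

# Conditional resampling (assumed reading): resample events and censored
# rows separately so the event share matches the original data
cond_resample <- function(dat, n = nrow(dat)) {
  events   <- dat[dat$V2 == 1, , drop = FALSE]
  censored <- dat[dat$V2 == 0, , drop = FALSE]
  n_event <- round(n * nrow(events) / nrow(dat))
  n_cens  <- n - n_event
  rbind(
    events[sample.int(nrow(events), n_event, replace = TRUE), , drop = FALSE],
    censored[sample.int(nrow(censored), n_cens, replace = TRUE), , drop = FALSE]
  )
}

set.seed(11)
toy <- data.frame(V1 = rexp(50, rate = 0.1), V2 = rep(c(1, 0), times = c(35, 15)))
boot_case <- case_resample(toy)
boot_cond <- cond_resample(toy, n = 100)
mean(boot_case$V2)  # event share varies from draw to draw
mean(boot_cond$V2)  # event share fixed by construction
```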
RealSurvSim(dat, col_time, col_status, col_group, reps = 10000, random_seed = 123, n = NULL, simul_type, distribs = c("exp", "exp", "exp", "exp"))
The main wrapper function for simulating multiple survival datasets using one of four approaches:
- "cond": Conditional bootstrap
- "case": Case resampling
- "distr": Parametric distribution-based simulation
- "KDE": Kernel density estimation-based simulation
Parameters:
- dat: Original (or reconstructed) dataset with time, status, and group columns.
- col_time: Column name/index for time.
- col_status: Column name/index for censoring indicator (1=event, 0=censored).
- col_group: Column name/index for treatment/group identifier.
- reps: Number of datasets to simulate (default 10,000).
- random_seed: Random seed (default 123) for reproducibility.
- n: Vector specifying sample sizes per group (optional).
- simul_type: Single string specifying the simulation method ("cond", "case", "distr", "KDE").
- distribs: Which distributions to use if simul_type = "distr".
Returns:
A list containing multiple simulated datasets (one for each repetition). Each dataset is a data.frame with columns V1 (time), V2 (status), and V3 (group).
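A typical next step is to run the same analysis on every simulated dataset. The sketch below shows that pattern on a mock list of data.frames with the documented V1/V2/V3 columns. The mock data and the object name mock_sims are assumptions for illustration only; with real output, substitute the list returned by the wrapper (the examples in this README access it via $datasets). It uses survdiff() and Surv() from the survival package.

```r
# Sketch only: apply a log-rank test to each dataset in a list of replicates.
library(survival)

# Mock stand-in for the simulated datasets (an assumption, not package output)
set.seed(123)
mock_sims <- lapply(seq_len(20), function(i) {
  data.frame(
    V1 = rexp(80, rate = 0.1),              # time
    V2 = rbinom(80, size = 1, prob = 0.7),  # status: 1 = event, 0 = censored
    V3 = rep(0:1, each = 40)                # group
  )
})

# Log-rank p-value for each replicate (two groups, so 1 degree of freedom)
logrank_p <- vapply(mock_sims, function(d) {
  fit <- survdiff(Surv(V1, V2) ~ V3, data = d)
  pchisq(fit$chisq, df = 1, lower.tail = FALSE)
}, numeric(1))

mean(logrank_p < 0.05)  # share of replicates rejecting at the 5% level
```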
Examples
Below are brief examples demonstrating how to simulate data. In practice, replace the placeholders (example_data, "time", etc.) with your actual dataset and column names.
library(RealSurvSim)
# Example dataset construction (for demonstration):
set.seed(123)
example_data <- data.frame(
  time = rexp(100, rate = 0.1),               # Times
  status = sample(0:1, 100, replace = TRUE),  # 0=censored, 1=event
  group = sample(0:1, 100, replace = TRUE)    # Two groups, 0 or 1
)
# 1. Kernel Density Estimation Simulation
sim_kde <- RealSurvSim(
  dat = example_data,
  col_time = "time",
  col_status = "status",
  col_group = "group",
  reps = 5,            # Simulate 5 datasets
  simul_type = "KDE"   # Use KDE-based simulation
)
str(sim_kde$datasets)  # Check the structure of generated datasets
# 2. Parametric Distribution Simulation
sim_distr <- RealSurvSim(
  dat = example_data,
  col_time = "time",
  col_status = "status",
  col_group = "group",
  reps = 5,
  simul_type = "distr",
  distribs = c("exp", "exp", "exp", "exp")
)
str(sim_distr$datasets)
# 3. Conditional Bootstrap
sim_cond <- RealSurvSim(
  dat = example_data,
  col_time = "time",
  col_status = "status",
  col_group = "group",
  reps = 5,
  simul_type = "cond"
)
str(sim_cond$datasets)
# 4. Case Resampling
sim_case <- RealSurvSim(
  dat = example_data,
  col_time = "time",
  col_status = "status",
  col_group = "group",
  reps = 5,
  simul_type = "case"
)
str(sim_case$datasets)
data(liang)
data(wu)
# 5. liang_kde <- RealSurvSim(liang, liang$V1, liang$V2, liang$V3, reps = 3, simul_type = "KDE")
# For arbitrary n
# 6. arbliang_distr <- RealSurvSim(liang, liang$V1, liang$V2, liang$V3, reps = 10, n = c(40, 50), simul_type = "distr", distribs = c("exp", "llogis", "llogis", "exp"))
# 7. arbwu_case <- RealSurvSim(wu, wu$V1, wu$V2, wu$V3, reps = 100, n = c(40, 50), simul_type = "case")
References and Further Reading
Underlying Paper for the Package
Analysis and Methods for Survival Data (arXiv:2308.07842)
Data Reconstruction Algorithm
Guyot et al. (2012), describing the algorithm for reconstructing survival data from published Kaplan-Meier curves.
WebPlotDigitizer
WebPlotDigitizer for extracting data points from Kaplan-Meier curves.