RealSurvSim

RealSurvSim is an R package for simulating survival (time-to-event) datasets, intended for survival analysis research and simulation studies. It offers three families of simulation approaches for generating realistic time-to-event data: parametric, non-parametric (kernel density estimation), and bootstrap-based.

Features

Parametric Simulation: Fit a distribution (e.g., exponential, Weibull, log-logistic, mixture distributions) to existing data and generate new samples from the fitted distribution.
Kernel Density Simulation: Non-parametric simulation via kernel density estimation, using an accept-reject approach.
Bootstrap Methods:
- Conditional Bootstrap (cond): Splits event and censoring times, then resamples to preserve the observed event/censoring ratio.
- Case Resampling (case): Simple random resampling of entire observations with replacement.
Flexible Group/Strata Handling: Simulate data separately by group while preserving group sizes or allowing user-specified sample sizes.
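
The group handling described above can be pictured with a small base-R sketch. This is not the package's implementation: simulate_by_group and resample_rows are hypothetical names, plain row resampling stands in for whichever simulation method is chosen, and the data are assumed to use the columns V1 (time), V2 (status), and V3 (group).

```r
# Hypothetical stand-in for any per-group simulator:
# resample n rows of d with replacement.
resample_rows <- function(d, n) {
  d[sample.int(nrow(d), n, replace = TRUE), , drop = FALSE]
}

# Simulate each group separately. If n is NULL, the observed group sizes are
# kept; otherwise n holds one sample size per group, in sorted group order.
simulate_by_group <- function(dat, n = NULL, simulator = resample_rows) {
  groups <- split(dat, dat$V3)
  if (is.null(n)) n <- vapply(groups, nrow, integer(1))
  stopifnot(length(n) == length(groups))
  out <- do.call(rbind, Map(simulator, groups, n))
  rownames(out) <- NULL
  out
}

set.seed(42)
toy <- data.frame(
  V1 = rexp(30, rate = 0.1),          # times
  V2 = rbinom(30, 1, 0.7),            # 1 = event, 0 = censored
  V3 = rep(0:1, times = c(12, 18))    # two groups of unequal size
)

same_sizes  <- simulate_by_group(toy)                # keeps 12 and 18
other_sizes <- simulate_by_group(toy, n = c(40, 50)) # user-specified sizes
table(other_sizes$V3)
```

Splitting first and simulating within each group means that no simulated observation borrows information from the other group.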

Installation

1. From Source

If you have downloaded or cloned this repository:

# Install devtools if you don't already have it
install.packages("devtools")

# Then, from the root of the package directory, install the local source:
devtools::install()

Dependencies

This package uses several R libraries for density estimation, distribution fitting, and survival analysis. They will be automatically installed (if not already present) when installing RealSurvSim. Key dependencies include:

kdensity (for kernel density estimation)
fitdistrplus (for fitting various distributions to data)
flexsurv (for Gompertz and other survival distributions)
univariateML (for maximum-likelihood estimation of some distributions, e.g., inverse gamma)
actuar (for distributions like log-logistic and inverse gamma)
survival (core survival analysis functionality)
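
If installation fails, one way to see which of the packages listed above are missing is a quick check like the following. It is only a convenience sketch and uses nothing beyond base R's requireNamespace(); the variable names are illustrative.

```r
# Packages named in the Dependencies list above
deps <- c("kdensity", "fitdistrplus", "flexsurv",
          "univariateML", "actuar", "survival")

# TRUE for each package whose namespace can be loaded
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)
missing_deps <- deps[!installed]

if (length(missing_deps) > 0) {
  message("Not installed: ", paste(missing_deps, collapse = ", "))
} else {
  message("All listed dependencies are installed.")
}
```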

Usage

Below is an overview of the core functions and some example usages. For detailed information on parameters and return values, refer to the function documentation.

Core Functions

data_simul_KDE(orig_vals, n = NULL, kernel = "gaussian")
Simulates data via kernel density estimation from a numeric vector of original values.
- Parameters:
  - orig_vals: Numeric vector of original data values.
  - n: Number of observations to simulate (defaults to the length of orig_vals).
  - kernel: The kernel to use for KDE (currently supports "gaussian").
- Returns: A numeric vector of simulated values.
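
To make the accept-reject idea concrete, here is a minimal base-R sketch. It is an illustration under stated assumptions, not the package's code: it uses stats::density() rather than kdensity, a uniform proposal over the estimated support, and the hypothetical name simulate_kde_sketch.

```r
simulate_kde_sketch <- function(orig_vals, n = length(orig_vals)) {
  # Gaussian kernel density estimate, evaluated on a grid starting at 0
  dens  <- density(orig_vals, kernel = "gaussian", from = 0)
  # Linear interpolation of the estimate; zero outside the grid
  f_hat <- approxfun(dens$x, dens$y, yleft = 0, yright = 0)

  lower <- min(dens$x)
  upper <- max(dens$x)
  f_max <- max(dens$y)  # envelope height for the uniform proposal

  out <- numeric(0)
  while (length(out) < n) {
    x <- runif(n, lower, upper)      # candidate values
    u <- runif(n, 0, f_max)          # uniform heights under the envelope
    out <- c(out, x[u <= f_hat(x)])  # keep candidates that fall under f_hat
  }
  out[seq_len(n)]
}

set.seed(1)
orig_vals <- rexp(200, rate = 0.1)
sim_vals  <- simulate_kde_sketch(orig_vals, n = 150)
summary(sim_vals)
```

Accept-reject only needs the density up to a constant, so cutting the estimate off at zero does not require renormalising it.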

data_simul_Estim(orig_vals, n = NULL, distrib = "exp")
Fits a specified parametric distribution to orig_vals and draws new samples from the fitted distribution.
- Supported distributions include: "inverse_gamma", "gompertz", "llogis", "gumbel", "myMix", "exp".
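
The fit-then-sample pattern can be sketched for the exponential case using base R only. This is an assumption-laden illustration rather than the package's routine (which, per the dependency list, relies on packages such as fitdistrplus); fit_and_simulate_exp is a hypothetical name, and the values are treated as a plain numeric vector with no censoring adjustment.

```r
fit_and_simulate_exp <- function(orig_vals, n = length(orig_vals)) {
  # Closed-form maximum-likelihood estimate of the exponential rate
  rate_hat <- 1 / mean(orig_vals)
  # Draw new values from the fitted distribution
  list(rate = rate_hat, sim = rexp(n, rate = rate_hat))
}

set.seed(2)
orig_vals <- rexp(100, rate = 0.1)
fit <- fit_and_simulate_exp(orig_vals, n = 50)
fit$rate        # estimated rate
head(fit$sim)   # first simulated values
```

Other distributions follow the same two steps, with the closed-form estimate replaced by a numerical fit.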

data_simul_bootstr(dat, n = NULL, type = "cond")
Bootstrap-based simulation of event and censoring times.
- Parameters:
  - dat: Dataframe containing at least V1 (time) and V2 (censor indicator, 0/1).
  - n: Number of observations to sample. Defaults to the same size as dat.
  - type: "cond" for conditional bootstrap or "case" for case-resampling.
- Returns: A resampled or reconstructed dataframe containing simulated times and censor indicators.
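
The difference between the two bootstrap types can be illustrated with a short base-R sketch. Both functions below are hypothetical. In particular, ratio_preserving_resample is only one simple reading of "preserve the observed event/censoring ratio" (resampling events and censored observations separately), and the package's conditional bootstrap may work differently.

```r
# Case resampling: draw whole rows with replacement.
case_resample <- function(dat, n = nrow(dat)) {
  dat[sample.int(nrow(dat), n, replace = TRUE), , drop = FALSE]
}

# Resample events and censored rows separately so that the simulated data
# keep the observed event proportion (up to rounding).
# Assumes dat contains at least one event and one censored row.
ratio_preserving_resample <- function(dat, n = nrow(dat)) {
  events   <- dat[dat$V2 == 1, , drop = FALSE]
  censored <- dat[dat$V2 == 0, , drop = FALSE]
  n_event  <- round(n * nrow(events) / nrow(dat))
  rbind(
    events[sample.int(nrow(events), n_event, replace = TRUE), , drop = FALSE],
    censored[sample.int(nrow(censored), n - n_event, replace = TRUE), , drop = FALSE]
  )
}

set.seed(3)
toy <- data.frame(
  V1 = rexp(20, rate = 0.1),
  V2 = rep(c(1, 0), times = c(14, 6))  # 14 events, 6 censored
)

boot_case  <- case_resample(toy)
boot_ratio <- ratio_preserving_resample(toy, n = 50)
mean(boot_ratio$V2)  # event proportion in the resampled data
```

With case resampling the event proportion varies from draw to draw, whereas the second function fixes it at the observed value.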

RealSurvSim(dat, col_time, col_status, col_group, reps = 10000, random_seed = 123, n = NULL, simul_type, distribs = c("exp", "exp", "exp", "exp"))
The main wrapper function for simulating multiple survival datasets using one of four approaches:
- "cond": Conditional bootstrap
- "case": Case resampling
- "distr": Parametric distribution-based simulation
- "KDE": Kernel density estimation-based simulation
- Parameters:
  - dat: Original (or reconstructed) dataset with time, status, and group columns.
  - col_time: Column name/index for time.
  - col_status: Column name/index for censoring indicator (1=event, 0=censored).
  - col_group: Column name/index for treatment/group identifier.
  - reps: Number of datasets to simulate (default 10,000).
  - random_seed: Random seed (default 123) for reproducibility.
  - n: Vector specifying sample sizes per group (optional).
  - simul_type: Single string specifying the simulation method ("cond", "case", "distr", "KDE").
  - distribs: Which distributions to use if simul_type = "distr".
- Returns:
  A list containing multiple simulated datasets (one for each repetition). Each dataset is a data.frame with columns V1 (time), V2 (status), and V3 (group).
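
A typical next step in a simulation study is to apply the same analysis to every simulated dataset. The sketch below shows that pattern on a mock list of data frames with the columns V1, V2, and V3 described above. The list sims and the helper make_mock_dataset are stand-ins built here for illustration, not output of RealSurvSim(), and the log-rank test from the survival package is just one possible analysis.

```r
library(survival)

# Hypothetical stand-in for one simulated dataset (columns V1, V2, V3)
make_mock_dataset <- function(n = 80) {
  data.frame(
    V1 = rexp(n, rate = 0.1),        # time
    V2 = rbinom(n, 1, 0.7),          # status: 1 = event, 0 = censored
    V3 = rep(0:1, length.out = n)    # group
  )
}

set.seed(7)
sims <- replicate(20, make_mock_dataset(), simplify = FALSE)

# Log-rank test p-value for each simulated dataset
logrank_p <- vapply(sims, function(d) {
  fit <- survdiff(Surv(V1, V2) ~ V3, data = d)
  pchisq(fit$chisq, df = length(fit$n) - 1, lower.tail = FALSE)
}, numeric(1))

# Share of datasets in which the test rejects at the 5% level
mean(logrank_p < 0.05)
```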

Examples

Below are brief examples demonstrating how to simulate data. In practice, replace the placeholders (example_data, "time", etc.) with your actual dataset and column names.

library(RealSurvSim)

# Example dataset construction (for demonstration):
set.seed(123)
example_data <- data.frame(
  time = rexp(100, rate = 0.1),            # Times
  status = sample(0:1, 100, replace = TRUE), # 0=censored, 1=event
  group = sample(0:1, 100, replace = TRUE)   # Two groups, 0 or 1
)

# 1. Kernel Density Estimation Simulation
sim_kde <- RealSurvSim(
  dat = example_data,
  col_time   = "time",
  col_status = "status",
  col_group  = "group",
  reps       = 5,            # Simulate 5 datasets
  simul_type = "KDE"         # Use KDE-based simulation
)
str(sim_kde$datasets)  # Check the structure of generated datasets

# 2. Parametric Distribution Simulation
sim_distr <- RealSurvSim(
  dat = example_data,
  col_time   = "time",
  col_status = "status",
  col_group  = "group",
  reps       = 5,
  simul_type = "distr",
  distribs   = c("exp", "exp", "exp", "exp")
)
str(sim_distr$datasets)

# 3. Conditional Bootstrap
sim_cond <- RealSurvSim(
  dat = example_data,
  col_time   = "time",
  col_status = "status",
  col_group  = "group",
  reps       = 5,
  simul_type = "cond"
)
str(sim_cond$datasets)

# 4. Case Resampling
sim_case <- RealSurvSim(
  dat = example_data,
  col_time   = "time",
  col_status = "status",
  col_group  = "group",
  reps       = 5,
  simul_type = "case"
)
str(sim_case$datasets)

# Example datasets loaded with data()
data(liang)
data(wu)

# 5. KDE simulation on the liang dataset
liang_kde <- RealSurvSim(liang, "V1", "V2", "V3", reps = 3, simul_type = "KDE")

# 6. Parametric simulation with arbitrary per-group sample sizes (n)
arbliang_distr <- RealSurvSim(liang, "V1", "V2", "V3", reps = 10, n = c(40, 50),
                              simul_type = "distr",
                              distribs = c("exp", "llogis", "llogis", "exp"))

# 7. Case resampling with arbitrary per-group sample sizes (n)
arbwu_case <- RealSurvSim(wu, "V1", "V2", "V3", reps = 100, n = c(40, 50), simul_type = "case")

References and Further Reading

Underlying Paper for the Package
Analysis and Methods for Survival Data (arXiv:2308.07842)

Data Reconstruction Algorithm
Guyot et al. (2012), describing the algorithm for reconstructing survival data from published Kaplan-Meier curves.

WebPlotDigitizer
WebPlotDigitizer for extracting data points from Kaplan-Meier curves.
