simulatedsurvivaldata: Breast Cancer Survival Data from Rotterdam and Germany

Description

The training dataset contains real-life survival data from patients who underwent primary surgery for breast cancer between 1978 and 1993 in Rotterdam. The patients were followed until 2007, resulting in a model development cohort of 2982 patients after exclusions. The primary outcome measured was recurrence-free survival, defined as the time from primary surgery to recurrence or death.

The validation dataset consists of 686 patients with primary node-positive breast cancer from the German Breast Cancer Study Group. In this cohort, 285 patients suffered a recurrence or died within 5 years of follow-up, while 280 were censored before 5 years. A five-year prediction horizon was chosen because it was close to the lower of the two cohorts' median survival times (Rotterdam cohort, 6.7 years; German cohort, 4.9 years).

Usage

data(trainDataSurvival)
data(testDataSurvival)

Format

A data frame with observations on the following 26 variables.

pid: patient identifier
year: year of surgery
age: age at surgery
meno: menopausal status (0 = premenopausal, 1 = postmenopausal)
size: tumor size, a factor with levels <= 20, 20-50, >50
grade: differentiation grade
nodes: number of positive lymph nodes
pgr: progesterone receptors (fmol/l)
er: estrogen receptors (fmol/l)
hormon: hormonal treatment (0 = no, 1 = yes)
chemo: chemotherapy
rtime: days to relapse or last follow-up
recur: 0 = no relapse, 1 = relapse
dtime: days to death or last follow-up
death: 0 = alive, 1 = dead
ryear: follow-up time for recurrence-free survival (RFS), in years (numeric)
rfs: recurrence-free survival status (0 = no event, 1 = event) (numeric)
pgr2: Winsorized progesterone receptor level (numeric)
nodes2: Winsorized node count (numeric)
csize: Categorized tumor size, copied from size (factor)
cnode: Categorized node involvement (factor: "0", "1-3", ">3")
grade3: Recoded grade factor (levels: "1-2", "3")
nodes3: Restricted cubic spline basis for nodes2 (numeric)
pgr3: Restricted cubic spline basis for original pgr (numeric)
epoch: Follow-up epoch indicator after splitting at 5 years (numeric)

Details

The data sets are based on the publicly available code and data used in the repository Prediction_performance_survival by Giardiello et al. (2023), which accompanies the Annals of Internal Medicine article "Assessing Performance and Clinical Usefulness in Prediction Models With Survival Outcomes: Practical Guidance for Cox Proportional Hazards Models".

All preprocessing steps were performed exactly as in the Giardiello code: converting survival time to years, defining recurrence-free survival status via `rfs = pmax(recur, death)`, correcting 43 discordant cases using death time, 99th-percentile winsorization of `pgr` and `nodes`, spline transformations (`nodes3`, `pgr3`), splitting follow-up at 5 years (`epoch`), and recoding categorical variables (`csize`, `cnode`, `grade3`).
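As a rough illustration of a few of these steps, the sketch below applies them to a small made-up data frame in base R. The data frame `toy` and all of its values are hypothetical, and the code is one reading of the steps listed above, not a copy of the repository code. In particular, the rule used to flag discordant cases (death recorded, no recurrence, and `rtime` shorter than `dtime`) is an assumption about what "using death time" means.

```r
# Toy stand-in for the raw columns (all values are invented for illustration).
toy <- data.frame(
  rtime = c(1000, 500, 2000, 800),
  recur = c(0, 1, 0, 0),
  dtime = c(1500, 900, 2000, 800),
  death = c(1, 1, 0, 0),
  pgr   = c(10, 5000, 50, 0),
  nodes = c(0, 30, 2, 1)
)

# Follow-up time in years and the composite recurrence-free survival status.
toy$ryear <- toy$rtime / 365.25
toy$rfs   <- pmax(toy$recur, toy$death)

# Assumed discordance rule: the patient died without a recorded recurrence,
# but rtime stops before dtime. For those rows, use the death time instead.
fix <- toy$recur == 0 & toy$death == 1 & toy$rtime < toy$dtime
toy$ryear[fix] <- toy$dtime[fix] / 365.25

# Winsorize at the 99th percentile: values above the cutoff are capped at it.
toy$pgr2   <- pmin(toy$pgr,   quantile(toy$pgr,   0.99, names = FALSE))
toy$nodes2 <- pmin(toy$nodes, quantile(toy$nodes, 0.99, names = FALSE))

# Recode node count into the three categories described for cnode.
toy$cnode <- cut(toy$nodes, breaks = c(-Inf, 0, 3, Inf),
                 labels = c("0", "1-3", ">3"))

str(toy)
```

The spline bases (`nodes3`, `pgr3`) and the 5-year split (`epoch`) are left out of this sketch because their exact knot and cut settings are not stated here.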

The training dataset, trainDataSurvival, consists of 2982 patients, with 1713 events occurring over a maximum follow-up time of 19.3 years. The estimated median potential follow-up time, calculated using the reverse Kaplan-Meier method, was 9.3 years. Of these patients, 1275 suffered a recurrence or died within the follow-up time of interest (5 years), and 126 were censored before 5 years.
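The reverse Kaplan-Meier method estimates potential follow-up by swapping the roles of event and censoring, so that censoring is treated as the event of interest. The sketch below shows the idea on simulated data, using the `survival` package. The data frame `toy`, the simulation settings, and the exact boundary handling in the 5-year counts are assumptions for illustration; the column names `ryear` and `rfs` are borrowed from the Format section, and the numbers produced will not match the figures quoted above.

```r
library(survival)

# Simulated follow-up data (hypothetical; not the Rotterdam cohort).
set.seed(1)
n <- 200
event_time  <- rexp(n, rate = 0.15)          # time to recurrence or death
censor_time <- runif(n, min = 2, max = 15)   # time to end of follow-up
toy <- data.frame(
  ryear = pmin(event_time, censor_time),
  rfs   = as.numeric(event_time <= censor_time)
)

# Reverse Kaplan-Meier: flip the status so that censoring (rfs == 0)
# counts as the event, then read off the median of that curve.
rev_km <- survfit(Surv(ryear, rfs == 0) ~ 1, data = toy)
median_followup <- unname(quantile(rev_km, probs = 0.5)$quantile)

# Counts at a 5-year horizon: events up to 5 years, and patients
# censored before 5 years (boundary handling is an assumption).
horizon <- 5
events_by_horizon   <- sum(toy$rfs == 1 & toy$ryear <= horizon)
censored_by_horizon <- sum(toy$rfs == 0 & toy$ryear <  horizon)

median_followup
events_by_horizon
censored_by_horizon
```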

The validation dataset, testDataSurvival, consists of 686 patients with primary node-positive breast cancer from the German Breast Cancer Study Group. In this cohort, 285 patients suffered a recurrence or died within 5 years of follow-up, while 280 were censored before 5 years. A five-year prediction horizon was chosen because it was close to the lower of the two cohorts' median survival times (Rotterdam cohort, 6.7 years; German cohort, 4.9 years).

References

David J. McLernon, Daniele Giardiello, Ben Van Calster, et al. (2023). Assessing Performance and Clinical Usefulness in Prediction Models With Survival Outcomes: Practical Guidance for Cox Proportional Hazards Models. Annals of Internal Medicine, 176(1), pp. 105-114, doi:10.7326/M22-0844

Examples

data(testDataSurvival)
## Explore the structure of the dataset
str(testDataSurvival)