The training dataset contains real-life survival data from patients who underwent primary surgery for breast cancer between 1978 and 1993 in Rotterdam. The patients were followed until 2007, resulting in a model development cohort of 2982 patients after exclusions. The primary outcome measured was recurrence-free survival, defined as the time from primary surgery to recurrence or death.
The validation dataset consists of 686 patients with primary node-positive breast cancer from the German Breast Cancer Study Group. In this cohort, 285 patients suffered a recurrence or died within 5 years of follow-up, while 280 were censored before 5 years. A 5-year prediction horizon was chosen because it approximates the lower of the two cohorts' median survival times (Rotterdam cohort, 6.7 years; German cohort, 4.9 years).
data(trainDataSurvival)
data(testDataSurvival)
A data frame with observations on the following 26 variables.
patient identifier
year of surgery
age at surgery
menopausal status (0 = premenopausal, 1 = postmenopausal)
tumor size, a factor with levels <= 20, 20-50, >50
differentiation grade
number of positive lymph nodes
progesterone receptors (fmol/l)
estrogen receptors (fmol/l)
hormonal treatment (0 = no, 1 = yes)
chemotherapy
days to relapse or last follow-up
0 = no relapse, 1 = relapse
days to death or last follow-up
0 = alive, 1 = dead
Follow-up time for RFS, in years (numeric)
Recurrence-free survival status (0 = no event, 1 = event) (numeric)
Winsorized progesterone receptor level (numeric)
Winsorized node count (numeric)
Categorized tumor size, copied from size (factor)
Categorized node involvement (factor: "0", "1-3", ">3")
Recoded grade factor (levels: "1-2", "3")
Restricted cubic spline basis for nodes2 (numeric)
Restricted cubic spline basis for original pgr (numeric)
Follow-up epoch indicator after splitting at 5 years (numeric)
The data sets are based on the publicly available code and data used in the repository Prediction_performance_survival by Giardiello et al. (2023), which accompanies the Annals of Internal Medicine article "Assessing Performance and Clinical Usefulness in Prediction Models With Survival Outcomes: Practical Guidance for Cox Proportional Hazards Models".
All preprocessing steps were performed exactly as in the Giardiello code: converting survival time to years, defining recurrence-free survival status via `rfs = pmax(recur, death)`, correcting 43 discordant cases using death time, winsorizing `pgr` and `nodes` at the 99th percentile, applying spline transformations (`nodes3`, `pgr3`), splitting follow-up at 5 years (`epoch`), and recoding categorical variables (`csize`, `cnode`, `grade3`).
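The first few of these steps can be illustrated with a minimal base R sketch. It is not the Giardiello code itself: the data frame `toy` and its values are made up, the column names `ryear` and `pgr2` are assumptions chosen for illustration, and the discordance rule shown (death recorded after a relapse-free follow-up that ended earlier) is one plausible reading of "correcting discordant cases using death time".

```r
# Toy stand-in for the raw columns (all values are invented for illustration).
toy <- data.frame(
  rtime = c(1000, 2500, 800, 4000),  # days to relapse or last follow-up
  recur = c(1, 0, 0, 0),             # 0 = no relapse, 1 = relapse
  dtime = c(1500, 2500, 1200, 4000), # days to death or last follow-up
  death = c(1, 0, 1, 0),             # 0 = alive, 1 = dead
  pgr   = c(10, 250, 0, 3000),
  nodes = c(0, 2, 5, 30)
)

# Recurrence-free survival status: an event is a relapse or a death.
toy$rfs <- with(toy, pmax(recur, death))

# Follow-up time in years (assumed column name: ryear).
toy$ryear <- toy$rtime / 365.25

# Assumed discordance rule: no relapse, but death recorded later than the
# relapse follow-up time. For those rows, use the death time instead.
discordant <- with(toy, rfs == 1 & recur == 0 & death == 1 & rtime < dtime)
toy$ryear[discordant] <- toy$dtime[discordant] / 365.25

# Winsorize at the 99th percentile: cap values above that quantile.
toy$pgr2   <- pmin(toy$pgr,   quantile(toy$pgr,   0.99))
toy$nodes2 <- pmin(toy$nodes, quantile(toy$nodes, 0.99))

# Categorized node involvement with the levels "0", "1-3", ">3".
toy$cnode <- cut(toy$nodes, breaks = c(-1, 0, 3, Inf),
                 labels = c("0", "1-3", ">3"))

print(toy)
```

With only four rows the 99th percentile is an interpolated value close to the maximum, so the capping is visible only on the largest observation; on the real data it affects roughly the top 1% of values.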
The training dataset, trainDataSurvival, consists of 2982 patients, with 1713 events occurring over a maximum follow-up time of 19.3 years. The estimated median potential follow-up time, calculated using the reverse Kaplan-Meier method, was 9.3 years. Of these patients, 1275 suffered a recurrence or died within the follow-up time of interest (5 years), and 126 were censored before 5 years.
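As a rough illustration of the reverse Kaplan-Meier idea, the sketch below estimates a median potential follow-up time on a small made-up data frame. It assumes the survival package is installed, and the column names `ryear` (time in years) and `rfs` (1 = event, 0 = censored) are assumptions for this example rather than confirmed names from the datasets.

```r
library(survival)  # assumed to be installed

# Toy follow-up data (invented values).
toy <- data.frame(
  ryear = c(1, 2, 3, 4, 5, 6),
  rfs   = c(1, 0, 1, 0, 0, 0)
)

# Reverse Kaplan-Meier: swap the roles of event and censoring, so that
# being censored is treated as the "event" of interest.
rev_km <- survfit(Surv(ryear, rfs == 0) ~ 1, data = toy)

# The median of this curve estimates the median potential follow-up time.
median_followup <- quantile(rev_km, probs = 0.5, conf.int = FALSE)
print(median_followup)
```

The same call pattern could be applied to trainDataSurvival once the actual time and status column names have been checked with `str()`.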
The validation dataset, testDataSurvival, consists of 686 patients with primary node-positive breast cancer from the German Breast Cancer Study Group. In this cohort, 285 patients suffered a recurrence or died within 5 years of follow-up, while 280 were censored before 5 years. A 5-year prediction horizon was chosen because it approximates the lower of the two cohorts' median survival times (Rotterdam cohort, 6.7 years; German cohort, 4.9 years).
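Counts such as "events within 5 years" and "censored before 5 years" can be tabulated along the following lines. This is a base R sketch on invented values; the column names `ryear` and `rfs` and the helper names `horizon`, `time5`, and `status5` are assumptions for illustration only.

```r
# Toy data (invented values): follow-up in years and event status.
toy <- data.frame(
  ryear = c(1.2, 4.9, 5.5, 7.0, 3.0),
  rfs   = c(1, 0, 1, 0, 1)
)

horizon <- 5  # prediction horizon in years

# Patients with an event at or before the horizon.
events_by_horizon <- sum(toy$rfs == 1 & toy$ryear <= horizon)

# Patients whose follow-up ended without an event before the horizon.
censored_before_horizon <- sum(toy$rfs == 0 & toy$ryear < horizon)

# Administrative censoring at the horizon: follow-up is truncated at 5 years
# and events occurring later are treated as censored.
toy$time5   <- pmin(toy$ryear, horizon)
toy$status5 <- ifelse(toy$ryear > horizon, 0, toy$rfs)

print(c(events = events_by_horizon, censored = censored_before_horizon))
```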
David J. McLernon, Daniele Giardiello, Ben Van Calster, et al. (2023). Assessing Performance and Clinical Usefulness in Prediction Models With Survival Outcomes: Practical Guidance for Cox Proportional Hazards Models. Annals of Internal Medicine, 176(1), pp. 105-114, doi:10.7326/M22-0844
data(testDataSurvival)
## Explore the structure of the dataset
str(testDataSurvival)