rSplit: Stratified Random Split Sampling

Description

Random split sampling, stratified based on the type of the response.

Usage

rSplit(y, nsplit, stratify = TRUE, s_ratio = 0.8, ...)

Value

Function rSplit returns a length-nsplit list of logical vectors. In each vector, the TRUE elements indicate training subjects and the FALSE elements indicate test subjects.

Arguments

y: response \(y\); a double vector, a logical vector, a factor, or a Surv object
nsplit: positive integer scalar, number of replicates of random splits to be performed
stratify: logical scalar, whether to stratify the split by the response \(y\), default TRUE
s_ratio: double scalar between 0 and 1, split ratio, i.e., proportion of training subjects \(p\), default 0.8
...: additional parameters, currently not in use

Details

Function rSplit performs random split sampling, with or without stratification. The behavior depends on stratify and on the type of the response \(y\):

If stratify = FALSE, or if the response \(y\) is a double vector, the sample is split into a training set and a test set by odds \(p/(1-p)\), without stratification.
Otherwise, if \(y\) is a Surv object, the split is stratified by censoring status: subjects with an observed event are split into a training and a test set by odds \(p/(1-p)\), and censored subjects are split by the same odds. The two training sets are then combined, as are the two test sets.
Otherwise, if \(y\) is a logical vector, the split is stratified by \(y\) itself: subjects with a TRUE response and subjects with a FALSE response are each split into a training and a test set by odds \(p/(1-p)\). The training sets and the test sets are then combined as described above.
Otherwise, if \(y\) is a factor, the split is stratified by its levels: the subjects in each level of \(y\) are split into a training and a test set by odds \(p/(1-p)\). The training sets and the test sets from all levels of \(y\) are then combined.
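
The stratification logic described above can be illustrated with a minimal base-R sketch. It is not the package's implementation: the helper name split_once_sketch is hypothetical, the rounding of the per-stratum training count is an assumption, and only logical and factor responses are stratified (a Surv response would be stratified the same way using its censoring status). Replicates are collected in a list here, which is also an assumption made for illustration.

```r
# Hypothetical helper, not part of the package: performs ONE random split,
# optionally stratified by a logical or factor response.
split_once_sketch <- function(y, stratify = TRUE, s_ratio = 0.8) {
  n <- length(y)
  # Double responses, or stratify = FALSE, are treated as a single stratum.
  strata <- if (stratify && (is.logical(y) || is.factor(y))) {
    as.integer(factor(y))
  } else {
    rep(1L, n)
  }
  train <- logical(n)  # all FALSE (test) until selected for training
  for (idx in split(seq_len(n), strata)) {
    # Assumption: the training count per stratum is rounded to an integer.
    n_train <- round(length(idx) * s_ratio)
    # Sample positions within idx so a length-1 stratum is handled correctly.
    train[idx[sample.int(length(idx), size = n_train)]] <- TRUE
  }
  train
}

set.seed(1)
y <- rep(c(TRUE, FALSE), times = c(20, 30))
# Three replicates, each a logical vector with one element per subject
splits <- replicate(3L, split_once_sketch(y), simplify = FALSE)
table(y, train = splits[[1]])
```

With s_ratio = 0.8, each replicate in this sketch assigns 16 of the 20 TRUE subjects and 24 of the 30 FALSE subjects to the training set, so the class proportions are preserved in both sets.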

Examples

rSplit(y = rep(c(TRUE, FALSE), times = c(20, 30)), nsplit = 3L)
