create_folds: Create Folds

Description

This function provides a list of row indices used for k-fold cross-validation (basic, stratified, grouped, or blocked). Repeated fold creation is supported as well. By default, in-sample indices are returned.

Usage

create_folds(
  y,
  k = 5L,
  type = c("stratified", "basic", "grouped", "blocked"),
  n_bins = 10L,
  m_rep = 1L,
  use_names = TRUE,
  invert = FALSE,
  shuffle = FALSE,
  seed = NULL
)

Value

If invert = FALSE (the default), a list with in-sample row indices. If invert = TRUE, a list with out-of-sample indices.
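
The relationship between the two return modes can be checked directly. The following is a minimal sketch, assuming create_folds() is the function exported by the splitTools package (the package name is not stated on this page) and that calls with the same seed are based on the same random partition.

```r
library(splitTools)  # assumption: create_folds() comes from this package

y <- rep(letters[1:4], each = 5)

# Same seed in both calls, so both are expected to use the same partition
in_sample  <- create_folds(y, k = 5, seed = 1)
out_sample <- create_folds(y, k = 5, invert = TRUE, seed = 1)

# One list element per fold in each case
length(in_sample)
length(out_sample)

# Per fold: in-sample and out-of-sample rows are disjoint and cover all rows
covers_all <- mapply(
  function(i, o) {
    setequal(c(i, o), seq_along(y)) && length(intersect(i, o)) == 0
  },
  in_sample, out_sample
)
covers_all
```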

Arguments

y: The variable used for stratification (type = "stratified") or for grouping (type = "grouped"). For other split types, any vector of the same length as the data to be split.
k: Number of folds.
type: Split type. One of "stratified" (default), "basic", "grouped", "blocked".
n_bins: Approximate number of bins for numeric y (only for type = "stratified").
m_rep: How many times should the data be split into k folds? Default is 1, i.e., no repetitions.
use_names: Should folds be named? Default is TRUE.
invert: Set to TRUE in order to receive out-of-sample indices. Default is FALSE, i.e., in-sample indices are returned.
shuffle: Should row indices be randomly shuffled within folds? Default is FALSE.
seed: Integer random seed.
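
As an illustration of how m_rep and seed interact, here is a short sketch, again assuming the function is loaded from the splitTools package. It relies only on the arguments documented above.

```r
library(splitTools)  # assumption: create_folds() comes from this package

y <- rep(letters[1:4], each = 5)

# Three repetitions of 2-fold splitting: one list element per fold and repetition
rep_folds <- create_folds(y, k = 2, m_rep = 3, seed = 42)
n_sets <- length(rep_folds)
n_sets

# Calling again with the same seed should reproduce the same indices
rep_folds_again <- create_folds(y, k = 2, m_rep = 3, seed = 42)
same_result <- identical(rep_folds, rep_folds_again)
same_result
```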

Details

By default, the function uses stratified splitting, which balances the folds with respect to the distribution of the input vector y. (Numeric input is first binned into approximately n_bins quantile groups.) If type = "grouped", groups specified by y are kept together when splitting, which is relevant for clustered or panel data. In contrast to basic splitting, type = "blocked" does not sample indices at random but keeps them in sequential blocks.
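
The grouped and blocked behaviors described above can be verified on small inputs. This is a sketch under the assumption that create_folds() comes from the splitTools package; the subject vector is hypothetical data made up for illustration.

```r
library(splitTools)  # assumption: create_folds() comes from this package

# Hypothetical panel data: 6 subjects with 3 rows each
subject <- rep(paste0("s", 1:6), each = 3)

# Out-of-sample indices of a grouped split
grouped <- create_folds(subject, k = 3, type = "grouped", invert = TRUE, seed = 1)

# Per fold: no subject appears both in the out-of-sample and the in-sample rows
groups_intact <- sapply(grouped, function(idx) {
  length(intersect(subject[idx], subject[-idx])) == 0
})
groups_intact

# Out-of-sample indices of a blocked split on 12 rows
blocked <- create_folds(seq_len(12), k = 3, type = "blocked", invert = TRUE)

# Per fold: indices form one run of consecutive row numbers
is_sequential <- sapply(blocked, function(idx) all(diff(idx) == 1))
is_sequential
```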

Examples

y <- rep(letters[1:4], each = 5)
create_folds(y)
create_folds(y, k = 2)
create_folds(y, k = 2, m_rep = 2)
create_folds(y, k = 3, type = "blocked")
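
A typical use of the returned in-sample indices is a manual cross-validation loop. The sketch below assumes create_folds() is loaded from the splitTools package. The linear model on the built-in iris data and the RMSE metric are illustrative choices, not part of the function itself.

```r
library(splitTools)  # assumption: create_folds() comes from this package

# Stratified 5-fold split on a numeric response (in-sample indices by default)
folds <- create_folds(iris$Sepal.Length, k = 5, seed = 1)

# Fit on the in-sample rows of each fold, evaluate on the remaining rows
rmse <- sapply(folds, function(in_idx) {
  fit <- lm(Sepal.Length ~ ., data = iris[in_idx, ])
  pred <- predict(fit, newdata = iris[-in_idx, ])
  sqrt(mean((iris$Sepal.Length[-in_idx] - pred)^2))
})

rmse        # one out-of-sample RMSE per fold
mean(rmse)  # cross-validated RMSE
```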