split_into_train_validate_test: Split Dataframe into: 'train', 'validate', 'test'

Description

This function randomly splits a data frame into three subsets for machine learning workflows: training, validation, and test sets. The proportions can be customized and must sum to 1.

Usage

split_into_train_validate_test(
  df,
  train_prop = 0.7,
  validate_prop = 0.15,
  test_prop = 0.15,
  seed = NULL
)

Value

A named list with three elements:

train: A data frame containing the training subset
validate: A data frame containing the validation subset
test: A data frame containing the test subset

Arguments

df: A data frame to be split into subsets.
train_prop: A numeric value between 0 and 1 specifying the proportion of data to allocate to the training set.
validate_prop: A numeric value between 0 and 1 specifying the proportion of data to allocate to the validation set.
test_prop: A numeric value between 0 and 1 specifying the proportion of data to allocate to the test set.
seed: (Optional) A numeric value used to set the random number seed within the function environment, so that the split is reproducible. Defaults to NULL.

Details

The function assigns each row to "train", "validate" or "test" with probabilities train_prop, validate_prop and test_prop, respectively.

Because each row is assigned to a bucket independently, the realized proportions can deviate noticeably from the requested ones for very small datasets. This should rarely be an issue, as data used for `iblm` must be reasonably large.
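
The independent, per-row assignment described above can be sketched in base R. This is an illustration only, not the package source: the name sketch_split is hypothetical, and calling set.seed() here changes the global random number state, which may differ from how the real function handles its seed.

```r
# Hypothetical sketch of independent per-row bucket assignment.
# Not the iblm implementation; names and seed handling are assumptions.
sketch_split <- function(df, train_prop = 0.7, validate_prop = 0.15,
                         test_prop = 0.15, seed = NULL) {
  props <- c(train = train_prop, validate = validate_prop, test = test_prop)
  stopifnot(
    is.data.frame(df),
    all(props >= 0),
    isTRUE(all.equal(sum(props), 1))  # proportions must sum to 1
  )
  # Simplification: this sets the global RNG seed.
  if (!is.null(seed)) set.seed(seed)
  # Draw one label per row, independently, with the requested probabilities.
  bucket <- sample(names(props), size = nrow(df), replace = TRUE, prob = props)
  # Fix the level order so the result always has train, validate, test,
  # even if a bucket happens to receive no rows.
  split(df, factor(bucket, levels = names(props)))
}

parts <- sketch_split(mtcars, 0.6, 0.2, 0.2, seed = 9000)
# Row counts per subset; they depend on the seed but always sum to nrow(mtcars).
sapply(parts, nrow)
```

With only 32 rows in mtcars, the counts printed by the last line will typically not match 60/20/20 exactly, which is the small-sample effect noted above.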

Examples


# Using 'mtcars'
split_into_train_validate_test(
  mtcars,
  train_prop = 0.6,
  validate_prop = 0.2,
  test_prop = 0.2,
  seed = 9000
)
