clusterBootstrap: Cluster Bootstrap

Description

Performs bootstrapping on hierarchically structured data using clustered or nested resampling at any level of the hierarchy. Allows bootstrapping of arbitrary statistics computed from the resampled dataset.
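The idea of nested resampling can be sketched in base R. The helper below, resample_nested, is hypothetical and written only for illustration; it is not the package's implementation and makes no claim about how clusterBootstrap works internally. It assumes that clusters lists the grouping variables from highest to lowest level and that replace has one entry per level.

```r
# Hypothetical helper (not part of the package): one bootstrap draw from
# data whose nesting is given by `clusters`, highest level first.
resample_nested <- function(d, clusters, replace) {
  if (length(clusters) == 0L) return(d)   # below the lowest level: keep rows
  key <- as.character(d[[clusters[1]]])
  ids <- unique(key)
  # Draw cluster ids with replacement, or keep every cluster exactly once
  drawn <- if (replace[1]) ids[sample.int(length(ids), replace = TRUE)] else ids
  parts <- lapply(drawn, function(id) {
    # Recurse into the rows of this cluster for the remaining levels
    resample_nested(d[key == id, , drop = FALSE], clusters[-1], replace[-1])
  })
  do.call(rbind, parts)
}

set.seed(1)
toy <- data.frame(id   = rep(1:4, each = 3),
                  time = rep(1:3, times = 4),
                  y    = rnorm(12))

# One draw: resample persons, then time points within each drawn person
draw <- resample_nested(toy, clusters = c("id", "time"),
                        replace = c(TRUE, TRUE))
```

Because toy is balanced, draw has the same number of rows as toy. Repeating such a draw B times and applying a statistic to each draw gives the kind of table described under bootstrapEstimates.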

Usage

clusterBootstrap(df, clusters, replace, stat_fun, B = 5000, ...)

Value

clusterBootstrap returns an object of class clusterBootstrap, containing the following elements:

call

The function call

args

Arguments passed to the function

estimates

A list with the following elements:

originalEstimates: a data.frame with one row, containing the value returned by stat_fun on the original data.
bootstrapEstimates: a data.frame with B rows, containing the value returned by stat_fun on each bootstrap sample.
bootstrapSE: the bootstrap standard error of each statistic, computed over the B rows of bootstrapEstimates.
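As a rough illustration of how the last two elements relate, the sketch below builds a stand-in for bootstrapEstimates from simulated numbers and takes the standard deviation of each column. Treating the standard error as the column-wise standard deviation is an assumption made here for illustration; the source does not state the exact formula, and the column names intercept and slope are made up.

```r
set.seed(42)

# Stand-in for bootstrapEstimates: B = 1000 rows, one column per statistic.
# The values are simulated; they do not come from clusterBootstrap().
bootstrapEstimates <- data.frame(
  intercept = rnorm(1000, mean = 2.0, sd = 0.10),
  slope     = rnorm(1000, mean = 0.5, sd = 0.05)
)

# Assumed definition: standard deviation of each column across the B rows
bootstrapSE <- sapply(bootstrapEstimates, sd)
bootstrapSE
```

The result is a named numeric vector with one entry per column of bootstrapEstimates.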

Arguments

df: A data frame. The original dataset.
clusters: A character vector of variable names that define the nested structure of the data, ordered from highest to lowest level.
replace: A logical vector indicating whether sampling should be with replacement at each level. Should be of the same length as clusters.
stat_fun: A function that takes a data frame (a bootstrap sample) and returns a numeric vector of statistics.
B: Integer. The number of bootstrap samples to generate.
...: Additional arguments passed to stat_fun.
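A stat_fun that takes extra arguments might look like the sketch below. The function name bootFunFormula and its formula argument are hypothetical, and the built-in ChickWeight data are used only as a stand-in for a clustered dataset. The commented call shows how the extra argument would be supplied through ..., going by the argument description above; it is left commented out so the block runs without the package.

```r
# Hypothetical stat_fun with one extra argument, `formula`
bootFunFormula <- function(d, formula) {
  coef(lm(formula, data = d))   # named numeric vector of coefficients
}

cw <- as.data.frame(ChickWeight)   # repeated weight measurements per chick

# Check the function on the original data before bootstrapping
est <- bootFunFormula(cw, formula = weight ~ Time)
stopifnot(is.numeric(est))
names(est)

# Assumed usage, based on the description of `...`:
# clusterBootstrap(df       = cw,
#                  clusters = "Chick",
#                  replace  = TRUE,
#                  stat_fun = bootFunFormula,
#                  B        = 1000,
#                  formula  = weight ~ Time)
```

Running stat_fun once on the full dataset is a cheap way to confirm that it returns a numeric vector before spending time on B resamples.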

Author

Mathijs Deen

Examples

if (FALSE) {
library(dplyr)
medData <- medication |>
  filter(time %% 1 == 0, time < 4)
bootFun <- function(d) lm(pos ~ treat*time, data = d)$coefficients

# Resampling on the person level only
clusterBootstrap(df       = medData, 
                 clusters = "id", 
                 replace  = TRUE, 
                 stat_fun = bootFun, 
                 B        = 5000)

# Resampling on the person level and the repeated measures level
clusterBootstrap(df       = medData, 
                 clusters = c("id", "time"), 
                 replace  = c(TRUE, TRUE), 
                 stat_fun = bootFun, 
                 B        = 5000)

# Not resampling at one level
# (e.g., by design, every class in a sampled school is included,
# but not every student in a class)
set.seed(2025)
n_school  <- 30
n_class   <- 8
n_student <- 15

demo <- expand.grid(
  school  = paste0("S", 1:n_school),
  class   = paste0("C", 1:n_class),
  student = paste0("P", 1:n_student)) |>
  mutate(score1 = rnorm(n()),
         score2 = rnorm(n())) |>
  arrange(school, class, student) |>
  slice(1:(n() - 3)) # slightly unbalanced data
bootFun2 <- function(d) lm(score1 ~ score2, data = d)$coefficients
clusterBootstrap(df       = demo, 
                 clusters = c("school", "class", "student"),
                 replace  = c(TRUE, FALSE, TRUE),
                 stat_fun = bootFun2,
                 B        = 1000)
}