sample_groups: Sample groups from a grouped dataset

Description

This helper selects a subset of groups from a grouped dataset. Groups can be drawn randomly, by ordering groups from the top or bottom according to a summary expression, or by filtering with a custom condition. The function is designed to work with datasets that were grouped using dplyr::group_by().

Usage

sample_groups(
  dataset,
  n = 1,
  sample = c("top", "bottom", "random"),
  order.by = dplyr::cur_group_id(),
  condition = NULL
)

Value

A grouped tibble containing only the sampled groups.

Arguments

dataset: A grouped dataset. Expects a data frame grouped with dplyr::group_by().
n: Number of groups to return. Defaults to 1. Ignored when condition is supplied and n is NULL.
sample: Sampling strategy. Must be one of "top" (the default), "bottom", or "random". Alternatively, a numeric vector of group positions can be supplied; positions are counted using bottom ordering, and n is ignored. When condition is provided, sample is ignored and conditional filtering is applied instead.
order.by: Expression used to order groups when sample is set to "top" or "bottom". Evaluated in a one-row summary for each group. Defaults to dplyr::cur_group_id(), i.e., the group number.
condition: Logical expression used to filter the summarised groups. Evaluated in a one-row summary for each group, which includes an .order_value column derived from order.by.
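
The interplay of order.by, the one-row summary, and condition can be sketched with plain dplyr. This is an illustrative approximation, not the function's actual implementation: mtcars grouped by cyl stands in for a real dataset, mean(mpg) stands in for order.by, and the threshold of 18 is arbitrary. Only the .order_value column name is taken from the argument descriptions above.

```r
library(dplyr)

# Stand-in data (assumption): mtcars grouped by cylinder count (4, 6, 8)
grouped <- mtcars |> group_by(cyl)

# One-row summary per group; mean(mpg) plays the role of order.by
group_summary <- grouped |>
  summarise(.order_value = mean(mpg), .groups = "drop")

# Analogue of sample = "top", n = 1: group with the largest .order_value
top_keys <- group_summary |>
  slice_max(.order_value, n = 1, with_ties = FALSE) |>
  select(cyl)

# Analogue of condition = .order_value > 18: every group passing the filter
cond_keys <- group_summary |>
  filter(.order_value > 18) |>
  select(cyl)

# Restrict the original rows to the chosen groups; grouping by cyl is kept
top_group <- grouped |> semi_join(top_keys, by = "cyl")
cond_groups <- grouped |> semi_join(cond_keys, by = "cyl")

group_keys(top_group)
group_keys(cond_groups)
```

Summarising first and joining back keeps the selection logic on one row per group, so the ordering expression and the condition are each evaluated once per group rather than once per observation.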

Examples

# gives the last group (highest group id)
sample.data.environment |>
  sample_groups() |>
  dplyr::group_keys()

# gives one random group
sample.data.environment |>
  sample_groups(sample = "random") |>
  dplyr::group_keys()

# gives the group with the highest average melanopic EDI
sample.data.environment |>
  sample_groups(order.by = mean(MEDI)) |>
  dplyr::group_keys()

# gives the group with the lowest average melanopic EDI
sample.data.environment |>
  sample_groups(sample = "bottom", order.by = mean(MEDI)) |>
  dplyr::group_keys()

# gives only groups that have a median melanopic EDI > 1000 lx
sample.data.environment |>
  sample_groups(condition = median(MEDI, na.rm = TRUE) > 1000) |>
  dplyr::group_keys()

# returns only days with more than 7 hours above 250 lx melanopic EDI
sample.data.environment |>
  add_Date_col(group.by = TRUE) |>
  sample_groups(order.by = duration_above_threshold(MEDI, Datetime, threshold = 250),
                condition = .order_value > 7*60*60) |>
  dplyr::group_keys()
  
# returns the 5 days with the longest time above 250 lx melanopic EDI
sample.data.environment |>
  add_Date_col(group.by = TRUE) |>
  sample_groups(
    n = 5,
    order.by = duration_above_threshold(MEDI, Datetime, threshold = 250)
  ) |>
  dplyr::group_keys()

# gives the first group
sample.data.environment |>
  sample_groups(sample = 1) |>
  dplyr::group_keys()

# gives the second group
sample.data.environment |>
  sample_groups(sample = 2) |>
  dplyr::group_keys()
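
Selecting by numeric position can likewise be approximated with plain dplyr. The sketch below assumes that "bottom ordering" means ascending order of the ordering value, so that position 1 is the group with the smallest value. It is not the function's actual code; mtcars, cyl, and the positions vector are hypothetical stand-ins.

```r
library(dplyr)

# Stand-in data (assumption): mtcars grouped by cylinder count (4, 6, 8)
grouped <- mtcars |> group_by(cyl)

# Default ordering value: the group id (1, 2, 3 for cyl 4, 6, 8)
group_summary <- grouped |>
  summarise(.order_value = cur_group_id(), .groups = "drop")

# Hypothetical positions, analogous to sample = c(1, 3)
positions <- c(1, 3)

picked_keys <- group_summary |>
  arrange(.order_value) |>  # ascending: smallest ordering value first
  slice(positions) |>
  select(cyl)

# Keep only rows of the picked groups; grouping by cyl is preserved
picked <- grouped |> semi_join(picked_keys, by = "cyl")

group_keys(picked)
```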
