freq_by: Frequency Table by Group (wide): n (%) with flexible ordering and formats

Description

freq_by() produces a one-level frequency table by treatment (wide layout) where each row is a category of last_group (e.g., a bucketed lab value), and each treatment column shows n (%) using distinct subject counts.

If fmt is not provided (NULL), labels are derived from the unique values present in data[[last_group]], after any na_to_code mapping has been applied.

It supports:

SAS-style rounding (use_sas_round = TRUE) for the percent.
Format mapping via either a named vector or a tibble/data.frame with columns value (codes) and raw (labels).
Ordering by the numeric value of last_group found in the data, or optionally the union of format + data codes (include_all_fmt_levels).
Counting NA under a chosen code/label using na_to_code (e.g., code "4" = "MISSING").
Auto-detecting the subject ID column when the column named by id_var is not found in data.

Usage

freq_by(
  data,
  denom_data = NULL,
  main_group,
  last_group,
  label,
  sec_ord,
  fmt = NULL,
  use_sas_round = FALSE,
  indent = 2,
  id_var = "USUBJID",
  include_all_fmt_levels = TRUE,
  na_to_code = NULL
)

Value

A tibble with:

stat (character), sort_ord (integer), and sec_ord (integer).
One column per treatment arm (e.g., trt1, trt2, …), with each cell holding "n (pct)" or "0".

Arguments

data

A data frame containing at least main_group, last_group, and an ID column.

denom_data

Optional data frame used to derive denominators (N per treatment). If NULL (default), data is used.

main_group

Character scalar. The treatment or grouping variable name (columns in output), e.g., "TRTAN".

last_group

Character scalar. The categorical code variable to tabulate (rows). Numeric or character are both accepted; converted to character for display/ordering.

label

Character scalar. A header row displayed on top (unindented).

sec_ord

Integer scalar carried through for downstream table sorting.

fmt

Optional. Either:

a named character vector like c("1"="<1","2"="1-<4",...) (names = codes, values = labels), or
a data.frame/tibble with columns value (codes) and raw (labels), or
a string naming an object (in parent frame) that resolves to either of the above.

If NULL (default), labels are derived from unique values of data[[last_group]].
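The two documented shapes of fmt carry the same information. The base R sketch below builds both from the same codes and checks that a code-to-label lookup gives identical results. The lookup shown here is only an illustration; how freq_by() resolves labels internally is not stated on this page, and the object names fmt_vec, fmt_df, and codes are made up for the example.

```r
# Named character vector: names are codes, values are labels.
fmt_vec <- c("1" = "<1", "2" = "1-<4", "4" = "MISSING")

# Equivalent data frame with columns value (codes) and raw (labels).
fmt_df <- data.frame(
  value = names(fmt_vec),
  raw   = unname(fmt_vec),
  stringsAsFactors = FALSE
)

# Illustrative lookup of labels for some observed codes (assumed logic).
codes      <- c("2", "1", "4")
labels_vec <- unname(fmt_vec[codes])
labels_df  <- fmt_df$raw[match(codes, fmt_df$value)]

identical(labels_vec, labels_df)
```

Either object could then be passed as fmt, or its name could be passed as a string such as "fmt_vec".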

use_sas_round

Logical; if TRUE, percent is rounded with SAS-compatible “round halves away from zero” via sas_round(). Default FALSE.
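To see why this option matters: base R's round() follows the IEC 60559 "round half to even" convention, so round(2.5) returns 2, whereas SAS rounds halves away from zero. The sketch below is a hypothetical helper named round_half_away(), written only to illustrate the rule; it is not the package's sas_round(), whose implementation is not shown on this page.

```r
# Hypothetical helper illustrating "round halves away from zero".
# This is NOT the package's sas_round(); it is an assumed minimal version.
round_half_away <- function(x, digits = 0) {
  scale <- 10^digits
  # Small fuzz guards against products such as 1.005 * 100 landing just below a half.
  fuzz <- sqrt(.Machine$double.eps)
  sign(x) * floor(abs(x) * scale + 0.5 + fuzz) / scale
}

round(2.5)                 # base R, half to even: 2
round_half_away(2.5)       # half away from zero: 3
round_half_away(-2.5)      # half away from zero: -3
round_half_away(12.25, 1)  # one decimal place
```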

indent

Integer number of leading spaces applied to all category rows (the first label row is not indented). Default 2.

id_var

Character scalar naming the subject identifier column. Default "USUBJID". If the named column is not found in data, the function tries common alternatives (e.g., USUBJID, SUBJID).

include_all_fmt_levels

Logical; if TRUE (default), the row order is built from the union of format codes and data codes (numeric sort). When fmt = NULL, this effectively reduces to observed data codes only.

na_to_code

Optional character scalar (e.g., "4"). If supplied, NA values in last_group are counted under that code before tabulation.
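The interaction between na_to_code and include_all_fmt_levels can be sketched in base R as follows. This is an assumed reconstruction of the documented behavior (recode NA, take the union of format and data codes, sort numerically), not the function's actual source. The variable names are illustrative, and placing non-numeric codes last is an assumption of this sketch.

```r
# Observed codes, including missing values.
x <- c("2", NA, "1", "10", NA)

# Step 1: count NA under a chosen code (here "4").
na_to_code <- "4"
x[is.na(x)] <- na_to_code

# Step 2: union of format codes and observed codes.
fmt_codes <- c("1", "2", "3", "4")
all_codes <- union(fmt_codes, unique(x))

# Step 3: numeric sort, so "10" follows "4" instead of preceding "2".
# Codes that are not numeric become NA here and sort last (assumption).
num_key  <- suppressWarnings(as.numeric(all_codes))
row_order <- all_codes[order(num_key, all_codes)]

row_order
```

Code "3" appears in the row order even though no record carries it, which is the effect of including all format levels.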

Details

Counting uses n_distinct(id_var) within each (main_group, last_group) cell.
Percent is 100 * n / N where N = distinct subjects in denom_data by main_group.
When fmt = NULL, both codes and labels are taken from the observed values of last_group (after applying na_to_code mapping), ordered numerically where possible.
Output treatment columns are normalized to trtXX if original names start with digits.
Missing treatment arms are added as "0".
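The counting rules above can be illustrated with a small base R sketch. It is a simplified stand-in for the steps described, not the function's implementation. The toy data, the helper uniq_n(), and the one-decimal percent format are assumptions made for the example.

```r
# Toy records: subject S1 appears twice in the same cell and is counted once.
adsl <- data.frame(
  USUBJID = c("S1", "S1", "S2", "S3", "S4"),
  TRTAN   = c(1, 1, 1, 2, 2),
  GRP     = c("1", "1", "2", "1", "1"),
  stringsAsFactors = FALSE
)

uniq_n <- function(x) length(unique(x))

# Denominator N: distinct subjects per treatment arm.
big_n <- tapply(adsl$USUBJID, adsl$TRTAN, uniq_n)

# Numerator n: distinct subjects per (category, treatment) cell; empty cells become 0.
cell_n <- tapply(adsl$USUBJID, list(adsl$GRP, adsl$TRTAN), uniq_n)
cell_n[is.na(cell_n)] <- 0L

# Percent = 100 * n / N, column by column.
pct <- sweep(cell_n, 2, as.numeric(big_n[colnames(cell_n)]), "/") * 100

# "n (pct)" for non-empty cells, "0" otherwise (one decimal is an assumption).
cell_txt <- ifelse(cell_n == 0, "0", sprintf("%d (%.1f)", cell_n, pct))

# Column names starting with a digit get a "trt" prefix.
nm <- colnames(cell_txt)
colnames(cell_txt) <- ifelse(grepl("^[0-9]", nm), paste0("trt", nm), nm)

cell_txt
```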

Examples


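# Requires the tibble and dplyr packages and R >= 4.1 (native pipe).
# Simulated subject-level data: 60 subjects in two treatment arms.
set.seed(1)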

toy_adsl <- tibble::tibble(
  USUBJID = sprintf("ID%03d", 1:60),
  TRTAN   = sample(c(1, 2), size = 60, replace = TRUE),
  AGE     = sample(18:85, size = 60, replace = TRUE),
  SEX     = sample(c("Male", "Female"), size = 60, replace = TRUE),
  ETHNIC  = sample(
    c("Hispanic or Latino",
      "Not Hispanic or Latino",
      "Unknown",
      NA_character_),
    size = 60, replace = TRUE
  )
) |>
  dplyr::mutate(
    AGEGR1 = dplyr::case_when(
      AGE < 65            ~ "<65 years",
      AGE >= 65 & AGE < 75 ~ "65–<75 years",
      AGE >= 75           ~ ">=75 years"
    )
  )

toy_dm <- toy_adsl |>
  dplyr::select(USUBJID, TRTAN)

# Age group: fmt = NULL, so row labels come from the observed AGEGR1 values.
freq_by(
  data       = toy_adsl,
  denom_data = toy_dm,
  main_group = "TRTAN",
  last_group = "AGEGR1",
  label      = "Age group, n (%)",
  sec_ord    = 1,
  fmt        = NULL,
  na_to_code = NULL
)

# Sex: na_to_code = "99" would count any NA under code "99" (the toy SEX column has none).
freq_by(
  data       = toy_adsl,
  denom_data = toy_dm,
  main_group = "TRTAN",
  last_group = "SEX",
  label      = "Sex, n (%)",
  sec_ord    = 2,
  fmt        = NULL,
  na_to_code = "99"
)

# Ethnicity: explicit format; NA values are recoded to "99" and labelled "Missing".
fmt_ethnic <- c(
  "Hispanic or Latino"         = "Hispanic or Latino",
  "Not Hispanic or Latino"     = "Not Hispanic or Latino",
  "Unknown"                    = "Unknown",
  "99"                         = "Missing"
)

freq_by(
  data       = toy_adsl,
  denom_data = toy_dm,
  main_group = "TRTAN",
  last_group = "ETHNIC",
  label      = "Ethnic group, n (%)",
  sec_ord    = 3,
  fmt        = fmt_ethnic,
  include_all_fmt_levels = TRUE,
  na_to_code = "99"
)
