infer_question_prefaces: Infer Question Prefaces from Variable Labels

Description

Scans variable labels in a survey design object or labelled data frame for groups of variables sharing a common preface (via separator or longest common prefix). Detected prefaces are written to question_preface in the metadata and the shared text is trimmed from each variable label, leaving only the unique suffix.

Usage

infer_question_prefaces(
  x,
  sep = c(" - ", "- ", " – ", ": ", " | "),
  min_vars = 2L,
  lcp_min = 20L,
  overwrite = FALSE,
  verbose = TRUE
)

Value

The modified x, invisibly.

Arguments

x: A survey design object (survey_taylor, survey_replicate, etc.) or a data frame with haven-style "label" attributes.
sep: Character vector of literal separator strings to try, in priority order. Default: c(" - ", "- ", " \u2013 ", ": ", " | ").
min_vars: Minimum number of variables that must share a candidate preface to trigger extraction. Default 2L.
lcp_min: Minimum character length (after trimming to a word boundary) for an LCP-derived preface to be accepted. Default 20L.
overwrite: If FALSE (default), variables that already have a question_preface are skipped and a warning is emitted. Set to TRUE to replace existing prefaces without a warning.
verbose: If TRUE (default), emits a cli summary for each detected group.

Details

Detection algorithm (two passes):

Separator pass: for each separator in sep (tried in order):
- Variables whose label contains the separator are grouped by their candidate preface (text before the first occurrence of the separator, trimmed).
- Any group with at least min_vars members is recorded; those variables are excluded from all subsequent passes.
LCP pass: for the remaining labelled variables (only if at least 2 remain):
- The character-level longest common prefix (LCP) of all remaining labels is computed and trimmed to the last word boundary.
- If the trimmed LCP is at least lcp_min characters long, the group is recorded.
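
The two passes can be approximated in base R. The following is a minimal sketch, not the package's implementation: find_preface_groups(), longest_common_prefix(), and trim_to_word() are hypothetical helper names, and the word-boundary rule (cut at the last whitespace) is an assumption about how trimming behaves.

```r
# Hypothetical helpers for illustration; none are exported by the package.

# Separator pass for one separator: the candidate preface is the trimmed
# text before the first literal occurrence of `sep`.
find_preface_groups <- function(labels, sep, min_vars = 2L) {
  pos <- as.integer(regexpr(sep, labels, fixed = TRUE))  # -1 when absent
  hit <- pos > 1L
  preface <- trimws(substr(labels[hit], 1L, pos[hit] - 1L))
  groups <- split(names(labels)[hit], preface)
  groups[lengths(groups) >= min_vars]
}

# LCP pass: character-level longest common prefix of all labels.
longest_common_prefix <- function(labels) {
  chars <- strsplit(labels, "")
  n <- min(lengths(chars))
  i <- 0L
  while (i < n && length(unique(vapply(chars, `[`, "", i + 1L))) == 1L) {
    i <- i + 1L
  }
  substr(labels[[1]], 1L, i)
}

# Assumed word-boundary rule: drop the last whitespace run and any
# partial word that follows it.
trim_to_word <- function(x) sub("\\s+\\S*$", "", x)

labels <- c(
  q1 = "Please rate discrimination - Muslims",
  q2 = "Please rate discrimination - Jews",
  q3 = "How often do you use social media",
  q4 = "How often do you use search engines"
)

# Try separators in priority order; claimed variables leave the pool.
remaining <- labels
found <- list()
for (s in c(" - ", ": ")) {
  groups <- find_preface_groups(remaining, s)
  found <- c(found, groups)
  remaining <- remaining[!names(remaining) %in% unlist(groups)]
}

# LCP fallback on whatever is left, accepted only if long enough.
lcp <- trim_to_word(longest_common_prefix(remaining))
lcp_ok <- length(remaining) >= 2L && nchar(lcp) >= 20L

names(found)  # "Please rate discrimination"
lcp           # "How often do you use"
```

In this sketch q1 and q2 are claimed by the " - " separator, so only q3 and q4 reach the LCP pass. Their raw common prefix ends mid-word ("... use s"), which is why the trim to a word boundary happens before the length check against lcp_min.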

Apply step:

- Variables with an existing question_preface are skipped when overwrite = FALSE (default); a warning is emitted listing the count of skipped variables.
- Variables whose unique suffix would be empty after trimming are always skipped with a per-variable warning.
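
The skip rules can be illustrated for a single column. This is a sketch under stated assumptions: apply_preface() is a hypothetical helper, not a package function. It takes an already computed suffix, and unlike the documented behaviour (one warning with a count) it skips silently when a preface already exists.

```r
# Hypothetical helper for illustration; not part of the package.
apply_preface <- function(col, preface, suffix, overwrite = FALSE) {
  # An empty suffix is never applied, regardless of `overwrite`.
  if (!nzchar(suffix)) {
    warning("unique suffix would be empty; variable skipped", call. = FALSE)
    return(col)
  }
  # An existing preface is kept unless overwrite = TRUE.
  has_preface <- !is.null(attr(col, "question_preface", exact = TRUE))
  if (has_preface && !overwrite) {
    return(col)
  }
  attr(col, "label") <- suffix
  attr(col, "question_preface") <- preface
  col
}

x <- 1:3
attr(x, "label") <- "Please rate discrimination - Muslims"

x <- apply_preface(x, "Please rate discrimination", "Muslims")
attr(x, "label")             # "Muslims"
attr(x, "question_preface")  # "Please rate discrimination"

# A second call leaves the column untouched unless overwrite = TRUE.
y <- apply_preface(x, "Other preface", "Other label")
z <- apply_preface(x, "Other preface", "Other label", overwrite = TRUE)
attr(y, "question_preface")  # "Please rate discrimination"
attr(z, "question_preface")  # "Other preface"
```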

Data frame integration: When called on a data frame, the detected preface is written to attr(col, "question_preface"). Passing the result to as_survey() automatically picks up both the trimmed label and the preface via the internal haven metadata extraction step.

Examples


# Data frame with haven-style labels (Qualtrics / SPSS export pattern)
df <- data.frame(
  discrim_a = 1:5,
  discrim_b = 2:6,
  discrim_c = 3:7
)
attr(df$discrim_a, "label") <-
  "Please rate discrimination - Evangelical Christians"
attr(df$discrim_b, "label") <-
  "Please rate discrimination - Muslims"
attr(df$discrim_c, "label") <-
  "Please rate discrimination - Jews"

df <- infer_question_prefaces(df, verbose = FALSE)
attr(df$discrim_a, "label")            # "Evangelical Christians"
attr(df$discrim_a, "question_preface") # "Please rate discrimination"