surveycore (version 0.8.3)

infer_question_prefaces: Infer Question Prefaces from Variable Labels

Description

Scans variable labels in a survey design object or labelled data frame for groups of variables sharing a common preface (via separator or longest common prefix). Detected prefaces are written to question_preface in the metadata and the shared text is trimmed from each variable label, leaving only the unique suffix.

Usage

infer_question_prefaces(
  x,
  sep = c(" - ", "- ", " – ", ": ", " | "),
  min_vars = 2L,
  lcp_min = 20L,
  overwrite = FALSE,
  verbose = TRUE
)

Value

The modified x, invisibly.

Arguments

x

A survey design object (survey_taylor, survey_replicate, etc.) or a data frame with haven-style "label" attributes.

sep

Character vector of literal separator strings to try, in priority order. Default: c(" - ", "- ", " \u2013 ", ": ", " | ").

min_vars

Minimum number of variables that must share a candidate preface to trigger extraction. Default 2L.

lcp_min

Minimum character length (after trimming to a word boundary) for an LCP-derived preface to be accepted. Default 20L.

overwrite

If FALSE (default), variables that already have a question_preface are skipped and a warning is emitted. Set TRUE to replace existing prefaces without warning.

verbose

If TRUE (default), emits a cli summary for each detected group.

Details

Detection algorithm (two passes):

  1. Separator pass — for each separator in sep (tried in order):

    • Variables whose label contains the separator are grouped by their candidate preface (text before the first occurrence of the separator, trimmed).

    • Any group with at least min_vars members is recorded; those variables are excluded from all subsequent passes.

  2. LCP pass — for the remaining labelled variables (run only if at least 2 remain):

    • The character-level longest common prefix (LCP) of all remaining labels is computed and trimmed to the last word boundary.

    • If the trimmed LCP is at least lcp_min characters long, the group is recorded.
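
The two passes above can be sketched in base R. This is only an illustration of the described logic, not the surveycore implementation: group_by_separator(), common_prefix() and trim_to_word() are hypothetical helper names, and the word-boundary rule (cut back to the last whitespace when the prefix ends mid-word) is an assumption about what "trimmed to the last word boundary" means.

```r
# Hypothetical helpers illustrating the two detection passes.
# They are NOT part of surveycore.

# Separator pass: group variables by the text before the first literal
# occurrence of `sep`, keeping groups with at least `min_vars` members.
group_by_separator <- function(labels, sep, min_vars = 2L) {
  # Position of the first literal (non-regex) match, or -1 if absent
  pos <- as.integer(regexpr(sep, labels, fixed = TRUE))
  has_sep <- pos > 1L  # require some text before the separator
  preface <- trimws(substr(labels[has_sep], 1L, pos[has_sep] - 1L))
  groups <- split(names(labels)[has_sep], preface)
  groups[lengths(groups) >= min_vars]
}

# LCP pass, step 1: character-level longest common prefix of all labels.
common_prefix <- function(labels) {
  n <- min(nchar(labels))
  if (n == 0L) return("")
  # One row per label, one column per character position
  chars <- do.call(rbind, strsplit(substr(labels, 1L, n), ""))
  same <- apply(chars, 2L, function(col) all(col == col[1L]))
  k <- if (all(same)) n else which(!same)[1L] - 1L
  substr(labels[[1L]], 1L, k)
}

# LCP pass, step 2 (assumed rule): drop a trailing partial word.
trim_to_word <- function(prefix) {
  if (!grepl("\\s$", prefix)) prefix <- sub("\\S*$", "", prefix)
  trimws(prefix)
}

labels <- c(
  discrim_a = "Please rate discrimination - Evangelical Christians",
  discrim_b = "Please rate discrimination - Muslims",
  age       = "Age of respondent"
)
group_by_separator(labels, sep = " - ")

lcp <- common_prefix(c(
  "How satisfied are you with schools",
  "How satisfied are you with safety"
))
candidate <- trim_to_word(lcp)
nchar(candidate) >= 20L  # compare against lcp_min
```

In this sketch the separator is matched with fixed = TRUE because sep is documented as a vector of literal strings rather than regular expressions.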

Apply step:

  • Variables with an existing question_preface are skipped when overwrite = FALSE (default); a warning is emitted listing the count of skipped variables.

  • Variables whose unique suffix would be empty after trimming are always skipped with a per-variable warning.
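
The skip rules of the apply step can be illustrated on a plain data frame. The helper below, apply_preface(), is a hypothetical name written for this sketch and is not the package's internal code; it assumes every label starts with the preface and that the warning wording is free to differ from what surveycore emits.

```r
# Hypothetical helper illustrating the apply step; not part of surveycore.
apply_preface <- function(df, vars, preface, sep = "", overwrite = FALSE) {
  n_existing <- 0L
  for (v in vars) {
    # Rule 1: keep existing prefaces unless overwrite = TRUE
    existing <- attr(df[[v]], "question_preface", exact = TRUE)
    if (!overwrite && !is.null(existing)) {
      n_existing <- n_existing + 1L
      next
    }
    # Strip the preface, then a leading separator if one is present
    rest <- substring(attr(df[[v]], "label", exact = TRUE), nchar(preface) + 1L)
    if (nzchar(sep) && startsWith(rest, sep)) {
      rest <- substring(rest, nchar(sep) + 1L)
    }
    suffix <- trimws(rest)
    # Rule 2: never leave a variable with an empty label
    if (!nzchar(suffix)) {
      warning("Skipping `", v, "`: label would be empty after trimming.",
              call. = FALSE)
      next
    }
    attr(df[[v]], "label") <- suffix
    attr(df[[v]], "question_preface") <- preface
  }
  if (n_existing > 0L) {
    warning(n_existing, " variable(s) already have a question_preface; skipped.",
            call. = FALSE)
  }
  df
}

df <- data.frame(trust_a = 1:3, trust_b = 1:3)
attr(df$trust_a, "label") <- "Trust in institutions - Police"
attr(df$trust_b, "label") <- "Trust in institutions - Courts"
df <- apply_preface(df, c("trust_a", "trust_b"),
                    preface = "Trust in institutions", sep = " - ")
attr(df$trust_a, "label")
attr(df$trust_a, "question_preface")
```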

Data frame integration: When called on a data frame, the detected preface is written to attr(col, "question_preface"). When the result is then passed to as_survey(), both the trimmed label and the preface are picked up automatically by the internal haven metadata extraction step.
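
To check what was written to a data frame before handing it to as_survey(), the column attributes can be collected into a small table with base R. preface_table() is a hypothetical helper for inspection only; it reads the "label" and "question_preface" attributes described above and relies on nothing from surveycore.

```r
# Hypothetical inspection helper; not part of surveycore.
preface_table <- function(df) {
  # Return a single attribute as character, or NA when it is missing
  get_attr <- function(col, attr_name) {
    value <- attr(col, attr_name, exact = TRUE)
    if (is.null(value)) NA_character_ else as.character(value)
  }
  data.frame(
    variable = names(df),
    label = vapply(df, get_attr, character(1),
                   attr_name = "label", USE.NAMES = FALSE),
    question_preface = vapply(df, get_attr, character(1),
                              attr_name = "question_preface", USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

df <- data.frame(discrim_a = 1:5, age = 21:25)
attr(df$discrim_a, "label") <- "Evangelical Christians"
attr(df$discrim_a, "question_preface") <- "Please rate discrimination"
preface_table(df)
```

Columns without a label or preface show up as NA, which makes it easy to spot variables that were skipped.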

See Also

Other metadata: classify_question_type(), extract_metadata(), extract_missing_codes(), extract_question_preface(), extract_sata(), extract_universe(), extract_val_labels(), extract_var_label(), extract_var_note(), set_missing_codes(), set_question_preface(), set_sata(), set_universe(), set_val_labels(), set_var_label(), set_var_note(), survey_metadata(), survey_weighting_history()

Examples

# Data frame with haven-style labels (Qualtrics / SPSS export pattern)
df <- data.frame(
  discrim_a = 1:5,
  discrim_b = 2:6,
  discrim_c = 3:7
)
attr(df$discrim_a, "label") <-
  "Please rate discrimination - Evangelical Christians"
attr(df$discrim_b, "label") <-
  "Please rate discrimination - Muslims"
attr(df$discrim_c, "label") <-
  "Please rate discrimination - Jews"

df <- infer_question_prefaces(df, verbose = FALSE)
attr(df$discrim_a, "label")            # "Evangelical Christians"
attr(df$discrim_a, "question_preface") # "Please rate discrimination"
