text_datagen: Generate text data via Python LLM

Description

Generates text data (for example, diary entries) conditioned on a severity score from a given questionnaire. All prompt components and example texts are supplied by the user as function arguments.

Usage

text_datagen(
  prompts,
  examples,
  scenario = NULL,
  overall_rules = NULL,
  percentile_scaffold = NULL,
  item_rules = NULL,
  items = NULL,
  structure_rules = NULL,
  percentile_specification = NULL,
  band_specification = NULL,
  example_instruction = NULL,
  what_to_write = NULL,
  task_desc = NULL,
  target_min = 90L,
  target_max = 100L,
  temperature = 0.4,
  top_p = 0.9,
  repetition_penalty = 1.1,
  model_name = "meta-llama/Meta-Llama-3-8B-Instruct",
  batch_size = 2L,
  python = Sys.getenv("RETICULATE_PYTHON", "python"),
  env = NULL,
  output_file = NULL
)

Value

A data.frame with columns id, severity, and response.

Examples

prompts <- data.frame(
  id = 1:2,
  severity = c(10, 80),
  num_examples = c(1, 1)
)
examples <- data.frame(
  text = c("Example A", "Example B"),
  label = c("group1", "group2"),
  stringsAsFactors = FALSE
)
out <- text_datagen(
  prompts = prompts,
  examples = examples,
  scenario = "This is an EMA study on depression",
  overall_rules = "Write 100 tokens of a diary entry collected every 6 hours.",
  percentile_scaffold = "The 90th percentile corresponds with severe depression and the 10th percentile corresponds with mild depression",
  item_rules = "For the 90th percentile, you should write as though you scored a 3 on all items",
  items = "Insert full battery here.",
  structure_rules = "Short paragraph.",
  percentile_specification = "Test specification.",
  band_specification = "Test bands.",
  example_instruction = "Here are examples.",
  what_to_write = "Write no less than 100 tokens and no more than 200 tokens",
  task_desc = "You are a participant in an EMA study on depression scoring in the 90th percentile of X battery.",
  target_min = 10,
  target_max = 20,
  temperature = 0.9,
  top_p = 0.9,
  repetition_penalty = 1.0,
  model_name = "sshleifer/tiny-gpt2",
  env = NULL  # No token needed
)
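
Running text_datagen() requires a working Python environment, so the shape of the return value can be illustrated with a stand-in data.frame. The sketch below is only an illustration: the response texts are made up, and the n_words column is a hypothetical addition that is not produced by the function.

```r
# Stand-in for the value returned by text_datagen(); the texts are made up.
out <- data.frame(
  id = 1:2,
  severity = c(10, 80),
  response = c("Placeholder diary text A", "Placeholder diary text B"),
  stringsAsFactors = FALSE
)

# Confirm the documented columns are present before further processing.
stopifnot(all(c("id", "severity", "response") %in% names(out)))

# Rough length check: count whitespace-separated words per response.
# Note that words are not the same as model tokens.
out$n_words <- lengths(strsplit(out$response, "\\s+"))
```

A word count like this is only a coarse proxy for target_min and target_max, which are expressed in tokens.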

Arguments

prompts: A data.frame with one row per diary to generate. Must contain at least one column indicating the severity level (e.g., severity).
examples: A data.frame of example diary texts. Must contain a character column holding the texts (e.g., text) and may contain a grouping or severity column (e.g., label).
scenario: Character string used in the SCENARIO section. This describes the situation in which the data is being collected.
overall_rules: Character string describing global writing rules.
percentile_scaffold: Character string describing how percentiles map onto severity.
item_rules: Character string describing how to internally choose symptom patterns.
items: Character string of the battery under study.
structure_rules: Character string describing structural rules (paragraphs, length, etc.).
percentile_specification: Character string describing what the severity percentile means.
band_specification: Character string describing severity bands, that is, what you expect each band of severity to look like in text.
example_instruction: Character string introducing the example texts.
what_to_write: Character string describing what the model should write about.
task_desc: Character string for the system-level role description.
target_min: Integer minimum number of tokens to generate.
target_max: Integer maximum number of tokens to generate.
temperature: Numeric temperature for sampling.
top_p: Numeric top-p nucleus sampling value.
repetition_penalty: Numeric repetition penalty.
model_name: Model identifier string to pass to transformers (e.g., "meta-llama/Meta-Llama-3-8B-Instruct", a local path, etc.).
batch_size: Integer, passed through to the Python script (not heavily used yet).
python: Path to the Python executable. Defaults to Sys.getenv("RETICULATE_PYTHON", "python").
env: Optional named character vector or list of environment variables to set for the duration of the call (e.g., c(HUGGINGFACE_HUB_TOKEN = "xxx", OPENAI_API_KEY = "yyy")). Any variables set here are restored to their previous values on exit.
output_file: Optional path to save the output CSV. If NULL, a temporary file is used and only the data.frame is returned.
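
The set-then-restore behaviour described for env can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's actual implementation: with_temp_env() and the DEMO_TOKEN variable are hypothetical names introduced here.

```r
# Hypothetical helper (not part of the package): set environment variables,
# evaluate some code, then restore the previous values on exit.
with_temp_env <- function(env, code) {
  env <- as.list(env)
  if (length(env) == 0) return(code)

  # Record current values; variables that are not set come back as NA.
  old <- Sys.getenv(names(env), unset = NA, names = TRUE)

  on.exit({
    was_set <- !is.na(old)
    # Put back variables that had a value before the call.
    if (any(was_set)) do.call(Sys.setenv, as.list(old[was_set]))
    # Remove variables that did not exist before the call.
    if (any(!was_set)) Sys.unsetenv(names(old)[!was_set])
  }, add = TRUE)

  do.call(Sys.setenv, env)
  force(code)  # evaluated lazily, so it sees the temporary values
}

# Usage: the variable is changed only while the code runs.
Sys.setenv(DEMO_TOKEN = "original")
inside <- with_temp_env(c(DEMO_TOKEN = "temporary"), Sys.getenv("DEMO_TOKEN"))
after <- Sys.getenv("DEMO_TOKEN")
```

Registering the restore step with on.exit() means the previous values are put back even if the evaluated code throws an error.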