LLMing (version 1.1.0)

text_datagen: Generate text data via Python LLM

Description

Generates text data (for example, diary entries) conditioned on a severity score from a given questionnaire, using a Python-based LLM. All prompt components and example texts are supplied by the user as function arguments.

Usage

text_datagen(
  prompts,
  examples,
  scenario = NULL,
  overall_rules = NULL,
  percentile_scaffold = NULL,
  item_rules = NULL,
  items = NULL,
  structure_rules = NULL,
  percentile_specification = NULL,
  band_specification = NULL,
  example_instruction = NULL,
  what_to_write = NULL,
  task_desc = NULL,
  target_min = 90L,
  target_max = 100L,
  temperature = 0.4,
  top_p = 0.9,
  repetition_penalty = 1.1,
  model_name = "meta-llama/Meta-Llama-3-8B-Instruct",
  batch_size = 2L,
  python = Sys.getenv("RETICULATE_PYTHON", "python"),
  env = NULL,
  output_file = NULL
)

Value

A data.frame with columns id, severity, and response.

Examples

prompts <- data.frame(
  id = 1:2,
  severity = c(10, 80),
  num_examples = c(1, 1)
)

examples <- data.frame(
  text = c("Example A", "Example B"),
  label = c("group1", "group2"),
  stringsAsFactors = FALSE
)

out <- text_datagen(
  prompts = prompts,
  examples = examples,
  scenario = "This is an EMA study on depression",
  overall_rules = "Write 100 tokens of a diary entry collected every 6 hours.",
  percentile_scaffold = "The 90th percentile corresponds with severe depression and the 10th percentile corresponds with mild depression",
  item_rules = "For the 90th percentile, you should write as though you scored a 3 on all items",
  items = "Insert full battery here.",
  structure_rules = "Short paragraph.",
  percentile_specification = "Test specification.",
  band_specification = "Test bands.",
  example_instruction = "Here are examples.",
  what_to_write = "Write no less than 100 tokens and no more than 200 tokens",
  task_desc = "You are a participant in an EMA study on depression scoring in the 90th percentile of X battery.",
  target_min = 10,
  target_max = 20,
  temperature = 0.9,
  top_p = 0.9,
  repetition_penalty = 1.0,
  model_name = "sshleifer/tiny-gpt2",
  env = NULL  # No token needed
)

Arguments

prompts

A data.frame with one row per diary to generate. Must contain at least a column indicating severity level.

examples

A data.frame of example diary texts. It should contain a character column holding the texts (e.g., text) and may contain a grouping or severity column.

scenario

Character string used in the SCENARIO section. This describes the situation in which the data is being collected.

overall_rules

Character string describing global writing rules.

percentile_scaffold

Character string describing how percentiles map onto severity.

item_rules

Character string describing how to internally choose symptom patterns.

items

Character string containing the questionnaire battery (the items) under study.

structure_rules

Character string describing structural rules (paragraphs, length, etc.).

percentile_specification

Character string describing what the severity percentile means.

band_specification

Character string describing severity bands, that is, what you expect each band of severity to look like in text.

example_instruction

Character string introducing the example texts.

what_to_write

Character string describing what the model should write about.

task_desc

Character string for the system-level role description.

target_min

Integer minimum number of tokens to generate.

target_max

Integer maximum number of tokens to generate.

temperature

Numeric temperature for sampling.

top_p

Numeric top-p nucleus sampling value.
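To make the interaction of temperature and top_p concrete, here is a minimal sketch of how temperature scaling and nucleus filtering are commonly applied to next-token scores. It is not taken from the LLMing source: the function name top_p_distribution and the toy logits are made up for illustration, and the actual sampling is done by the Python backend.

```r
# Hypothetical helper (not part of LLMing): returns the renormalised
# distribution that remains after temperature scaling and top-p filtering.
top_p_distribution <- function(logits, temperature = 0.4, top_p = 0.9) {
  # Lower temperature sharpens the distribution; higher flattens it.
  scaled <- logits / temperature
  probs <- exp(scaled - max(scaled))   # subtract max for numerical stability
  probs <- probs / sum(probs)

  # Sort tokens by probability and keep the smallest prefix whose
  # cumulative probability reaches top_p.
  ord <- order(probs, decreasing = TRUE)
  cum <- cumsum(probs[ord])
  n_keep <- which(cum >= top_p)[1]
  if (is.na(n_keep)) n_keep <- length(cum)  # guard against rounding when top_p = 1

  kept <- probs[ord[seq_len(n_keep)]]
  kept / sum(kept)
}

# Toy scores for a five-token vocabulary (made-up values).
logits <- c(a = 2.0, b = 1.0, c = 0.5, d = 0.0, e = -1.0)
kept <- top_p_distribution(logits, temperature = 0.4, top_p = 0.9)
print(kept)
```

A token would then be drawn from kept, for example with sample(names(kept), 1, prob = kept). Raising temperature or top_p lets more low-probability tokens survive the filter.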

repetition_penalty

Numeric repetition penalty.

model_name

Model identifier string to pass to transformers (e.g., "meta-llama/Meta-Llama-3-8B-Instruct", a local path, etc.).

batch_size

Integer, passed through to the Python script (not heavily used yet).

python

Path to the Python executable. Defaults to Sys.getenv("RETICULATE_PYTHON", "python").
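Before calling text_datagen, it can help to check what the default resolves to on your machine. The sketch below uses only base R; the message text is illustrative.

```r
# Resolve the same default the function uses: RETICULATE_PYTHON if set,
# otherwise the bare command name "python".
python <- Sys.getenv("RETICULATE_PYTHON", "python")

# Sys.which() returns the full path of the executable, or "" if it
# cannot be found on the PATH.
resolved <- Sys.which(python)

if (nzchar(resolved)) {
  message("Python found at: ", resolved)
} else {
  message("No executable found for '", python,
          "'; pass an explicit path via the python argument.")
}
```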

env

Optional named character vector or list of environment variables to set for the duration of the call (e.g., c(HUGGINGFACE_HUB_TOKEN = "xxx", OPENAI_API_KEY = "yyy")). Any variables set here are restored to their previous values on exit.
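The set-and-restore behaviour described here can be sketched in base R as follows. This is an assumption about the general pattern, not the package's actual implementation; with_temp_env and LLMING_DEMO_TOKEN are hypothetical names.

```r
# Hypothetical helper: set environment variables, evaluate `code`,
# then restore the previous state on exit (even if `code` errors).
with_temp_env <- function(env, code) {
  env <- as.list(env)
  if (length(env) == 0L) return(force(code))

  # Record previous values; NA marks variables that were not set.
  old <- Sys.getenv(names(env), unset = NA, names = TRUE)
  on.exit({
    was_unset <- names(old)[is.na(old)]
    if (length(was_unset)) Sys.unsetenv(was_unset)
    was_set <- old[!is.na(old)]
    if (length(was_set)) do.call(Sys.setenv, as.list(was_set))
  }, add = TRUE)

  do.call(Sys.setenv, env)
  force(code)  # `code` is evaluated lazily, after the variables are set
}

# The variable is visible inside the call and gone afterwards.
inside <- with_temp_env(
  c(LLMING_DEMO_TOKEN = "xxx"),
  Sys.getenv("LLMING_DEMO_TOKEN")
)
after <- Sys.getenv("LLMING_DEMO_TOKEN", unset = NA)
```

Passing tokens through env rather than a global Sys.setenv call keeps credentials out of the rest of the R session.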

output_file

Optional path to save the output CSV. If NULL, a temporary file is used and only the data.frame is returned.
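If you save the output, the CSV can be read back in a later session. The sketch below uses a stand-in data.frame with the documented columns (id, severity, response) instead of a real model call. The placeholder texts are made up, and it assumes the saved file is a plain comma-separated file with a header row.

```r
# Stand-in for the value returned by text_datagen().
out <- data.frame(
  id = 1:2,
  severity = c(10, 80),
  response = c("Placeholder diary text A", "Placeholder diary text B"),
  stringsAsFactors = FALSE
)

# Write to a temporary path, as you might pass to output_file.
path <- tempfile(fileext = ".csv")
write.csv(out, path, row.names = FALSE)

# Read it back and confirm the expected columns are present.
back <- read.csv(path, stringsAsFactors = FALSE)
stopifnot(all(c("id", "severity", "response") %in% names(back)))
```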