package_extraction_prompt: Generate Function Extraction Prompt for LLM Analysis

Description

Creates a prompt that guides an LLM to identify only the documentation-critical, domain-specific R functions needed for a task description. The prompt's filtering criteria exclude common, well-known functions (such as read.csv, mean, and order) that a proficient LLM can generally use correctly without explicit documentation, and focus on specialized functions where examples add value.

Usage

package_extraction_prompt(
  task_description,
  include_criteria = NULL,
  exclude_criteria = NULL,
  prioritization_factors = NULL,
  emphasis = NULL
)

Value

Character string containing the complete extraction prompt with:

Clear documentation necessity principle
Strict inclusion criteria for domain-specific functions
Comprehensive exclusion rules with concrete examples
Four-question decision heuristic for each function
Concrete good/bad examples from multiple domains
Prioritization by domain specialization and complexity
Quality-over-quantity guidance

Arguments

task_description: Character string. Detailed description of the R task or analysis workflow that needs to be performed. Should include:
- Data types and sources involved
- Analytical objectives and methods
- Expected outputs or deliverables
- Domain-specific context (e.g., bioinformatics, spatial analysis)
The more domain-specific the description, the better the function selection.
include_criteria: Character vector. Additional inclusion criteria beyond the defaults. Specify domain-specific requirements or function characteristics that should be documented. Default is NULL (use standard criteria).
exclude_criteria: Character vector. Additional exclusion criteria beyond the defaults. Specify function types or patterns that should be skipped (e.g., "Basic ggplot2 themes", "Standard dplyr verbs"). Default is NULL.
prioritization_factors: Character vector. Additional factors for prioritizing functions beyond the defaults. Specify what makes certain functions more important to document. Default is NULL (use standard priorities).
emphasis: Character string. Additional emphasis or context to guide the extraction process. Use this to highlight specific aspects of the task or to emphasize certain types of functions. Default is NULL.
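How the optional arguments combine with the built-in defaults is not shown on this page. As a rough sketch only, assuming each optional vector is simply appended to a default list, the hypothetical helper `build_section()` below (not part of the package, and using a placeholder default) illustrates why `NULL` can mean "use the defaults only":

```r
# Hypothetical helper, NOT part of the package: renders one prompt section
# as a title followed by a dash list of default and user-supplied criteria.
build_section <- function(title, defaults, extra = NULL) {
  # c() silently drops NULL, so extra = NULL leaves the defaults unchanged.
  items <- c(defaults, extra)
  paste0(title, ":\n", paste0("- ", items, collapse = "\n"))
}

# Placeholder default; the package's real default criteria are not listed here.
default_exclusions <- "Common base R functions"

defaults_only <- build_section("Exclude", default_exclusions)
with_extra <- build_section(
  "Exclude", default_exclusions,
  extra = c("Basic ggplot2 themes", "Standard dplyr verbs")
)
cat(with_extra, "\n")
```

Under this assumption, user-supplied criteria extend the defaults rather than replace them, which matches the "beyond the defaults" wording in the argument descriptions.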

Details

The generated prompt applies a "documentation necessity test": include a function only if a proficient LLM would struggle to use it without explicit documentation and examples. This keeps the extracted list focused and avoids spending tokens on functions the LLM already handles well.

The test asks four questions about each function:

1. Would a proficient LLM struggle without documentation?
2. Is this function domain-specific or universally known?
3. Does it use specialized terminology or workflows?
4. Would examples significantly improve usage accuracy?
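The four questions can be pictured as a checklist applied to each candidate function. The sketch below is illustrative only: in practice the LLM makes these judgments from the prompt text, the answers in the table are made up for the example, and the rule that all four answers must be "yes" is an assumption rather than something this page specifies.

```r
# Illustrative answers to the four questions for two candidates.
# These values are assumptions for the example, not output from the package.
candidates <- data.frame(
  fn                 = c("clusterProfiler::enrichGO", "read.csv"),
  llm_would_struggle = c(TRUE, FALSE),
  domain_specific    = c(TRUE, FALSE),
  specialized_terms  = c(TRUE, FALSE),
  examples_help      = c(TRUE, FALSE),
  stringsAsFactors   = FALSE
)

question_cols <- c("llm_would_struggle", "domain_specific",
                   "specialized_terms", "examples_help")

# Assumed rule: keep a function only when every question is answered TRUE.
candidates$include <- rowSums(candidates[question_cols]) == length(question_cols)

kept <- candidates$fn[candidates$include]
print(kept)
```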

**Automatic exclusions** (common functions that waste tokens):
- Data I/O: read.csv, write.csv, readLines
- Basic operations: order, sort, subset, head, tail
- Simple statistics: mean, median, sd, sum
- Core structures: c, list, data.frame
- Well-known tidyverse: simple dplyr::filter, dplyr::mutate
- Basic control flow: if, for, while
- Common utilities: paste, grep, unique

**What gets included** (documentation-critical functions):
- Domain-specific methods (clusterProfiler::enrichGO for GO analysis)
- Complex statistical procedures (DESeq2::DESeq)
- Specialized transformations (sf::st_transform for spatial data)
- Functions with many non-obvious parameters
- Methods where wrong usage produces plausible but incorrect results

With this approach, a task such as "GO enrichment analysis" should return clusterProfiler functions, not read.csv or order.
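The effect of the automatic exclusions can be approximated with a plain name filter. This is a sketch under assumptions: the package delegates the decision to the LLM through the prompt, and `drop_common()` is a hypothetical helper written for this example. The excluded names are copied from the exclusion list in this section (functions only, without the control-flow keywords or the tidyverse entries).

```r
# Function names taken from the "Automatic exclusions" list.
common_functions <- c(
  "read.csv", "write.csv", "readLines",
  "order", "sort", "subset", "head", "tail",
  "mean", "median", "sd", "sum",
  "c", "list", "data.frame",
  "paste", "grep", "unique"
)

# Hypothetical helper, NOT part of the package: drops candidates whose
# bare name (with any "pkg::" prefix removed) is on the exclusion list.
drop_common <- function(candidates, excluded = common_functions) {
  bare <- sub("^.*::", "", candidates)
  candidates[!(bare %in% excluded)]
}

remaining <- drop_common(c(
  "read.csv", "clusterProfiler::enrichGO", "order",
  "sf::st_transform", "mean"
))
print(remaining)
```

A static list like this cannot judge context (for example, whether a dplyr call is "simple"), which is one reason the prompt states criteria and leaves the decision to the LLM.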

Examples

# Basic usage
prompt <- package_extraction_prompt(
  "Perform GO enrichment analysis on differentially expressed genes"
)

# With domain-specific guidance
prompt <- package_extraction_prompt(
  task_description = "Single-cell RNA-seq analysis with Seurat",
  include_criteria = c(
    "Seurat-specific normalization and scaling methods"
  ),
  exclude_criteria = c(
    "Standard dplyr data manipulation"
  )
)

if (FALSE) {
  # Use with retrieve_docs. Not run: requires an LLM client object
  # (`llm`) that has been created beforehand.
  docs <- retrieve_docs(
    chat_obj = llm,
    prompt = package_extraction_prompt(
      task_description = "Perform differential expression analysis"
    )
  )
}