Run a benchmark suite against an agent and collect performance metrics.
benchmark_agent(agent, tasks, tools = NULL, verbose = TRUE)A benchmark result object with metrics.
An Agent object or model string.
A list of benchmark tasks (see details).
Optional list of tools for the agent.
Print progress.
Each task in the tasks list should have:
prompt: The task prompt
expected: Expected output or criteria
category: Optional category for grouping
ground_truth: Optional ground truth for hallucination checking