Evaluation Task
Tasks provide a flexible data structure for evaluating LLM-based tools.
Datasets contain a set of labelled samples. Datasets are just a tibble with columns input and target, where input is a prompt and target is either literal value(s) or grading guidance.
Solvers evaluate the input in the dataset and produce a final result.
Scorers evaluate the final output of solvers. They may use text comparisons (like detect_match()), model grading (like model_graded_qa()), or other custom schemes.
The usual flow of LLM evaluation with Tasks is to call $new() and then $eval(). $eval() just calls $solve(), $score(), $measure(), $log(), and $view() in order. The remaining methods are generally only recommended for expert use.
dir
The directory to which evaluation logs will be written. Defaults to vitals_log_dir().
metrics
A named vector of metric values resulting from $measure() (called inside of $eval()). Will be NULL if metrics have yet to be applied.
new()
The typical flow of LLM evaluation with vitals tends to involve first calling this method and then $eval() on the resulting object.
Task$new(
  dataset,
  solver,
  scorer,
  metrics = NULL,
  epochs = NULL,
  name = deparse(substitute(dataset)),
  dir = vitals_log_dir()
)
dataset
A tibble with, minimally, columns input and target.
solver
A function that takes a vector of inputs from the dataset's input column as its first argument and determines values approximating dataset$target. Its return value must be a list with the following elements:
result - A character vector of the final responses, with the same length as dataset$input.
solver_chat - A list of ellmer Chat objects that were used to solve each input, also with the same length as dataset$input.
Additional output elements can be included in a slot solver_metadata that has the same length as dataset$input, which will be logged in solver_metadata.
Additional arguments can be passed to the solver via $solve(...) or $eval(...). See the definition of generate() for a function that outputs a valid solver that just passes inputs to ellmer Chat objects' $chat() method in parallel. A hand-rolled solver is sketched after this argument list.
scorer
A function that evaluates how well the solver's return value approximates the corresponding elements of dataset$target. The function should take in the $get_samples() slot of a Task object and return a list with the following elements:
score - A vector of scores with length equal to nrow(samples). Built-in scorers return ordered factors with levels I < P (optionally) < C (standing for "Incorrect", "Partially Correct", and "Correct"). If your scorer returns this output type, the package will automatically calculate metrics.
Optionally:
scorer_chat - If your scorer makes use of ellmer, also include a list of ellmer Chat objects that were used to score each result, also with length nrow(samples).
scorer_metadata - Any intermediate results or other values that you'd like to be stored in the persistent log. This should also have length equal to nrow(samples).
Scorers will probably make use of samples$input, samples$target, and samples$result specifically. See model-based scoring for examples; a hand-rolled scorer is sketched after this argument list.
metrics
A named list of functions that take in a vector of scores (as in task$get_samples()$score) and output a single numeric value.
epochs
The number of times to repeat each sample. Evaluate each sample multiple times to better quantify variation. Optional, defaults to 1L. The value of epochs supplied to $eval() or $score() will take precedence over the value in $new().
name
A name for the evaluation task. Defaults to deparse(substitute(dataset)).
dir
Directory where logs should be stored.
A new Task object.
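To make these contracts concrete, the following is a hedged sketch of a hand-rolled solver, scorer, and metric passed to Task$new(). The names sequential_solver, exact_match_scorer, my_metrics, and arithmetic are illustrative, the solver calls the model sequentially rather than in parallel (the built-in generate() is usually preferable), and the sketch assumes ellmer's $chat() returns the assistant's reply as a string.

library(ellmer)
library(tibble)

# illustrative dataset with the required input and target columns
arithmetic <- tibble(
  input = c("What's 2+2?", "What's 2+3?"),
  target = c("4", "5")
)

# sketch of a custom solver: loops over the inputs sequentially (the built-in
# generate() handles this in parallel) and returns the required `result` and
# `solver_chat` elements, each the same length as the inputs
sequential_solver <- function(inputs, ...) {
  chats <- vector("list", length(inputs))
  results <- character(length(inputs))
  for (i in seq_along(inputs)) {
    chats[[i]] <- chat_anthropic(model = "claude-3-7-sonnet-latest")
    # assumes $chat() returns the assistant's reply as a string
    results[i] <- chats[[i]]$chat(inputs[[i]])
  }
  list(result = results, solver_chat = chats)
}

# sketch of a custom scorer: exact string match, returned as the ordered
# I/C factor that the package knows how to summarize
exact_match_scorer <- function(samples, ...) {
  correct <- trimws(samples$result) == trimws(samples$target)
  list(
    score = factor(ifelse(correct, "C", "I"), levels = c("I", "C"), ordered = TRUE)
  )
}

# sketch of a custom metric: proportion of fully correct scores
my_metrics <- list(accuracy = function(scores) mean(scores == "C"))

tsk_custom <- Task$new(
  dataset = arithmetic,
  solver = sequential_solver,
  scorer = exact_match_scorer,
  metrics = my_metrics,
  name = "arithmetic"
)

From here, tsk_custom$eval() runs the evaluation exactly as it would with the built-in solver and scorers.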
eval()
Evaluates the task by running the solver, scorer, logging results, and viewing (if interactive). This method works by calling $solve(), $score(), $measure(), $log(), and $view() in sequence.
The typical flow of LLM evaluation with vitals tends to involve first calling $new() and then this method on the resulting object.
Task$eval(..., epochs = NULL, view = interactive())
...
Additional arguments passed to the solver and scorer functions.
epochs
The number of times to repeat each sample. Evaluate each sample multiple times to better quantify variation. Optional, defaults to 1L. The value of epochs supplied to $eval() or $score() will take precedence over the value in $new().
view
Automatically open the viewer after evaluation (defaults to TRUE if interactive, FALSE otherwise).
The Task object (invisibly)
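For example, assuming tsk is an existing Task, each sample can be evaluated three times and the interactive viewer suppressed:

tsk$eval(epochs = 3, view = FALSE)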
get_samples()
The task's samples represent the evaluation in a data frame format. vitals_bind() row-binds the output of this function called across several tasks.
Task$get_samples()
A tibble representing the evaluation. Based on the dataset, epochs may duplicate rows, and the solver and scorer will append columns to this data.
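For example, assuming tsk_sonnet and tsk_haiku are two evaluated Tasks (the names are illustrative), their samples can be combined for comparison:

combined <- vitals_bind(sonnet = tsk_sonnet, haiku = tsk_haiku)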
solve()
Solve the task by running the solver.
Task$solve(..., epochs = NULL)
...
Additional arguments passed to the solver function.
epochs
The number of times to repeat each sample. Evaluate each sample multiple times to better quantify variation. Optional, defaults to 1L. The value of epochs supplied to $eval() or $score() will take precedence over the value in $new().
The Task object (invisibly)
score()
Score the task by running the scorer and then applying metrics to its results.
Task$score(...)
...
Additional arguments passed to the scorer function.
The Task object (invisibly)
measure()
Applies metrics to a scored Task.
Task$measure()
The Task object (invisibly)
log()
Log the task to a directory.
Note that, if a VITALS_LOG_DIR envvar is set, this will happen automatically in $eval().
Task$log(dir = vitals_log_dir())
dir
The directory to write the log to.
The path to the logged file, invisibly.
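For example, assuming tsk has been solved and scored, its log can be written to an explicit directory:

log_path <- tsk$log(dir = tempdir())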
view()
View the task results in the Inspect log viewer.
Task$view()
The Task object (invisibly)
set_solver()
Set the solver function.
Task$set_solver(solver)
solver
A function that takes a vector of inputs from the dataset's input column as its first argument and determines values approximating dataset$target. Its return value must be a list with the following elements:
result - A character vector of the final responses, with the same length as dataset$input.
solver_chat - A list of ellmer Chat objects that were used to solve each input, also with the same length as dataset$input.
Additional output elements can be included in a slot solver_metadata that has the same length as dataset$input, which will be logged in solver_metadata.
Additional arguments can be passed to the solver via $solve(...) or $eval(...). See the definition of generate() for a function that outputs a valid solver that just passes inputs to ellmer Chat objects' $chat() method in parallel.
The Task object (invisibly)
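For example, assuming tsk is an existing Task, the solver can be swapped to rerun the same evaluation against a different model (the model name below is illustrative):

tsk$set_solver(generate(chat_anthropic(model = "claude-3-5-haiku-latest")))
tsk$eval()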
set_scorer()
Set the scorer function.
Task$set_scorer(scorer)
scorer
A function that evaluates how well the solver's return value approximates the corresponding elements of dataset$target. The function should take in the $get_samples() slot of a Task object and return a list with the following elements:
score - A vector of scores with length equal to nrow(samples). Built-in scorers return ordered factors with levels I < P (optionally) < C (standing for "Incorrect", "Partially Correct", and "Correct"). If your scorer returns this output type, the package will automatically calculate metrics.
Optionally:
scorer_chat - If your scorer makes use of ellmer, also include a list of ellmer Chat objects that were used to score each result, also with length nrow(samples).
scorer_metadata - Any intermediate results or other values that you'd like to be stored in the persistent log. This should also have length equal to nrow(samples).
Scorers will probably make use of samples$input, samples$target, and samples$result specifically. See model-based scoring for examples.
The Task object (invisibly)
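Similarly, a different scorer can be swapped in before re-scoring, for instance the hypothetical exact_match_scorer sketched in the new() section above:

tsk$set_scorer(exact_match_scorer)
tsk$score()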
set_metrics()
Set the metrics that will be applied in $measure() (and thus $eval()).
Task$set_metrics(metrics)
metrics
A named list of functions that take in a vector of scores (as in task$get_samples()$score) and output a single numeric value.
The Task (invisibly)
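For example, to compute a simple accuracy metric over the ordered I/P/C scores (the metric name is illustrative):

tsk$set_metrics(list(accuracy = function(scores) mean(scores == "C")))
tsk$measure()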
get_cost()
The cost of this eval.
This is a wrapper around ellmer's $token_usage() function. That function is called at the beginning and end of each call to $solve() and $score(); this function returns the cost inferred by taking the differences in values of $token_usage() over time.
Task$get_cost()
A tibble displaying the cost of solving and scoring the evaluation by model, separately for the solver and scorer.
clone()
The objects of this class are cloneable with this method.
Task$clone(deep = FALSE)
deep
Whether to make a deep clone.
See generate() for the simplest possible solver, and scorer_model and scorer_detect for two built-in approaches to scoring.
if (!identical(Sys.getenv("ANTHROPIC_API_KEY"), "")) {
  # set the log directory to a temporary directory
  withr::local_envvar(VITALS_LOG_DIR = withr::local_tempdir())

  library(ellmer)
  library(tibble)

  simple_addition <- tibble(
    input = c("What's 2+2?", "What's 2+3?"),
    target = c("4", "5")
  )

  # create a new Task
  tsk <- Task$new(
    dataset = simple_addition,
    solver = generate(chat_anthropic(model = "claude-3-7-sonnet-latest")),
    scorer = model_graded_qa()
  )

  # evaluate the task (runs solver and scorer) and opens
  # the results in the Inspect log viewer (if interactive)
  tsk$eval()

  # $eval() is shorthand for:
  tsk$solve()
  tsk$score()
  tsk$measure()
  tsk$log()
  tsk$view()

  # get the evaluation results as a data frame
  tsk$get_samples()

  # view the task directory with $view() or vitals_view()
  vitals_view()
}