# Evaluating Language Models
Language models can perform complex tasks. Evals help measure a model's ability to perform a specific task. Evals are defined as Spicepod components and can evaluate any Spicepod model's performance.
Refer to the Cookbook for related examples.
## Overview
In Spice, an eval consists of the following core components:
- Eval: A defined task for a model to perform and a method to measure its performance.
- Eval Run: A single evaluation of a specific model.
- Eval Result: The model output and score for a single input task within an eval run.
- Eval Scorer: A method to score the model's performance on an eval result.
## Eval Components
An eval component is defined as follows:
```yaml
evals:
  - name: australia
    description: Make sure the model understands Aussies, and importantly Cricket.
    dataset: cricket_questions
    scorers:
      - match

datasets:
  - name: cricket_questions
    from: https://github.com/openai/evals/raw/refs/heads/main/evals/registry/data/cricket_situations/samples.jsonl
```
Where:
- `name` is a unique identifier for this eval (like `models`, `datasets`, etc.).
- `dataset` is the name of a `dataset` component.
- `scorers` is a list of scoring methods.
For complete details on the `evals` component, see the Spicepod reference.
## Running an Eval
To run an eval:
- Define an `eval` component (and its associated `dataset`).
- Add a language model to the Spicepod; this is the model that will be evaluated. A minimal example is shown below.
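For example, a minimal model definition (the provider and model chosen here are illustrative; any supported model source works):

```yaml
models:
  - name: my_model
    from: openai:gpt-4o-mini
    params:
      openai_api_key: ${ secrets:OPENAI_API_KEY }
```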
An eval can be started via the HTTP API:
```bash
curl -XPOST http://localhost:8090/v1/evals/australia \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "my_model"
  }'
```
Depending on the dataset and model, the eval run can take some time to complete. On completion, results will be available in two tables:
- `eval.runs`: Summarises the status and scores from the eval run.
- `eval.results`: Contains the input, expected output, and actual output for each eval run, and the score from each scorer.
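These tables can be queried like any other Spice table, for example from the `spice sql` REPL (the queries below simply list everything):

```sql
-- Summary of each eval run: status and scores
SELECT * FROM eval.runs;

-- Per-case detail: input, expected output, actual output, and scores
SELECT * FROM eval.results;
```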
## Dataset Formats
Datasets are used to define the input and expected output for an eval. Evals expect a particular format:
- For the `input` column:
  - A plain string (e.g., `"Hello, how are you?"`) is interpreted as a single user message.
  - A JSON array is interpreted as multiple OpenAI-compatible messages (e.g., `[{"role":"system","content":"You are a helpful assistant."}, ...]`).
- For the `ideal` column:
  - A plain string (e.g., `"I'm doing well, thanks!"`) is interpreted as a single assistant response.
  - A JSON array is interpreted as multiple OpenAI-compatible choices (e.g., `[{"index":0,"message":{"role":"assistant","content":"Sure!"}, ...}]`).
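For example, a JSONL dataset in this format might look like the following (the rows are illustrative):

```json
{"input": "Hello, how are you?", "ideal": "I'm doing well, thanks!"}
{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Say hi."}], "ideal": "Hi!"}
```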
To use a dataset with a different format, use a `view`. For example:
```yaml
views:
  # This view defines an eval dataset containing previous AI completion tasks from the `runtime.task_history` table.
  - name: user_queries
    sql: |
      SELECT
        json_get_json(input, 'messages') AS input,
        json_get_str((captured_output -> 0), 'content') AS ideal
      FROM runtime.task_history
      WHERE task='ai_completion'
```
## Eval Scorers
An eval scorer is a method to score the model's performance on a single eval case. A scorer receives the input given to the model, the model's output, and the expected output, and produces an associated score. Spice has several out-of-the-box scorers:
- `match`: Checks for an exact match between the expected and actual outputs.
- `json_match`: Checks that the expected and actual outputs are equivalent JSON.
- `includes`: Checks that the actual output includes the expected output.
- `fuzzy_match`: Checks whether a normalised version (ignoring casing, punctuation, articles (e.g. a, the), and excess whitespace) of either the expected or actual output is a subset of the other.
- `levenshtein`: Computes the Levenshtein distance between the two output strings, normalised to the string length. The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.
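For example, the Levenshtein distance between `kitten` and `sitting` is 3 (two substitutions and one insertion); normalising by the length of the longer string gives 3/7 ≈ 0.43. This illustrates the idea; the exact normalisation the scorer applies may differ.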
Spice has two other methods to define new scorers based on other Spicepod components:
- Embedding models can be used to compute the similarity between the expected and actual output of the model being evaluated. Any `embeddings` model defined in the `spicepod.yaml` is automatically available as a scorer.
- Other language models can be used to judge the model being evaluated. This is often called LLM-as-a-judge. Any model defined under `models` in the `spicepod.yaml` is automatically available as a scorer. Note, however, that these models should generally be configured purposefully as a judge, and must satisfy the constraints described below.
Below is an example of an eval that uses all three: a built-in scorer, an embedding model scorer, and an LLM judge.
```yaml
evals:
  - name: australia
    description: Make sure the model understands Aussies, and importantly Cricket.
    dataset: cricket_questions
    scorers:
      - hf_minilm
      - judge
      - match

embeddings:
  - name: hf_minilm
    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2

models:
  - name: judge
    from: openai:gpt-4o
    params:
      openai_api_key: ${ secrets:OPENAI_API_KEY }
      parameterized_prompt: enabled
      system_prompt: |
        Score these two stories between 0.0 and 1.0 based on how similar their moral lesson is.
        Story A: {{ .actual }}
        Story B: {{ .ideal }}
      openai_response_format:
        type: json_schema
        json_schema:
          name: judge
          schema:
            type: object
            properties:
              score:
                type: number
                format: float
            additionalProperties: true
            required:
              - score
          strict: false
```
## LLM-as-a-Judge
Spicepod models can be used to provide eval scores for other models. To do so in Spice, the LLM must:
- Return valid JSON as the response. The JSON must have at least a single number field, `score`. For example:
```json
{
  "score": 0.42,
  "rationale": "It was a good story, they both are about love."
}
```
- Use parameterized prompts to provide details about the eval step. When used as an eval scorer, the model will be provided with the following variables: `input`, `actual`, and `ideal`. The type of these variables will depend on the dataset, as per the dataset format above.
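For instance, a judge's system prompt can reference these variables with the same template syntax used in the example above (the prompt wording here is illustrative):

```yaml
system_prompt: |
  The user asked: {{ .input }}
  Score how well the candidate answer matches the reference, from 0.0 to 1.0.
  Candidate: {{ .actual }}
  Reference: {{ .ideal }}
  Respond with JSON: {"score": <number>, "rationale": "<reason>"}
```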