
An LLM eval suite is a versioned set of 50 to 200 hand-labeled cases plus a small library of graders, run on every prompt or model change to detect regressions before they ship. Build it like a test suite: cases live next to your code, four grader types cover the spectrum, and CI fails the PR if pass-rate drops. The first useful version is one Python file and an afternoon of work.
The point is not to score your model. The point is to stop shipping by vibes.
Three jobs, in order of how often you will use them:
Teams without an eval suite spend the same hours, but on Slack threads about whether the new prompt is "actually better." With one, the answer is a number.
A grader takes a model output and returns a score (usually 0 or 1, sometimes a float). Most production suites use a mix of all four. Picking only one is a tell that someone has not run the suite against a real PR yet.
The cheapest grader. The output must equal a known string or value. Use it for classification labels, structured field extraction, tool-call names, and yes/no answers.
```python
def exact_match(output, expected):
    return 1.0 if output.strip() == expected.strip() else 0.0
```
Failure mode: too rigid. "Yes." and "yes" fail unless you normalize. Reserve exact match for cases where the answer space is tiny and known.
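A common mitigation is to normalize before comparing. A minimal sketch (the function name is mine, not part of the harness above):

```python
import string

def normalized_match(output: str, expected: str) -> float:
    """Exact match after lowercasing and stripping whitespace and edge punctuation."""
    def norm(s: str) -> str:
        return s.strip().lower().strip(string.punctuation + " ")
    return 1.0 if norm(output) == norm(expected) else 0.0
```

Keep the strict version for label spaces where casing or punctuation is meaningful.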
Regex, JSON-schema validation, code execution, tool-call shape checks. Anything you can write as a pure function.
```python
import json
import jsonschema

def valid_json_shape(output, schema):
    try:
        jsonschema.validate(json.loads(output), schema)
        return 1.0
    except Exception:
        return 0.0
```
This is where 60% of your graders should live. If the model is producing structured output (a SQL query, a JSON object, a tool call), a deterministic grader is faster, cheaper, and more reproducible than asking another LLM to judge it.
A small model (often a fine-tuned BERT or a HuggingFace text classifier) scoring tone, toxicity, refusal patterns, or domain adherence. Useful when the property you care about is fuzzy but well-defined enough to train a classifier on.
Most teams skip this tier and jump straight to LLM-as-judge. That is fine for the first 200 cases. Adopt classifiers when you have enough labeled data and judge calls are eating your budget.
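The grader interface stays the same when you adopt this tier. A sketch of a refusal detector with a heuristic stand-in body (pattern list and function name are mine); a real version would swap the body for a fine-tuned classifier call without changing any caller:

```python
import re

# Heuristic stand-in for what a fine-tuned refusal classifier would learn.
REFUSAL_PATTERNS = [
    r"\bI can(?:not|'t) help\b",
    r"\bI'm (?:sorry|unable)\b",
    r"\bas an AI\b",
]

def refusal_score(output: str) -> float:
    """Return 1.0 if the output looks like a refusal, else 0.0.
    Replace the body with a model call later; the grader signature is stable."""
    hit = any(re.search(p, output, re.IGNORECASE) for p in REFUSAL_PATTERNS)
    return 1.0 if hit else 0.0
```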
A larger model (typically Claude Sonnet or GPT-5) scores the output against a rubric. Use it for open-ended outputs: summaries, drafts, multi-turn agent transcripts, anything you cannot check with a regex.
```python
JUDGE_PROMPT = """You are a strict grader. Score the candidate
answer 0 (wrong) or 1 (correct) against the reference.
Question: {question}
Reference: {reference}
Candidate: {candidate}
Return only the number."""

def llm_judge(question, reference, candidate, client):
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=4,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    return float(msg.content[0].text.strip())
```
Failure mode: non-deterministic, expensive, and the judge has its own biases. Calibrate it against 50 human-labeled cases before you trust it. If the judge agrees with humans less than 85% of the time, fix the rubric or fall back to a deterministic grader. For more on judge calibration and prompt discipline, see prompt engineering for senior software engineers.
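Calibration can be as simple as an agreement rate over the labeled set. A sketch (the 85% threshold is from the text above; the function names are mine):

```python
def judge_agreement(human_labels, judge_scores):
    """Fraction of cases where the judge's 0/1 score matches the human label."""
    assert len(human_labels) == len(judge_scores)
    matches = sum(1 for h, j in zip(human_labels, judge_scores) if h == j)
    return matches / len(human_labels)

def judge_is_trustworthy(human_labels, judge_scores, threshold=0.85):
    """True if the judge clears the agreement bar; otherwise fix the rubric."""
    return judge_agreement(human_labels, judge_scores) >= threshold
```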
The graders are easy. The eval set is the work. A bad eval set means a green dashboard and broken production. Three rules:
1. Pull cases from real failures, not your imagination. Open your logs. Find the last 30 outputs that produced a customer complaint, a thumbs-down, a retry, or a support ticket. Those are your starter cases. They are guaranteed to matter.
2. Balance the set. A useful suite has roughly 60% happy path, 25% edge cases (long inputs, weird unicode, ambiguous phrasing), and 15% explicit regression cases (the exact bug you fixed last sprint). If you only label happy paths, your model can ship broken and pass the eval.
3. Hand-label, period. No synthetic ground truth. No "the old model's answer is correct." A human (ideally the engineer who owns the feature) writes the reference for every case. This is where eval suites die: someone tries to bootstrap labels with a model and ends up grading the model against itself.
Store cases as JSONL, one per line. Version it in git next to the prompt. When the prompt changes, the eval set is part of the diff.
```jsonl
{"id": "001", "input": "summarize: ...", "reference": "...", "tags": ["happy"]}
{"id": "002", "input": "summarize: <empty>", "reference": "I cannot summarize empty input.", "tags": ["edge"]}
```
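A quick sanity check against the 60/25/15 split from rule 2 is a few lines (a sketch; the tag names follow the JSONL example, the function name is mine):

```python
import json
from collections import Counter

def tag_balance(jsonl_text: str) -> dict:
    """Return the fraction of cases carrying each tag, e.g. {'happy': 0.6}."""
    cases = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    counts = Counter(tag for c in cases for tag in c["tags"])
    return {tag: n / len(cases) for tag, n in counts.items()}
```

Run it over `evals/dataset.jsonl` whenever you add cases, and rebalance when the edge and regression fractions drift toward zero.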
You do not need a framework for the first 200 cases. Here is a complete harness:
```python
import json
import statistics
import time

from anthropic import Anthropic

client = Anthropic()

def run_model(prompt, system):
    r = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        system=system,
        messages=[{"role": "user", "content": prompt}],
    )
    return r.content[0].text

def grade(case, output):
    expected = case["reference"]
    if case["grader"] == "exact":
        return 1.0 if output.strip() == expected.strip() else 0.0
    if case["grader"] == "contains":
        return 1.0 if expected.lower() in output.lower() else 0.0
    if case["grader"] == "judge":
        return llm_judge(case["input"], expected, output, client)
    raise ValueError(f"unknown grader: {case['grader']}")

def main(system_prompt, dataset_path, out_path):
    with open(dataset_path) as f:
        cases = [json.loads(line) for line in f]
    results = []
    for c in cases:
        t0 = time.time()
        out = run_model(c["input"], system_prompt)
        score = grade(c, out)
        results.append({"id": c["id"], "score": score,
                        "latency_ms": int((time.time() - t0) * 1000),
                        "output": out})
    pass_rate = statistics.mean(r["score"] for r in results)
    with open(out_path, "w") as f:
        json.dump({"pass_rate": pass_rate, "results": results}, f, indent=2)
    print(f"pass_rate={pass_rate:.3f} ({len(results)} cases)")
    return pass_rate

if __name__ == "__main__":
    with open("prompts/system.md") as f:
        system_prompt = f.read()
    main(system_prompt, "evals/dataset.jsonl", "evals/latest.json")
```
Run it locally. Commit the JSON report. Diff it on PRs. That is a working eval suite. Most of the framework features you would adopt later (dashboards, parallel runs, trace storage) are nice to have, not blockers. For the model selection that feeds into this loop, Claude Opus vs Sonnet vs Haiku: when to use which is a useful reference.
Once you have 200 cases and three engineers running evals, the framework question gets real. Honest matrix:
| Tool | Strength | Weakness | Best for |
|---|---|---|---|
| Braintrust | Best UI for diffing eval runs side by side | SaaS-only, paid, vendor lock-in on traces | Funded teams who want a polished dashboard fast |
| Phoenix by Arize | Open source, strong tracing, OpenTelemetry-native | Setup heavier than Promptfoo, fewer canned graders | Teams already running OTel infra |
| LangSmith | Tight LangChain integration, dataset versioning | Awkward outside LangChain, pricing climbs fast | LangChain-native stacks |
| Promptfoo (OSS) | YAML config, ships in an hour, great CI story | Less expressive for long agent traces | Solo founders, first eval suite |
| Inspect AI (UK AISI) | Strong agent eval primitives, sandboxing | Steeper learning curve | Agent-heavy codebases, safety evals |
| Custom (the 80 lines above) | Zero lock-in, fork it in an afternoon | You own the maintenance | First 200 cases, then re-evaluate |
The honest answer for most teams shipping their first LLM feature: start with the 80-line custom harness or Promptfoo. Move to Braintrust or Phoenix only when the dashboard pain is real. Picking a heavyweight framework before you have 200 hand-labeled cases is a common way to spend three weeks configuring instead of grading. This same anti-pattern shows up in production RAG architecture work, where teams adopt a vector DB before they have ground truth to evaluate retrieval against.
The eval suite earns its keep in CI. The pattern:
1. On merge to main, run the suite and save the JSON report as the baseline.
2. On every PR that touches prompts/ or models/, run the full eval suite.
3. Compare against the baseline from main; fail the PR if pass-rate drops.

A GitHub Actions sketch:
```yaml
name: evals
on:
  pull_request:
    paths: ['prompts/**', 'evals/**', 'lib/llm/**']
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r evals/requirements.txt
      - run: python evals/run.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - run: python evals/compare.py main HEAD --max-drop=0.02
      - uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const diff = fs.readFileSync('evals/diff.md', 'utf8');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: diff
            });
```
The 2-point threshold is a starting heuristic. Loosen it if your suite is small (statistical noise dominates). Tighten it once you cross 200 cases. For longer-term robustness, pair the gate with the hallucination defenses you would deploy in production LLM apps, so the eval suite is catching what the runtime guardrails do not.
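The core of `evals/compare.py` is a few lines (a sketch of what that script might do; the file layout matches the harness output, but the function shape and CLI flags are assumptions):

```python
import json

def compare(baseline_path: str, candidate_path: str, max_drop: float) -> int:
    """Return a shell-style exit code: 0 if the candidate's pass-rate
    is within max_drop of the baseline, 1 if it regressed past it."""
    with open(baseline_path) as f:
        base = json.load(f)["pass_rate"]
    with open(candidate_path) as f:
        cand = json.load(f)["pass_rate"]
    drop = base - cand
    print(f"baseline={base:.3f} candidate={cand:.3f} drop={drop:+.3f}")
    return 1 if drop > max_drop else 0
```

A real script would also write the per-case diff to `evals/diff.md` for the PR comment step.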
Merge to main saves the JSON report to evals/baseline.json. PRs touching prompts/, models/, or lib/llm/ fail if pass-rate drops more than 2 points, with the diff posted as a PR comment. That is the whole loop. Iterate by adding cases every time a bug ships.
A focused mid-level engineer ships the first version in three to five days. A senior engineer ships it in two and adds the CI integration. Lead-tier work is for when you need an eval-as-a-service layer across ten LLM features.
On Cadence, that maps cleanly to the pricing tiers. Junior at $500 per week handles dataset cleanup and prompt-version hygiene. Mid at $1,000 per week ships the harness and the first 100 cases. Senior at $1,500 per week owns the CI integration and the LLM-as-judge calibration. Lead at $2,000 per week is for the multi-feature platform play.
Every Cadence engineer is AI-native by default, vetted on Cursor, Claude Code, and Copilot fluency through a founder-led voice interview before they unlock bookings. There is no non-AI-native option, which matters here: an eval suite is a meta-LLM workflow, and engineers who think in prompt-as-spec discipline build them faster than engineers who treat the LLM as a black box. Median time to first commit on Cadence is 27 hours from booking, so a five-day project starts on day one, not day six.
If you want help scoping the work before you book, getting a Build / Buy / Book recommendation for your next feature takes about two minutes.
Trying to ship the eval suite this sprint? Book a vetted Cadence engineer with a 48-hour free trial, weekly billing, and no notice period. Most teams get a working harness and the first 100 hand-labeled cases in five days.
Start with 20 to 50 cases drawn from real failure logs. Scale to 100 to 200 once patterns appear. Past 500, you usually have redundancy rather than new coverage. The marginal case stops teaching you anything new.
Use deterministic graders whenever the answer is checkable: exact match, regex, JSON shape, tool-call name. Reserve LLM-as-judge for open-ended outputs (summaries, drafts, agent transcripts) where rubric scoring is the only option. A healthy suite is roughly 70% deterministic and 30% judge.
Run the full suite only on PRs that touch prompts, models, or LLM glue code (gated by CI path filters). Run a 20-case smoke subset on every other PR. Cache LLM-judge calls keyed on the hash of input plus output, so a re-run on unchanged cases costs nothing.
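The judge cache can be keyed exactly as described, on a hash of input plus output. A sketch (directory name and function names are mine; a real version might use SQLite instead of one file per key):

```python
import hashlib
import json
import os

CACHE_DIR = ".judge_cache"  # assumption: a local, git-ignored directory

def cached_judge(case_input: str, output: str, judge_fn) -> float:
    """Call judge_fn only if this (input, output) pair has never been scored."""
    key = hashlib.sha256(f"{case_input}\x00{output}".encode()).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["score"]
    score = judge_fn(case_input, output)
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "w") as f:
        json.dump({"score": score}, f)
    return score
```

Because the key covers both the case input and the model output, any prompt change that alters an output produces a cache miss and a fresh judge call, while unchanged cases cost nothing.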
Not for the first 200 cases. A single Python or TypeScript file with four grader functions and a JSON report is enough. Adopt a framework once you need shared dashboards, multi-tenant traces, or eval-set governance across more than three engineers.
If it caught the last production bug before the bug shipped, it works. If the same pass-rate hides quality regressions you find in production, the eval set is too narrow; add the missing cases. The suite is a living artifact, not a one-time deliverable.