
An LLM eval suite is a versioned set of 50 to 200 hand-labeled cases plus a small library of graders, run on every prompt or model change to detect regressions before they ship. Build it like a test suite: cases live next to your code, four grader types cover the spectrum, and CI fails the PR if pass-rate drops. The first useful version is one Python file and an afternoon of work.
The point is not to score your model. The point is to stop shipping by vibes.
Three jobs, in order of how often you will use them:
Teams without an eval suite spend the same hours, but on Slack threads about whether the new prompt is "actually better." With one, the answer is a number.
A grader takes a model output and returns a score (usually 0 or 1, sometimes a float). Most production suites use a mix of all four. Picking only one is a tell that someone has not run the suite against a real PR yet.
The cheapest grader. The output must equal a known string or value. Use it for classification labels, structured field extraction, tool-call names, and yes/no answers.
```python
def exact_match(output, expected):
    return 1.0 if output.strip() == expected.strip() else 0.0
```
Failure mode: too rigid. "Yes." and "yes" fail unless you normalize. Reserve exact match for cases where the answer space is tiny and known.
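A common mitigation is to normalize before comparing. A minimal sketch (the function name is mine, not part of the harness above):

```python
import string

def normalized_match(output: str, expected: str) -> float:
    """Exact match after lowercasing and stripping whitespace and edge punctuation."""
    def norm(s: str) -> str:
        return s.strip().lower().strip(string.punctuation + " ")
    return 1.0 if norm(output) == norm(expected) else 0.0
```

Keep the strict version for label spaces where casing or punctuation is meaningful.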
Regex, JSON-schema validation, code execution, tool-call shape checks. Anything you can write as a pure function.
```python
import json
import jsonschema

def valid_json_shape(output, schema):
    try:
        jsonschema.validate(json.loads(output), schema)
        return 1.0
    except Exception:
        return 0.0
```
This is where 60% of your graders should live. If the model is producing structured output (a SQL query, a JSON object, a tool call), a deterministic grader is faster, cheaper, and more reproducible than asking another LLM to judge it.
A small model (often a fine-tuned BERT or a HuggingFace text classifier) scoring tone, toxicity, refusal patterns, or domain adherence. Useful when the property you care about is fuzzy but well-defined enough to train a classifier on.
Most teams skip this tier and jump straight to LLM-as-judge. That is fine for the first 200 cases. Adopt classifiers when you have enough labeled data and judge calls are eating your budget.
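The grader interface stays the same when you adopt this tier. A sketch of a refusal detector with a heuristic stand-in body (pattern list and function name are mine); a real version would swap the body for a fine-tuned classifier call without changing any caller:

```python
import re

# Heuristic stand-in for what a fine-tuned refusal classifier would learn.
REFUSAL_PATTERNS = [
    r"\bI can(?:not|'t) help\b",
    r"\bI'm (?:sorry|unable)\b",
    r"\bas an AI\b",
]

def refusal_score(output: str) -> float:
    """Return 1.0 if the output looks like a refusal, else 0.0.
    Replace the body with a model call later; the grader signature is stable."""
    hit = any(re.search(p, output, re.IGNORECASE) for p in REFUSAL_PATTERNS)
    return 1.0 if hit else 0.0
```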
A larger model (typically Claude Sonnet or GPT-5) scores the output against a rubric. Use it for open-ended outputs: summaries, drafts, multi-turn agent transcripts, anything you cannot check with a regex.
```python
JUDGE_PROMPT = """You are a strict grader. Score the candidate
answer 0 (wrong) or 1 (correct) against the reference.
Question: {question}
Reference: {reference}
Candidate: {candidate}
Return only the number."""

def llm_judge(question, reference, candidate, client):
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=4,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    return float(msg.content[0].text.strip())
```
Failure mode: non-deterministic, expensive, and the judge has its own biases. Calibrate it against 50 human-labeled cases before you trust it. If the judge agrees with humans less than 85% of the time, fix the rubric or fall back to a deterministic grader. For more on judge calibration and prompt discipline, see prompt engineering for senior software engineers.
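Calibration can be as simple as an agreement rate over the labeled set. A sketch (the 85% threshold is from the text above; the function names are mine):

```python
def judge_agreement(human_labels, judge_scores):
    """Fraction of cases where the judge's 0/1 score matches the human label."""
    assert len(human_labels) == len(judge_scores)
    matches = sum(1 for h, j in zip(human_labels, judge_scores) if h == j)
    return matches / len(human_labels)

def judge_is_trustworthy(human_labels, judge_scores, threshold=0.85):
    """True if the judge clears the agreement bar; otherwise fix the rubric."""
    return judge_agreement(human_labels, judge_scores) >= threshold
```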
The graders are easy. The eval set is the work. A bad eval set means a green dashboard and broken production. Three rules:
1. Pull cases from real failures, not your imagination. Open your logs. Find the last 30 outputs that produced a customer complaint, a thumbs-down, a retry, or a support ticket. Those are your starter cases. They are guaranteed to matter.
2. Balance the set. A useful suite has roughly 60% happy path, 25% edge cases (long inputs, weird unicode, ambiguous phrasing), and 15% explicit regression cases (the exact bug you fixed last sprint). If you only label happy paths, your model can ship broken and pass the eval.
3. Hand-label, period. No synthetic ground truth. No "the old model's answer is correct." A human (ideally the engineer who owns the feature) writes the reference for every case. This is where eval suites die: someone tries to bootstrap labels with a model and ends up grading the model against itself.
Store cases as JSONL, one per line. Version it in git next to the prompt. When the prompt changes, the eval set is part of the diff.
```jsonl
{"id": "001", "input": "summarize: ...", "reference": "...", "tags": ["happy"]}
{"id": "002", "input": "summarize: <empty>", "reference": "I cannot summarize empty input.", "tags": ["edge"]}
```
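A quick sanity check against the 60/25/15 split from rule 2 is a few lines (a sketch; the tag names follow the JSONL example, the function name is mine):

```python
import json
from collections import Counter

def tag_balance(jsonl_text: str) -> dict:
    """Return the fraction of cases carrying each tag, e.g. {'happy': 0.6}."""
    cases = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    counts = Counter(tag for c in cases for tag in c["tags"])
    return {tag: n / len(cases) for tag, n in counts.items()}
```

Run it over `evals/dataset.jsonl` whenever you add cases, and rebalance when the edge and regression fractions drift toward zero.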
You do not need a framework for the first 200 cases. Here is a complete harness:
```python
import json
import statistics
import time

from anthropic import Anthropic

client = Anthropic()

def run_model(prompt, system):
    r = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        system=system,
        messages=[{"role": "user", "content": prompt}],
    )
    return r.content[0].text

def grade(case, output):
    expected = case["reference"]
    if case["grader"] == "exact":
        return 1.0 if output.strip() == expected.strip() else 0.0
    if case["grader"] == "contains":
        return 1.0 if expected.lower() in output.lower() else 0.0
    if case["grader"] == "judge":
        return llm_judge(case["input"], expected, output, client)
    raise ValueError(f"unknown grader: {case['grader']}")

def main(system_prompt, dataset_path, out_path):
    with open(dataset_path) as f:
        cases = [json.loads(line) for line in f]
    results = []
    for c in cases:
        t0 = time.time()
        out = run_model(c["input"], system_prompt)
        score = grade(c, out)
        results.append({"id": c["id"], "score": score,
                        "latency_ms": int((time.time() - t0) * 1000),
                        "output": out})
    pass_rate = statistics.mean(r["score"] for r in results)
    with open(out_path, "w") as f:
        json.dump({"pass_rate": pass_rate, "results": results}, f, indent=2)
    print(f"pass_rate={pass_rate:.3f} ({len(results)} cases)")
    return pass_rate

if __name__ == "__main__":
    with open("prompts/system.md") as f:
        system_prompt = f.read()
    main(system_prompt, "evals/dataset.jsonl", "evals/latest.json")
```
Run it locally. Commit the JSON report. Diff it on PRs. That is a working eval suite. Most of the framework features you would adopt later (dashboards, parallel runs, trace storage) are nice to have, not blockers. For the model selection that feeds into this loop, Claude Opus vs Sonnet vs Haiku: when to use which is a useful reference.
Once you have 200 cases and three engineers running evals, the framework question gets real. Honest matrix:
| Tool | Strength | Weakness | Best for |
|---|---|---|---|
| Braintrust | Best UI for diffing eval runs side by side | SaaS-only, paid, vendor lock-in on traces | Funded teams who want a polished dashboard fast |
| Phoenix by Arize | Open source, strong tracing, OpenTelemetry-native | Setup heavier than Promptfoo, fewer canned graders | Teams already running OTel infra |
| LangSmith | Tight LangChain integration, dataset versioning | Awkward outside LangChain, pricing climbs fast | LangChain-native stacks |
| Promptfoo (OSS) | YAML config, ships in an hour, great CI story | Less expressive for long agent traces | Solo founders, first eval suite |
| Inspect AI (UK AISI) | Strong agent eval primitives, sandboxing | Steeper learning curve | Agent-heavy codebases, safety evals |
| Custom (the 80 lines above) | Zero lock-in, fork it in an afternoon | You own the maintenance | First 200 cases, then re-evaluate |
The honest answer for most teams shipping their first LLM feature: start with the 80-line custom harness or Promptfoo. Move to Braintrust or Phoenix only when the dashboard pain is real. Picking a heavyweight framework before you have 200 hand-labeled cases is a common way to spend three weeks configuring instead of grading. This same anti-pattern shows up in production RAG architecture work, where teams adopt a vector DB before they have ground truth to evaluate retrieval against.
The eval suite earns its keep in CI. The pattern:
1. On merge to main, run the suite and save the JSON report as the baseline.
2. On every PR that touches prompts/ or models/, run the full eval suite.
3. Compare against the baseline from main; fail the PR if pass-rate drops.

A GitHub Actions sketch:
```yaml
name: evals
on:
  pull_request:
    paths: ['prompts/**', 'evals/**', 'lib/llm/**']
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r evals/requirements.txt
      - run: python evals/run.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - run: python evals/compare.py main HEAD --max-drop=0.02
      - uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const diff = fs.readFileSync('evals/diff.md', 'utf8');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: diff
            });
```
The 2-point threshold is a starting heuristic. Loosen it if your suite is small (statistical noise dominates). Tighten it once you cross 200 cases. For longer-term robustness, pair the gate with the hallucination defenses you would deploy in production LLM apps, so the eval suite is catching what the runtime guardrails do not.
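The core of `evals/compare.py` is a few lines (a sketch of what that script might do; the file layout matches the harness output, but the function shape and CLI flags are assumptions):

```python
import json

def compare(baseline_path: str, candidate_path: str, max_drop: float) -> int:
    """Return a shell-style exit code: 0 if the candidate's pass-rate
    is within max_drop of the baseline, 1 if it regressed past it."""
    with open(baseline_path) as f:
        base = json.load(f)["pass_rate"]
    with open(candidate_path) as f:
        cand = json.load(f)["pass_rate"]
    drop = base - cand
    print(f"baseline={base:.3f} candidate={cand:.3f} drop={drop:+.3f}")
    return 1 if drop > max_drop else 0
```

A real script would also write the per-case diff to `evals/diff.md` for the PR comment step.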
Merge to main saves the JSON report to evals/baseline.json. PRs touching prompts/, models/, or lib/llm/ fail if pass-rate drops more than 2 points, with the diff posted as a PR comment. That is the whole loop. Iterate by adding cases every time a bug ships.
A focused mid-level engineer ships the first version in three to five days. A senior engineer ships it in two and adds the CI integration. Lead-tier work is for when you need an eval-as-a-service layer across ten LLM features.
On Cadence, that maps cleanly to the pricing tiers. Junior at $500 per week handles dataset cleanup and prompt-version hygiene. Mid at $1,000 per week ships the harness and the first 100 cases. Senior at $1,500 per week owns the CI integration and the LLM-as-judge calibration. Lead at $2,000 per week is for the multi-feature platform play.
Every Cadence engineer is AI-native by default, vetted on Cursor, Claude Code, and Copilot fluency through a founder-led voice interview before they unlock bookings. There is no non-AI-native option, which matters here: an eval suite is a meta-LLM workflow, and engineers who think in prompt-as-spec discipline build them faster than engineers who treat the LLM as a black box. Median time to first commit on Cadence is 27 hours from booking, so a five-day project starts on day one, not day six.
If you want help scoping the work before you book, getting a Build / Buy / Book recommendation for your next feature takes about two minutes.
Trying to ship the eval suite this sprint? Book a vetted Cadence engineer with a 48-hour free trial, weekly billing, and no notice period. Most teams get a working harness and the first 100 hand-labeled cases in five days.
Start with 20 to 50 cases drawn from real failure logs. Scale to 100 to 200 once patterns appear. Past 500, you usually have redundancy rather than new coverage. The marginal case stops teaching you anything new.
Use deterministic graders whenever the answer is checkable: exact match, regex, JSON shape, tool-call name. Reserve LLM-as-judge for open-ended outputs (summaries, drafts, agent transcripts) where rubric scoring is the only option. A healthy suite is roughly 70% deterministic and 30% judge.
Run the full suite only on PRs that touch prompts, models, or LLM glue code (gated by CI path filters). Run a 20-case smoke subset on every other PR. Cache LLM-judge calls keyed on the hash of input plus output, so a re-run on unchanged cases costs nothing.
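The judge cache can be keyed exactly as described, on a hash of input plus output. A sketch (directory name and function names are mine; a real version might use SQLite instead of one file per key):

```python
import hashlib
import json
import os

CACHE_DIR = ".judge_cache"  # assumption: a local, git-ignored directory

def cached_judge(case_input: str, output: str, judge_fn) -> float:
    """Call judge_fn only if this (input, output) pair has never been scored."""
    key = hashlib.sha256(f"{case_input}\x00{output}".encode()).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["score"]
    score = judge_fn(case_input, output)
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "w") as f:
        json.dump({"score": score}, f)
    return score
```

Because the key covers both the case input and the model output, any prompt change that alters an output produces a cache miss and a fresh judge call, while unchanged cases cost nothing.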
Not for the first 200 cases. A single Python or TypeScript file with four grader functions and a JSON report is enough. Adopt a framework once you need shared dashboards, multi-tenant traces, or eval-set governance across more than three engineers.
If it caught the last production bug before the bug shipped, it works. If the same pass-rate hides quality regressions you find in production, the eval set is too narrow; add the missing cases. The suite is a living artifact, not a one-time deliverable.