Agent Evaluation Playbook

This playbook is a practical guide to evaluating AI agents without the ceremony. The goal isn't "more dashboards" — it's making the agent better in the real world, on the things your users actually do.

Philosophy

Agents are sports cars.
Evals are the seatbelt.

Seatbelts don't steer. They don't tune the engine. They don't make you "better." They just keep you from eating the steering wheel when something goes wrong. If your stance is "we don't need evals because we're careful," you're confusing confidence with safety.

Mindset

This is just data science.

Same fundamentals as traditional ML — dataset, labels, error analysis, iteration. If you're paying $100/seat for an eval tool and still not looking at your data, you're doing it wrong. Expensive tools don't solve the problem — they just scale whatever process you already have.

1. Evals are adult supervision

Not rocket science. Simple checks that prevent simple disasters. Start here, stay here, expand only when the basics are solid.

2. Define "good" in one sentence

If you can't describe success, no framework will save you. Clarity first, tooling second.

3. Start with pass/fail

Add nuance only after you've earned it. A binary gate catches more failures than a sophisticated rubric you never run.

4. Calibrate your judges

LLM judges are useful. They're also liars with confidence. Calibrate them against humans or don't pretend you measured anything.


The Problem with Generic Evaluation

Evaluating AI agents isn't a traditional machine learning task. It's a context-driven problem that demands more than generic metrics or off-the-shelf tools. Generative agents are unpredictable and often highly domain-specific, so success requires a tailored approach rooted in:

  • Deep domain expertise
  • Context-aware error analysis
  • Metrics tied to real business outcomes

There's no silver bullet. One-size-fits-all evaluation leads to mediocrity. The real question is: Is this agent improving at the job it was designed to do?

A customer support agent and an onboarding agent may share similarities, but grading them with the same rubric is like judging a painter and a sculptor by identical standards.

Stop confusing activity with progress

Shipping a new dashboard is not the same as shipping reliability. If you can't point to specific failure modes that got better, you didn't "improve the agent" — you just looked at it differently.


Four Common Mistakes

Before diving into the process, avoid these common pitfalls:

1. Not Looking at Your Data

The most common and most damaging mistake. Without reviewing raw examples, you're not truly evaluating.

2. Frameworks Before Fundamentals

If your first question is "What framework?", you're starting in the wrong place.

3. Overreliance on Generic Evals

Generic metrics are not a shortcut. If your evals aren't grounded in known errors, you're wasting time.

4. Misusing LLM-as-a-Judge

LLMs can assist evaluation, but relying on them without human validation is a recipe for false confidence.

1. Not Looking at Your Data

The most common—and most damaging—mistake is neglecting to examine your data. Despite being fundamental, this step is often skipped.

Effective evaluation begins with frictionless, high-quality data exploration and labeling. Without a deep understanding of your agent's failures, any metrics or conclusions you draw are on shaky ground.

If you haven't spent time reviewing raw examples, you're not truly evaluating.

If this feels "manual," good

Start manually. Use spreadsheets. Label failures. If you can't do it by hand for 30 examples, you're not ready to automate it for 30,000.

2. Frameworks Before Fundamentals

Jumping to tools or frameworks before establishing a process is a red flag.

If your first question is "What framework should I use?", you're starting in the wrong place. Evaluation is not a plug-and-play task. You must define what success looks like for your agent. Frameworks are secondary—they should support your workflow, not define it.

3. Overreliance on Generic Evaluations

Generic metrics are not a shortcut—they're a risk.

If your evals aren't grounded in known errors, you're wasting time. Off-the-shelf metrics (e.g., hallucination, toxicity, relevancy) are easy to reach for, but dangerous when misused. Agents are purpose-built; your evaluations should be too.

Generic metrics may be useful once you understand your agent's unique failure modes, but used prematurely they become a crutch that masks real issues.

4. Misusing LLM-as-a-Judge

Large language models can assist in evaluation, but they must be used responsibly.

Relying on LLMs without first validating them against domain expert feedback is a recipe for false confidence. The correct approach:

  1. Build a lightweight review pipeline
  2. Have experts label a few dozen examples
  3. Align your LLM judge to match that standard

Until you achieve strong agreement between human and model judgment, LLM outputs should be treated as advisory—not definitive.
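
To make "strong agreement" measurable, here is a minimal calibration sketch, assuming you have paired pass/fail labels from experts and from your LLM judge on the same examples. The labels and function are illustrative, not a library API:

```python
# Minimal judge-calibration check: raw agreement plus Cohen's kappa
# between expert labels and LLM-judge labels on the same examples.
# All labels below are illustrative placeholders.

def agreement_and_kappa(expert: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two binary label sets."""
    n = len(expert)
    observed = sum(e == j for e, j in zip(expert, judge)) / n
    # Chance agreement: both say pass, plus both say fail.
    p_exp, p_jud = sum(expert) / n, sum(judge) / n
    chance = p_exp * p_jud + (1 - p_exp) * (1 - p_jud)
    kappa = (observed - chance) / (1 - chance) if chance < 1 else 1.0
    return observed, kappa

expert_labels = [True, True, False, True, False, False, True, False]
judge_labels  = [True, False, False, True, False, True, True, False]

agreement, kappa = agreement_and_kappa(expert_labels, judge_labels)
print(f"agreement={agreement:.2f}, kappa={kappa:.2f}")  # 0.75, 0.50 here
# Keep the judge advisory until kappa is comfortably high on a few
# dozen expert-labeled examples.
```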

Don't confuse "a score" with "truth"

LLM judges are great at sounding right. That's the problem. If you didn't calibrate against humans, you're just grading your own homework with the same model class.


The Analyze-Measure-Improve Methodology

The AMI cycle isn't new—it's long been used in traditional ML. But with generative agents, the game has changed. We've moved from predictable single-token outputs to freeform systems with far greater complexity.

If you approach evaluation as a simple side-by-side comparison or a table of results, you'll miss the deeper nuances in your data. True progress requires going beyond surface-level metrics to uncover root causes and meaningful insights.

This is just data science (with a different mindset)

The best agent eval workflows look suspiciously like traditional ML evaluation, because they are.

Data scientists have been doing this for years:

  • Curate a representative dataset
  • Label outcomes (often binary)
  • Do error analysis and slice breakdowns
  • Set thresholds, ship changes, measure regressions
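
As a minimal sketch of that loop, assuming each labeled record carries an intent tag and a binary outcome (both field names are illustrative):

```python
# Binary outcomes plus a per-slice breakdown, the core of error
# analysis. Records are placeholders for your labeled eval data.
from collections import defaultdict

records = [
    {"intent": "billing",    "passed": True},
    {"intent": "billing",    "passed": False},
    {"intent": "onboarding", "passed": False},
    {"intent": "onboarding", "passed": False},
    {"intent": "search",     "passed": True},
]

by_slice: dict[str, list[bool]] = defaultdict(list)
for record in records:
    by_slice[record["intent"]].append(record["passed"])

for intent, outcomes in sorted(by_slice.items()):
    rate = sum(outcomes) / len(outcomes)
    print(f"{intent:<12} pass rate {rate:.0%} ({len(outcomes)} examples)")
```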

What's "new" with agents is mostly the mindset shift:

  • Outputs are messier and more subjective
  • Failures are multi-causal (prompt, retrieval, tools, policy)
  • You need to write down "good" explicitly, or your metrics will quietly drift into vibes

```mermaid
graph LR
    A["ANALYZE"] -->|"Map what's broken"| M["MEASURE"]
    M -->|"Quantify impact"| I["IMPROVE"]
    I -->|"Validate & iterate"| A
```

Most AI agent problems aren't solved with a single tweak—they're solved with a disciplined, repeatable process. The AMI lifecycle replaces one-off troubleshooting with a data-driven, continuous improvement loop that:

  • Cuts failures
  • Speeds time to value
  • Manages risk
  • Drives measurable gains

Phase 1: ANALYZE

Outcome: A clear map of what's broken and why.

  1. Collect representative agent interactions from real-world or test environments
  2. Spot and categorize failure patterns (e.g., misunderstood prompts, inconsistent answers)
  3. Trace root causes: Specification Issues (prompt ambiguity) vs. Generalization Issues (context mismatch)
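
One lightweight way to record those categories, sketched with hypothetical field names but keeping the two root-cause buckets from step 3:

```python
# A labeled failure case with the two root-cause buckets from the
# ANALYZE phase. Field names are assumptions for illustration.
from dataclasses import dataclass
from enum import Enum

class RootCause(Enum):
    SPECIFICATION = "specification"    # e.g., prompt ambiguity
    GENERALIZATION = "generalization"  # e.g., context mismatch

@dataclass
class FailureCase:
    transcript: str   # the raw interaction, verbatim
    pattern: str      # e.g., "misunderstood prompt"
    root_cause: RootCause

case = FailureCase(
    transcript="User: how do I export? Agent: describes the wrong product",
    pattern="inconsistent answer",
    root_cause=RootCause.GENERALIZATION,
)
```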

Phase 2: MEASURE

Outcome: Solid baselines and clear priorities grounded in data.

  1. Build evaluators targeted at each failure type
  2. Quantify how often and how badly issues occur across scenarios
  3. Prioritize fixes by business impact, surfaced in dashboards everyone can trust
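
A sketch of steps 1 and 2 under stated assumptions: one small, targeted check per failure type from the ANALYZE phase, run over sample answers to establish baseline failure frequencies. The checks and answers are illustrative, not a framework API.

```python
# One tiny evaluator per known failure type, run over agent answers
# to quantify how often each issue occurs.
from typing import Callable

def mentions_permissions(answer: str) -> bool:
    """Guards against the 'missing permission checks' failure mode."""
    return "permission" in answer.lower()

def gives_concrete_steps(answer: str) -> bool:
    """Guards against the 'generic documentation links' failure mode."""
    return "step" in answer.lower() or ">" in answer

evaluators: dict[str, Callable[[str], bool]] = {
    "missing_permission_checks": mentions_permissions,
    "generic_documentation_links": gives_concrete_steps,
}

answers = [
    "First check your permissions, then go to Setup > Fields.",
    "See the documentation for details.",
]
for failure_type, passes in evaluators.items():
    fail_count = sum(not passes(a) for a in answers)
    print(f"{failure_type}: {fail_count}/{len(answers)} answers failing")
```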

Phase 3: IMPROVE

Outcome: Fixes that address root causes—not just symptoms.

  1. Tighten prompts, definitions, and examples to fix Specification Issues
  2. Enrich data and refine model/prompt pipelines for Generalization Issues
  3. Roll out changes in controlled environments and validate improvements with metrics
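
Step 3 can be as simple as comparing per-slice pass rates before and after a change and flagging regressions; the numbers and threshold below are illustrative.

```python
# Controlled-rollout validation: compare per-slice pass rates for the
# baseline and the candidate, and flag anything that got worse.
baseline  = {"billing": 0.62, "onboarding": 0.45, "search": 0.80}
candidate = {"billing": 0.75, "onboarding": 0.58, "search": 0.76}

TOLERANCE = 0.02  # allowed noise before we call it a regression

for slice_name, before in baseline.items():
    after = candidate[slice_name]
    delta = after - before
    status = "REGRESSION" if delta < -TOLERANCE else "ok"
    print(f"{slice_name:<12} {before:.0%} -> {after:.0%} ({delta:+.0%}) {status}")
```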

And do it again, and again, and again. Once you can run this loop on purpose, then (and only then) scale it with tooling. Don't stop looking at the data just because you hired a dashboard.


Dataset Best Practices

A strong evaluation starts with the dataset. Your dataset should be grounded in your agent's errors.

Diversity Matters

  • Comprehensive Testing: Covers a wide range of situations so the AI is evaluated fairly
  • Realistic Interactions: Reflects actual user behavior for relevant evaluations
  • Weakness Discovery: Surfaces areas where the AI struggles or produces errors

How Many Examples Do You Need?

Start with ~30 examples, and keep adding until additional examples stop surfacing new failure modes or insights.
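
One way to operationalize that stopping rule, as a sketch: track which failure modes each new batch of examples surfaces and stop when a batch adds nothing new. The batch contents are illustrative.

```python
# Saturation check: grow the dataset batch by batch and stop when a
# batch surfaces no failure modes you haven't already seen.
seen: set[str] = set()
batches = [
    {"wrong_link", "missed_permission"},
    {"missed_permission", "ignored_context"},
    {"ignored_context"},  # nothing new -> coverage is saturating
]

for i, batch in enumerate(batches, start=1):
    new_modes = batch - seen
    seen |= batch
    print(f"batch {i}: {len(new_modes)} new failure mode(s)")
    if not new_modes:
        print("stop: additional examples are no longer revealing insights")
        break
```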

Treat your dataset like a crash report backlog

Every new failure case is a free bug report. Collect them. Name them. Fix them. If you can't describe the failure in plain English, you can't evaluate it.

Coverage Dimensions

Your golden dataset must be a representative sample of the overall population of queries your end users actually ask:

| Dimension | What to Cover |
| --- | --- |
| Intent Coverage | All major intents + common "out-of-scope" questions |
| Utterance Variation | Formal, casual, slang, misspellings, abbreviations, verbose |
| Complexity | Multi-turn dialogues, corrections, partial information |
| Sentiment & Tone | Frustrated, urgent, confused, polite interactions |
| Edge Cases | Rare but valid queries, known failure modes |

Dataset Lifecycle

| Phase | Focus |
| --- | --- |
| Formation | Curate high-quality real-world utterances with expert validation |
| Maintenance | Continuous review cycles with clear governance |
| Expansion | Controlled generation and targeted mining of edge cases |

Defining Evaluation Criteria

Binary pass/fail judgments are important, but often not enough. Each judgment must be paired with a detailed critique.

Why Critiques Matter

  • Capture Nuances: Note if something was mostly correct but had areas for improvement
  • Guide Improvement: Provide specific insights into how the AI can be enhanced
  • Balance Simplicity with Depth: Pass/fail offers a clear verdict; critique offers reasoning

Example: Help Agent Evaluation

| Interaction | Judgment | Critique |
| --- | --- | --- |
| User asks how to create a custom field. Agent provides accurate step-by-step navigation. | Pass | Technically correct and actionable. However, missed opportunities to check permissions first or ask about field type requirements. Met the primary need, so it passes, with enhancement opportunities noted. |
| User asks about lead scoring. Agent says "Contact your administrator." | Fail | Made assumptions without checking org capabilities. Failed to explore alternatives or explain what lead scoring entails. Dismissive rather than helpful. Fails due to poor investigation and lack of solutions. |

The critique should be detailed enough to use in a few-shot prompt for an LLM judge.
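
A minimal record shape for that pairing, with assumed field names, so each judged example can be dropped straight into a judge prompt later:

```python
# A judged example pairing the binary verdict with a critique detailed
# enough to reuse as a few-shot example for an LLM judge.
from dataclasses import dataclass

@dataclass
class JudgedExample:
    interaction: str
    passed: bool
    critique: str

example = JudgedExample(
    interaction="User asks how to create a custom field; agent gives "
                "accurate step-by-step navigation.",
    passed=True,
    critique="Technically correct and actionable, but missed checking "
             "permissions first and asking about field type requirements.",
)
```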

Pass/fail is the seatbelt latch

Pass/fail gives you clarity and velocity. Critiques tell you what to change so tomorrow's run is better than today's.


Error Classification

Once you know where the errors are, perform an error analysis to identify root causes. The most effective approach is manually classifying examples using a spreadsheet.

Example Classification

| Root Cause | Count | Percentage |
| --- | --- | --- |
| Insufficient Workflow Guidance | 12 | 40% |
| Ignoring User Context | 8 | 27% |
| Generic Documentation Links | 6 | 20% |
| Missing Permission Checks | 4 | 13% |

Now you know where to focus your efforts. This classification helps identify whether the problem lies in:

  • Prompt engineering
  • Knowledge retrieval
  • Contextual reasoning
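
The spreadsheet math is trivial to reproduce in code; this sketch regenerates the table above from a flat list of manually assigned labels.

```python
# Root-cause tallies and percentages from manually classified examples.
from collections import Counter

labels = (["Insufficient Workflow Guidance"] * 12
          + ["Ignoring User Context"] * 8
          + ["Generic Documentation Links"] * 6
          + ["Missing Permission Checks"] * 4)

counts = Counter(labels)
total = sum(counts.values())
for cause, n in counts.most_common():
    print(f"{cause:<32} {n:>3}  {n / total:.0%}")
```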

Metrics Philosophy

Metrics Without Actionable Meaning Are Worthless

Observability and evaluation are not the same thing. Observability tells you what happened; evaluation is about making agents better—an iterative loop of analyzing, measuring, and improving.

Why Likert Scales Fail

| Problem | Explanation |
| --- | --- |
| Numbers Aren't Actionable | A "3" doesn't tell you how to improve |
| Subjectivity | Evaluators disagree on what qualifies as "minor inaccuracies" vs. "some factual errors" |
| Metrics Rarely Reflect Expert Judgment | Domain experts make binary calls: approve vs. reject |

The Better Approach

Binary outcomes paired with clear critiques surface what actually matters and provide a direct path for improvement.

Rule of Thumb: If you can't clearly explain what a score of 0.8 means—or what action it implies—it's not a useful metric.


Metric Selection Guide

The most common failure mode is "Metric Spaghetti"—throwing every metric at every test case. This burns tokens, increases latency, and creates noise.

Pick the few that matter

Start with 2–5 metrics that directly map to real failures. If you can't explain why a metric exists in one sentence, delete it.

Selection by Dimension

Retrieval & Context (Did we find the right data?)

| Metric | Type | Use When |
| --- | --- | --- |
| HitRateAtK | Math | "Did the right doc appear in the top 5?" (binary) |
| MeanReciprocalRank | Math | You care about position |
| ContextualRelevancy | LLM | No ground-truth ranking; ask "Is this chunk garbage?" |
| ContextualSufficiency | LLM | Check whether retrieved chunks actually contain the answer |
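
The two math metrics are simple enough to compute directly. A sketch, assuming you record the 1-based rank of the first relevant document per query (None when nothing relevant was retrieved):

```python
# Hit rate@k and mean reciprocal rank over retrieval results.
# Input: per-query rank of the first relevant doc; data is illustrative.

def hit_rate_at_k(first_relevant_ranks: list[int | None], k: int) -> float:
    hits = sum(1 for r in first_relevant_ranks if r is not None and r <= k)
    return hits / len(first_relevant_ranks)

def mean_reciprocal_rank(first_relevant_ranks: list[int | None]) -> float:
    return sum(1 / r for r in first_relevant_ranks if r is not None) / len(first_relevant_ranks)

ranks = [1, 3, None, 2, 8]          # one entry per query
print(hit_rate_at_k(ranks, k=5))    # 0.6 -> binary "did it show up in top 5"
print(mean_reciprocal_rank(ranks))  # rewards relevant docs near the top
```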

Response Quality (Is the answer correct?)

| Metric | Type | Use When |
| --- | --- | --- |
| FactualAccuracy | LLM | You have a Golden Answer |
| Faithfulness | LLM | RAG essential; checks hallucination |
| AnswerCompleteness | LLM | User asked 3 sub-questions; did the agent answer all 3? |
| AnswerCriteria | LLM | You have business rules to verify |

Agent Behavior (Is the interaction working?)

| Metric | Type | Use When |
| --- | --- | --- |
| AnswerRelevancy | LLM | "Evasion detector"; penalizes off-topic answers |
| ConversationEfficiency | LLM | Penalizes loops (agent asks for the same info twice) |
| GoalCompletion | LLM | Did the user achieve what they came for? |

Style & Trust

| Metric | Type | Use When |
| --- | --- | --- |
| ToneStyleConsistency | LLM | Does it match the writing style of a reference? |
| CitationPresence | Heuristic | Compliance; checks that citations exist |
| CitationRelevancy | LLM | Do citations actually support the claim? |

LLM-as-a-Judge Best Practices

The Anti-Pattern

```
You are an evaluation expert. Judge the correctness based on:
1. Accuracy & Factual Correctness
2. Relevance & Completeness
3. Clarity & Coherence

Give me a score between 1 and 10 and a reason.
```

Why this fails:

  • What's the difference between a 7 and an 8?
  • Six dimensions collapsed into one number
  • Score before reasoning (LLMs reason token-by-token)
  • No examples to ground the model

The Better Pattern

Separation of Concerns: Split into two steps.

  1. Extraction: Parse what was said (no judgment yet)
  2. Verification: Boolean logic—"Does X exist in Y?"

Atomic Statements: Break compound sentences into individual facts.

```python
# Instead of scoring "The product has feature A and feature B" as one
# compound claim, break it into atomic statements and judge each alone.
statements = [
    "The product has feature A.",  # Judge: 1 (True)
    "The product has feature B.",  # Judge: 0 (False)
]
# Result: 50% accuracy (mathematical, not vibes)
```

Chain of Thought: Force analysis before the score.

Few-Shot Examples: Always provide examples of good and bad responses with correct scores.
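
Putting those four ideas together, here is a sketch of a two-step judge prompt; the template wording is an assumption, not a library API:

```python
# Step 1 extracts atomic statements without judging; step 2 forces
# quoted evidence and reasoning before a binary verdict, with one
# few-shot example to ground the format.
EXTRACT_PROMPT = """List every factual claim in the answer below as a
numbered list of atomic statements. Do not judge them yet.

Answer: {answer}"""

VERIFY_PROMPT = """For each statement, quote the supporting passage from
the context (or write NONE), explain your reasoning in one sentence,
and only then give a verdict: 1 (supported) or 0 (unsupported).

Example:
Statement: "Refunds take 5 business days."
Support: "Refunds are processed within 5 business days."
Reasoning: The context states the same timeframe directly.
Verdict: 1

Context: {context}
Statements: {statements}"""
```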

Judge Model Selection

The teacher must be smarter than the student.

You cannot evaluate a GPT-4o agent using a GPT-3.5-Turbo judge. The judge must have reasoning capabilities equal to or greater than the agent being tested.

Key principle: "Cheap on generation, expensive on evaluation."


Summary

The goal is not to automate evaluation away — it's to build resilient, domain-aware evaluation systems that measure what truly matters and drive meaningful, measurable improvement. Start simple. Look at your data. Define "good." Then iterate.
