Learn
Interactive guides and best practices for AI evaluation — from fundamental concepts to advanced methodologies.
Why Use the Learn Page?
Before diving into dashboards and metrics, you need to understand what you're measuring and why. The Learn page is your onboarding companion and ongoing reference for evaluation concepts.
Foundational Concepts
Understand AI evaluation fundamentals: accuracy, consistency, safety, and alignment.
Interactive Walkthroughs
Step through real evaluation scenarios with animated data flow diagrams.
Method Comparison
Compare LLM-as-Judge, human evaluation, automated metrics, and hybrid approaches.
Best Practices
Avoid common pitfalls and build a robust, repeatable evaluation strategy.
Quick Start
The Learn page is divided into four tabs. Here's the recommended learning path:
Start with the Overview
Read through the foundational concepts. Understand what a dataset item looks like and how the evaluation pipeline flows from input to score.
Try the Interactive Walkthrough
Select a scenario (Single Turn, Multi-Turn, RAG, etc.) and step through the evaluation flow. Watch data transform at each stage of the pipeline.
Study the Methods
Compare the four evaluation approaches. Understand the tradeoffs between cost, quality, and scalability for your use case.
Review Best Practices
Internalize the do's and don'ts, learn about common pitfalls like position bias and overfitting to benchmarks, and build a sustainable evaluation cadence.
Page Anatomy
Here's how the Learn page is organized. Four tabs provide progressively deeper content:
[Screenshot: the Learn page header ("Learn — Interactive guides and best practices for AI evaluation"), the "Welcome to AI Evaluation" intro card, and the first concept card, "What is AI Evaluation?"]
Overview Tab
The default landing tab. It introduces three fundamental concepts every user needs before starting their first evaluation.
What is AI Evaluation?
The first concept card explains the four dimensions of evaluation:
- Accuracy — Does the model produce correct outputs?
- Consistency — Are outputs reliable across similar inputs?
- Safety — Does the model avoid harmful outputs?
- Alignment — Does the model behave as intended?
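The four dimensions above can be turned into a simple numeric rubric. A minimal sketch, assuming an illustrative 1-5 scale and an averaging helper (neither is part of the platform's actual scoring):

```python
from statistics import mean

# The four evaluation dimensions described above; the 1-5 scale is illustrative.
DIMENSIONS = ("accuracy", "consistency", "safety", "alignment")

def overall_score(scores: dict[str, int]) -> float:
    """Average per-dimension scores, requiring every dimension to be present."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return mean(scores[d] for d in DIMENSIONS)

print(overall_score({"accuracy": 4, "consistency": 5, "safety": 5, "alignment": 4}))  # 4.5
```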
Dataset Item Model
The second card shows the anatomy of a single evaluation record. Understanding this structure is essential for formatting your upload data correctly.
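As a hedged illustration of that structure (the field names here are assumptions, not the platform's actual schema — check the card and the upload format reference), a single record typically pairs an input with an optional expected output and metadata:

```python
import json

# Illustrative dataset item — field names are an assumption, not the
# platform's actual upload schema.
item = {
    "id": "item-001",
    "input": "What is the capital of France?",
    "expected_output": "Paris",  # optional reference answer
    "metadata": {"category": "geography", "difficulty": "easy"},
}

print(json.dumps(item, indent=2))
```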
The Evaluation Pipeline
The third concept card visualizes the end-to-end flow:
- Prepare Test Data — Structure your queries and expected outputs into a dataset
- Generate Responses — Run your AI model against the test queries
- Evaluate Quality — Use judges (LLM, human, or automated) to score responses
- Analyze Results — Review metrics, identify patterns, and iterate on improvements
Interactive Walkthrough Tab
A step-by-step guided experience that shows how evaluation data flows through the pipeline. Choose a scenario and watch each stage unfold.
Scenario Selection
Five evaluation scenarios are available, each highlighting different aspects of the evaluation process:
Choose an Evaluation Scenario
Select a scenario to see how data flows through the evaluation pipeline.
Key interactions:
- Scenario tabs — Click any scenario card to switch the walkthrough content. The active scenario has a purple border and tinted background.
- Step navigation — Use the Prev/Next buttons or the step indicator to move through pipeline stages
- Flow diagram — The active stage is highlighted with a colored border; future stages are grayed out
- Example data — Each step shows real data examples relevant to that stage (input, model output, judge reasoning, final scores)
| Scenario | Steps | Focus |
|---|---|---|
| Single Turn | 4 | Basic query → response → judge → score |
| With Expected Output | 5 | Adds reference comparison (expected vs actual) |
| Multi-Turn | 5 | Conversation context and message history |
| RAG Evaluation | 6 | Retrieval context, faithfulness, relevance |
| Full Workflow | 7 | Complete pipeline from data prep to analysis |
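As a hedged sketch of how the richer scenarios extend the basic record shape, a multi-turn item carries message history while a RAG item adds retrieved context (field names are illustrative, not the platform's schema):

```python
# Illustrative shapes only — the real scenario schemas may differ.
multi_turn_item = {
    "id": "mt-001",
    "messages": [
        {"role": "user", "content": "Book a table for two."},
        {"role": "assistant", "content": "For which date?"},
        {"role": "user", "content": "Friday at 7pm."},
    ],
}

rag_item = {
    "id": "rag-001",
    "input": "What is our refund window?",
    "retrieved_contexts": [
        "Refunds are accepted within 30 days of purchase.",
    ],
    # RAG judges typically score faithfulness (answer grounded in the
    # contexts) and relevance (contexts match the query) separately.
}

print(len(multi_turn_item["messages"]), len(rag_item["retrieved_contexts"]))
```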
Methods Tab
A comprehensive reference for the four primary evaluation approaches. Each method is presented as an expandable card with pros, cons, and ideal use cases.
Method Cards
Four evaluation methods are presented with expandable detail panels:
Evaluation Methods
There are several approaches to evaluating AI systems, each with different tradeoffs.
LLM-as-Judge
A language model scores outputs against your evaluation criteria. Scalable and explainable, though it needs regular calibration.
Human Evaluation
Expert human annotators review and score AI outputs; the gold standard for nuance.
Automated Metrics
BLEU, ROUGE, exact match, or embedding similarity for programmatic evaluation.
Hybrid Approach
Combine multiple evaluation methods for balanced cost, quality, and scale.
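The automated-metrics card mentions exact match and overlap metrics; a minimal sketch of two of them (real BLEU/ROUGE implementations are more involved — libraries such as `sacrebleu` and `rouge-score` provide them):

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return prediction.strip().lower() == reference.strip().lower()

def token_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 over unique tokens, a rough stand-in for ROUGE-1."""
    pred, ref = set(prediction.lower().split()), set(reference.lower().split())
    common = len(pred & ref)
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", " paris "))                                      # True
print(round(token_f1("the capital is Paris", "Paris is the capital"), 2))   # 1.0
```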
Comparison Table
Below the method cards, a comparison matrix lets you evaluate methods side-by-side across six key attributes:
| Attribute | LLM Judge | Human | Automated | Hybrid |
|---|---|---|---|---|
| Scalable to large datasets | ✓ | ✗ | ✓ | ~ |
| Captures nuance | ✓ | ✓ | ✗ | ✓ |
| Consistent results | ~ | ✗ | ✓ | ~ |
| Low cost per evaluation | ~ | ✗ | ✓ | ~ |
| No reference needed | ✓ | ✓ | ✗ | ✓ |
| Explainable scores | ✓ | ✓ | ✗ | ✓ |
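One common hybrid pattern implied by the table is to run a cheap automated check first and escalate only ambiguous cases to an LLM judge. A sketch under that assumption — the thresholds and the `llm_judge` stub are illustrative, not platform behavior:

```python
def automated_score(prediction: str, reference: str) -> float:
    # Cheap unigram-recall check (stand-in for a real automated metric).
    pred, ref = set(prediction.lower().split()), set(reference.lower().split())
    return len(pred & ref) / max(len(ref), 1)

def llm_judge(prediction: str, reference: str) -> float:
    # Placeholder for an LLM-as-Judge call; replace with a real API request.
    return 0.5

def hybrid_score(prediction: str, reference: str,
                 low: float = 0.2, high: float = 0.8) -> tuple[float, str]:
    """Accept clear passes/fails automatically; escalate the middle band."""
    score = automated_score(prediction, reference)
    if score >= high:
        return score, "automated-pass"
    if score <= low:
        return score, "automated-fail"
    return llm_judge(prediction, reference), "llm-judge"

print(hybrid_score("Paris", "Paris"))  # (1.0, 'automated-pass')
```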
Best Practices Tab
Actionable guidance organized into do's/don'ts, collapsible deep-dive sections, and a common pitfalls gallery.
Do's and Don'ts Panel
A quick-reference panel at the top splits into two color-coded columns:
✓ Do
- Use diverse test cases covering edge cases
- Document evaluation criteria clearly
- Calibrate LLM judges regularly
- Track metrics over time for trends
- Include positive and negative examples
- Version datasets alongside models
✗ Don't
- Rely solely on automated metrics
- Use the same data for training and eval
- Ignore edge cases in favor of common ones
- Assume LLM judges are always correct
- Evaluate only at release time
- Discard results without analysis
Detailed Best Practices
Below the quick reference, five collapsible sections provide in-depth guidance:
| Section | Icon | Content |
|---|---|---|
| Define Clear Criteria | 🎯 | Good vs. vague criteria examples, measurability tips |
| Use Representative Data | 📦 | Data coverage checklist (8 items), sampling strategies |
| Calibrate Judges Regularly | ⚖ | 5-step calibration process, agreement metrics (Cohen's Kappa) |
| Iterate Continuously | 🔄 | Weekly/monthly/quarterly cadence recommendations |
| Document Everything | 📝 | Five documentation categories with checklists |
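The calibration section cites Cohen's Kappa as an agreement metric. A small sketch of the standard computation for two raters (for production use, `sklearn.metrics.cohen_kappa_score` covers this):

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: observed agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[l] * counts_b[l] for l in set(rater_a) | set(rater_b)) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["pass", "pass", "fail", "pass", "fail", "fail"]
b = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(a, b), 3))  # 0.333
```

A kappa of 1.0 means perfect agreement; values near 0 mean agreement no better than chance, a signal that the judge needs recalibration.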
Common Pitfalls
A gallery of four pitfall cards, each showing the mistake, its consequence, and the recommended solution:
⚠ Overfitting to Benchmarks
⚠ Position Bias
⚠ Verbosity Bias
⚠ Self-Preference Bias
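Position bias — a judge favoring whichever answer appears first in a pairwise comparison — is commonly mitigated by judging each pair in both orders and keeping only consistent verdicts. A sketch of that mitigation, where `judge_pair` is a hypothetical stand-in for your LLM judge call:

```python
def judge_pair(answer_1: str, answer_2: str) -> str:
    # Hypothetical LLM-judge stub returning "1" or "2" for the preferred answer.
    # This toy version simply prefers the longer answer.
    return "1" if len(answer_1) >= len(answer_2) else "2"

def debiased_compare(a: str, b: str) -> str:
    """Judge (a, b) and (b, a); return 'a'/'b' only if both orders agree, else 'tie'."""
    first = judge_pair(a, b)    # a shown first
    second = judge_pair(b, a)   # b shown first
    if first == "1" and second == "2":
        return "a"
    if first == "2" and second == "1":
        return "b"
    return "tie"                # inconsistent verdict: treat as position-biased

print(debiased_compare("a detailed answer", "short"))  # a
```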
Call to Action
At the bottom of every tab, a gradient call-to-action card encourages the user to take the next step. On the Overview tab, it links to the Interactive Walkthrough. On the Best Practices tab, it summarizes key takeaways with a shield icon.