Learn

Interactive guides and best practices for AI evaluation — from fundamental concepts to advanced methodologies.

Why Use the Learn Page?

Before diving into dashboards and metrics, you need to understand what you're measuring and why. The Learn page is your onboarding companion and ongoing reference for evaluation concepts.

📖 Foundational Concepts

Understand AI evaluation fundamentals: accuracy, consistency, safety, and alignment.

🎯 Interactive Walkthroughs

Step through real evaluation scenarios with animated data flow diagrams.

⚖️ Method Comparison

Compare LLM-as-Judge, human evaluation, automated metrics, and hybrid approaches.

🛡 Best Practices

Avoid common pitfalls and build a robust, repeatable evaluation strategy.

Quick Start

The Learn page is divided into four tabs. Here's the recommended learning path:

1. Start with the Overview

Read through the foundational concepts. Understand what a dataset item looks like and how the evaluation pipeline flows from input to score.

2. Try the Interactive Walkthrough

Select a scenario (Single Turn, Multi-Turn, RAG, etc.) and step through the evaluation flow. Watch data transform at each stage of the pipeline.

3. Study the Methods

Compare the four evaluation approaches. Understand the tradeoffs between cost, quality, and scalability for your use case.

4. Review Best Practices

Internalize the do's and don'ts, learn about common pitfalls like position bias and overfitting to benchmarks, and build a sustainable evaluation cadence.

💡 Tip
New to AXIS? Start with the Overview tab, then jump straight to the Evaluate page to upload your first dataset. Come back to Methods and Best Practices as you refine your evaluation strategy.

Page Anatomy

Here's how the Learn page is organized. Four tabs provide progressively deeper content:

[Screenshot: localhost:3500/learn — the Learn page showing the tab navigation bar, hero section, quick tips row, and concept cards in the Overview tab. Numbered callouts 1–5 mark the regions described below.]
1. Page Header — Shows the page title, icon, and subtitle. Consistent across all AXIS pages.
2. Tab Navigation — Four tabs: Overview, Interactive Walkthrough, Methods, and Best Practices. The active tab is highlighted with the primary color.
3. Hero Section — Gradient card introducing the current tab's topic with a decorative icon.
4. Quick Tips Row — Color-coded Do/Don't/Tip cards for at-a-glance best practices.
5. Concept Cards — Expandable content cards covering evaluation fundamentals, data structure, and the pipeline flow.

Overview Tab

The default landing tab. It introduces three fundamental concepts every user needs before starting their first evaluation.

What is AI Evaluation?

The first concept card explains the four dimensions of evaluation:

  • Accuracy — Does the model produce correct outputs?
  • Consistency — Are outputs reliable across similar inputs?
  • Safety — Does the model avoid harmful outputs?
  • Alignment — Does the model behave as intended?

Dataset Item Model

The second card shows the anatomy of a single evaluation record. Understanding this structure is essential for formatting your upload data correctly.

Dataset Item Structure:

  • Input — query, context, system_prompt
  • Output — actual_output, expected_output
  • Scores — metric_name, metric_score (0–1)
  • Metadata — dataset_id, timestamp, model, env

The dataset item model showing the four data groups: input fields, output fields, score fields, and metadata.
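The structure above can be sketched as a single record. This is an illustrative example only — the field names follow the diagram, but AXIS's actual upload schema may differ, and all values here are invented:

```python
# A hypothetical dataset item mirroring the four field groups above.
item = {
    # Input
    "query": "What are the return policies for electronics purchased online?",
    "context": "Electronics may be returned within 30 days with a receipt.",
    "system_prompt": "You are a helpful support assistant.",
    # Output
    "actual_output": "You can return electronics within 30 days of purchase.",
    "expected_output": "Electronics are returnable within 30 days with a receipt.",
    # Scores (0-1 range)
    "metric_name": "answer_correctness",
    "metric_score": 0.85,
    # Metadata
    "dataset_id": "ds_001",
    "timestamp": "2025-01-15T10:30:00Z",
    "model": "gpt-4o",
    "env": "staging",
}

assert 0.0 <= item["metric_score"] <= 1.0
```

Keeping all four groups in one flat record makes each evaluation row self-describing, which simplifies filtering and aggregation later.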

The Evaluation Pipeline

The third concept card visualizes the end-to-end flow:

  1. Prepare Test Data — Structure your queries and expected outputs into a dataset
  2. Generate Responses — Run your AI model against the test queries
  3. Evaluate Quality — Use judges (LLM, human, or automated) to score responses
  4. Analyze Results — Review metrics, identify patterns, and iterate on improvements
ℹ️ Info
At the bottom of the Overview tab, a call-to-action card invites you to try the Interactive Walkthrough to see this pipeline in action with real data examples.
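The four pipeline stages can be sketched end-to-end in a few lines. This is a minimal illustration, not an AXIS API: `run_pipeline`, `toy_model`, and `exact_judge` are invented names standing in for your model and scoring logic:

```python
# A minimal sketch of the four-stage evaluation pipeline.
def run_pipeline(dataset, model, judge):
    results = []
    for item in dataset:
        item = dict(item)
        # Stage 2: Generate Responses
        item["actual_output"] = model(item["query"])
        # Stage 3: Evaluate Quality
        item["metric_score"] = judge(item)
        results.append(item)
    # Stage 4: Analyze Results (aggregate a summary metric)
    mean_score = sum(r["metric_score"] for r in results) / len(results)
    return results, mean_score

# Stage 1: Prepare Test Data
dataset = [
    {"query": "What is 2 + 2?", "expected_output": "4"},
    {"query": "Capital of France?", "expected_output": "Paris"},
]
toy_model = lambda q: "4" if "2 + 2" in q else "Paris"
exact_judge = lambda item: 1.0 if item["actual_output"] == item["expected_output"] else 0.0

results, mean_score = run_pipeline(dataset, toy_model, exact_judge)
print(mean_score)  # 1.0 for this toy model
```

Swapping `exact_judge` for an LLM-based or human scorer changes only Stage 3; the surrounding flow stays the same.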

Interactive Walkthrough Tab

A step-by-step guided experience that shows how evaluation data flows through the pipeline. Choose a scenario and watch each stage unfold.

Scenario Selection

Five evaluation scenarios are available, each highlighting different aspects of the evaluation process:

[Screenshot: the Interactive Walkthrough showing scenario tabs, a step-by-step viewer with the flow diagram (Input → AI → Judge → Score), Prev/Next playback controls, and an example data display.]

The five scenario cards:

  • Single Turn — Basic query-response evaluation
  • With Expected Output — Compare against reference
  • Multi-Turn — Conversation context
  • RAG Evaluation — Retrieval-augmented
  • Full Workflow — End-to-end evaluation

Key interactions:

  • Scenario tabs — Click any scenario card to switch the walkthrough content. The active scenario has a purple border and tinted background.
  • Step navigation — Use the Prev/Next buttons or the step indicator to move through pipeline stages
  • Flow diagram — The active stage is highlighted with a colored border; future stages are grayed out
  • Example data — Each step shows real data examples relevant to that stage (input, model output, judge reasoning, final scores)
Scenario | Steps | Focus
Single Turn | 4 | Basic query → response → judge → score
With Expected Output | 5 | Adds reference comparison (expected vs actual)
Multi-Turn | 5 | Conversation context and message history
RAG Evaluation | 6 | Retrieval context, faithfulness, relevance
Full Workflow | 7 | Complete pipeline from data prep to analysis

Methods Tab

A comprehensive reference for the four primary evaluation approaches. Each method is presented as an expandable card with pros, cons, and ideal use cases.

Method Cards

Four evaluation methods are presented with expandable detail panels:

[Screenshot: Methods tab showing four expandable method cards with complexity/scalability badges; the LLM-as-Judge card is expanded to show pros and cons.]

There are several approaches to evaluating AI systems, each with different tradeoffs:

LLM-as-Judge — Use a large language model to evaluate responses with scores and explanations. (Medium complexity · High scalability)

Advantages:

  • Scalable to thousands of evaluations
  • Can capture nuance and context
  • Provides detailed explanations

Limitations:

  • Requires calibration against humans
  • May have position/verbosity bias
  • API costs can add up at scale

Human Evaluation — Expert human annotators review and score AI outputs. The gold standard. (Low complexity · Low scalability)

Automated Metrics — BLEU, ROUGE, exact match, or embedding similarity for programmatic evaluation. (Low complexity · High scalability)

Hybrid Approach — Combine multiple evaluation methods for balanced cost, quality, and scale. (High complexity · High scalability)

Comparison Table

Below the method cards, a comparison matrix lets you evaluate methods side-by-side across six key attributes:
The matrix rates each method (LLM Judge, Human, Automated, Hybrid) with ✓ (yes), ~ (partial), or ✗ (no) on: scalable to large datasets, captures nuance, consistent results, low cost per evaluation, no reference needed, and explainable scores.
💡 Tip
For most teams, AXIS recommends starting with LLM-as-Judge for bulk evaluation, calibrating against human evaluation monthly, and using automated metrics as CI/CD quality gates.
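An automated-metric quality gate of the kind mentioned above can be sketched in pure Python. This is an illustrative example, not an AXIS feature: `exact_match` and `token_f1` (a SQuAD-style token-overlap F1) are invented helper names, and the threshold is arbitrary:

```python
from collections import Counter

def exact_match(pred: str, ref: str) -> float:
    """1.0 if prediction and reference match after whitespace/case normalization."""
    norm = lambda s: " ".join(s.lower().split())
    return 1.0 if norm(pred) == norm(ref) else 0.0

def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1 between prediction and reference (SQuAD-style)."""
    p, r = pred.lower().split(), ref.lower().split()
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

# CI gate: fail the build if mean F1 over the eval set drops below a threshold.
pairs = [
    ("Returns accepted within 30 days", "Returns are accepted within 30 days"),
]
mean_f1 = sum(token_f1(p, r) for p, r in pairs) / len(pairs)
assert mean_f1 >= 0.8, f"quality gate failed: F1={mean_f1:.2f}"
```

Because these metrics are cheap and deterministic, they suit per-commit gates, while LLM or human judging runs on a slower cadence.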

Best Practices Tab

Actionable guidance organized into do's/don'ts, collapsible deep-dive sections, and a common pitfalls gallery.

Do's and Don'ts Panel

A quick-reference panel at the top splits into two color-coded columns:

✓ Do

  • Use diverse test cases covering edge cases
  • Document evaluation criteria clearly
  • Calibrate LLM judges regularly
  • Track metrics over time for trends
  • Include positive and negative examples
  • Version datasets alongside models

✗ Don't

  • Rely solely on automated metrics
  • Use same data for training and eval
  • Ignore edge cases for common ones
  • Assume LLM judges are always correct
  • Evaluate only at release time
  • Discard results without analysis

The Do's and Don'ts quick-reference panel with color-coded columns for easy scanning.

Detailed Best Practices

Below the quick reference, five collapsible sections provide in-depth guidance:

Section | Icon | Content
Define Clear Criteria | 🎯 | Good vs. vague criteria examples, measurability tips
Use Representative Data | 📦 | Data coverage checklist (8 items), sampling strategies
Calibrate Judges Regularly | | 5-step calibration process, agreement metrics (Cohen's Kappa)
Iterate Continuously | 🔄 | Weekly/monthly/quarterly cadence recommendations
Document Everything | 📝 | Five documentation categories with checklists
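The agreement metric named in the calibration section, Cohen's Kappa, can be computed in a few lines of pure Python. The label sequences below are invented for illustration:

```python
def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' label sequences."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)

llm_judge = ["pass", "pass", "fail", "pass", "fail", "pass"]
human     = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(llm_judge, human), 2))  # 0.67
```

A kappa near 0 means the judge agrees with humans no better than chance; values above roughly 0.6 are commonly read as substantial agreement, though the right bar depends on your use case.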

Common Pitfalls

A gallery of four pitfall cards, each showing the mistake, its consequence, and the recommended solution:

⚠ Overfitting to Benchmarks
Mistake: Optimizing your model specifically for evaluation benchmarks
✓ Use held-out test sets and periodically refresh evaluation data

⚠ Position Bias
Mistake: Not accounting for LLMs preferring responses in certain positions
✓ Randomize response order and average across orderings

⚠ Verbosity Bias
Mistake: Not controlling for response length in evaluations
✓ Normalize for length or instruct judges to ignore length

⚠ Self-Preference Bias
Mistake: Using the same model family for generation and evaluation
✓ Use judges from different model families

Common pitfall cards showing mistake, consequence, and solution. Each card has a yellow left-accent border for visual distinction.

⚠️ Warning
Position bias and verbosity bias are the two most common issues when using LLM-as-Judge. Always randomize response order and control for length before trusting comparison results.
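The order-randomization remedy for position bias can be sketched as follows. This is an illustrative harness, not an AXIS API: `judge` stands in for any pairwise comparator that returns which presented position won ("first" or "second"):

```python
import random

def debiased_compare(judge, resp_a, resp_b, trials=10, rng=None):
    """Fraction of trials resp_a wins, averaged over randomized presentation order."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    wins_a = 0
    for _ in range(trials):
        # Randomly decide which response is shown first.
        if rng.random() < 0.5:
            first, second, a_is_first = resp_a, resp_b, True
        else:
            first, second, a_is_first = resp_b, resp_a, False
        verdict = judge(first, second)  # "first" or "second"
        if (verdict == "first") == a_is_first:
            wins_a += 1
    return wins_a / trials

# With a judge exhibiting pure position bias (always prefers whatever is shown
# first), randomization drives resp_a's win rate toward 0.5 — i.e. a tie,
# exposing that the judge carries no real signal.
biased_judge = lambda first, second: "first"
rate = debiased_compare(biased_judge, "A", "B", trials=1000)
```

A genuinely better response still wins most orderings, so a win rate near 0.5 after randomization is a red flag that the verdicts were driven by position rather than quality.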

Call to Action

At the bottom of every tab, a gradient call-to-action card encourages the user to take the next step. On the Overview tab, it links to the Interactive Walkthrough. On the Best Practices tab, it summarizes key takeaways with a shield icon.

ℹ️ Info
The CTA card at the very bottom of the page (below all tabs) links to the Evaluate page with a "Get Started" button, encouraging hands-on learning immediately after reading.

Next Steps

AXIS Documentation · Built with MkDocs Material