Learn

Interactive guides and best practices for AI evaluation — from fundamental concepts to advanced methodologies.

Why Use the Learn Page?

Before diving into dashboards and metrics, you need to understand what you're measuring and why. The Learn page is your onboarding companion and ongoing reference for evaluation concepts.

📖 Foundational Concepts

Understand AI evaluation fundamentals: accuracy, consistency, safety, and alignment.

🎯 Interactive Walkthroughs

Step through real evaluation scenarios with animated data flow diagrams.

⚖️ Method Comparison

Compare LLM-as-Judge, human evaluation, automated metrics, and hybrid approaches.

🛡 Best Practices

Avoid common pitfalls and build a robust, repeatable evaluation strategy.

Quick Start

The Learn page is divided into four tabs. Here's the recommended learning path:

1. Start with the Overview

Read through the foundational concepts. Understand what a dataset item looks like and how the evaluation pipeline flows from input to score.

2. Try the Interactive Walkthrough

Select a scenario (Single Turn, Multi-Turn, RAG, etc.) and step through the evaluation flow. Watch data transform at each stage of the pipeline.

3. Study the Methods

Compare the four evaluation approaches. Understand the tradeoffs between cost, quality, and scalability for your use case.

4. Review Best Practices

Internalize the do's and don'ts, learn about common pitfalls like position bias and overfitting to benchmarks, and build a sustainable evaluation cadence.

💡 Tip
New to AXIS? Start with the Overview tab, then jump straight to the Evaluate page to upload your first dataset. Come back to Methods and Best Practices as you refine your evaluation strategy.

Page Anatomy

Here's how the Learn page is organized. Four tabs provide progressively deeper content:

[Screenshot: localhost:3500/learn — the Learn page showing the tab navigation bar, hero section, quick tips row, and concept cards in the Overview tab. Numbered callouts 1–5 mark the regions described below.]
1. Page Header — Shows the page title, icon, and subtitle. Consistent across all AXIS pages.
2. Tab Navigation — Four tabs: Overview, Interactive Walkthrough, Methods, and Best Practices. The active tab is highlighted with the primary color.
3. Hero Section — Gradient card introducing the current tab's topic with a decorative icon.
4. Quick Tips Row — Color-coded Do/Don't/Tip cards for at-a-glance best practices.
5. Concept Cards — Expandable content cards covering evaluation fundamentals, data structure, and the pipeline flow.

Overview Tab

The default landing tab. It introduces three fundamental concepts every user needs before starting their first evaluation.

What is AI Evaluation?

The first concept card explains the four dimensions of evaluation:

  • Accuracy — Does the model produce correct outputs?
  • Consistency — Are outputs reliable across similar inputs?
  • Safety — Does the model avoid harmful outputs?
  • Alignment — Does the model behave as intended?

Dataset Item Model

The second card shows the anatomy of a single evaluation record. Understanding this structure is essential for formatting your upload data correctly.

Dataset Item Structure:

  • Input — query, context, system_prompt
  • Output — actual_output, expected_output
  • Scores — metric_name, metric_score (0–1)
  • Metadata — dataset_id, timestamp, model, env

The dataset item model showing the four data groups: input fields, output fields, score fields, and metadata.
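The structure above can be sketched as a single record. This is an illustrative example only — the field names follow the diagram, but AXIS's actual upload schema may differ, and all values here are invented:

```python
# A hypothetical dataset item mirroring the four field groups above.
item = {
    # Input
    "query": "What are the return policies for electronics purchased online?",
    "context": "Electronics may be returned within 30 days with a receipt.",
    "system_prompt": "You are a helpful support assistant.",
    # Output
    "actual_output": "You can return electronics within 30 days of purchase.",
    "expected_output": "Electronics are returnable within 30 days with a receipt.",
    # Scores (0-1 range)
    "metric_name": "answer_correctness",
    "metric_score": 0.85,
    # Metadata
    "dataset_id": "ds_001",
    "timestamp": "2025-01-15T10:30:00Z",
    "model": "gpt-4o",
    "env": "staging",
}

assert 0.0 <= item["metric_score"] <= 1.0
```

Keeping all four groups in one flat record makes each evaluation row self-describing, which simplifies filtering and aggregation later.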

The Evaluation Pipeline

The third concept card visualizes the end-to-end flow:

  1. Prepare Test Data — Structure your queries and expected outputs into a dataset
  2. Generate Responses — Run your AI model against the test queries
  3. Evaluate Quality — Use judges (LLM, human, or automated) to score responses
  4. Analyze Results — Review metrics, identify patterns, and iterate on improvements
ℹ️ Info
At the bottom of the Overview tab, a call-to-action card invites you to try the Interactive Walkthrough to see this pipeline in action with real data examples.
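The four pipeline stages can be sketched end-to-end in a few lines. This is a minimal illustration, not an AXIS API: `run_pipeline`, `toy_model`, and `exact_judge` are invented names standing in for your model and scoring logic:

```python
# A minimal sketch of the four-stage evaluation pipeline.
def run_pipeline(dataset, model, judge):
    results = []
    for item in dataset:
        item = dict(item)
        # Stage 2: Generate Responses
        item["actual_output"] = model(item["query"])
        # Stage 3: Evaluate Quality
        item["metric_score"] = judge(item)
        results.append(item)
    # Stage 4: Analyze Results (aggregate a summary metric)
    mean_score = sum(r["metric_score"] for r in results) / len(results)
    return results, mean_score

# Stage 1: Prepare Test Data
dataset = [
    {"query": "What is 2 + 2?", "expected_output": "4"},
    {"query": "Capital of France?", "expected_output": "Paris"},
]
toy_model = lambda q: "4" if "2 + 2" in q else "Paris"
exact_judge = lambda item: 1.0 if item["actual_output"] == item["expected_output"] else 0.0

results, mean_score = run_pipeline(dataset, toy_model, exact_judge)
print(mean_score)  # 1.0 for this toy model
```

Swapping `exact_judge` for an LLM-based or human scorer changes only Stage 3; the surrounding flow stays the same.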

Interactive Walkthrough Tab

A step-by-step guided experience that shows how evaluation data flows through the pipeline. Choose a scenario and watch each stage unfold.

Scenario Selection

Five evaluation scenarios are available, each highlighting different aspects of the evaluation process:

[Screenshot: the Interactive Walkthrough showing scenario tabs, a step-by-step viewer with the flow diagram (Input → AI → Judge → Score), Prev/Next playback controls, and an example data display.]

The five scenario cards:

  • Single Turn — Basic query-response evaluation
  • With Expected Output — Compare against reference
  • Multi-Turn — Conversation context
  • RAG Evaluation — Retrieval-augmented
  • Full Workflow — End-to-end evaluation

Key interactions:

  • Scenario tabs — Click any scenario card to switch the walkthrough content. The active scenario has a purple border and tinted background.
  • Step navigation — Use the Prev/Next buttons or the step indicator to move through pipeline stages
  • Flow diagram — The active stage is highlighted with a colored border; future stages are grayed out
  • Example data — Each step shows real data examples relevant to that stage (input, model output, judge reasoning, final scores)
Scenario | Steps | Focus
Single Turn | 4 | Basic query → response → judge → score
With Expected Output | 5 | Adds reference comparison (expected vs actual)
Multi-Turn | 5 | Conversation context and message history
RAG Evaluation | 6 | Retrieval context, faithfulness, relevance
Full Workflow | 7 | Complete pipeline from data prep to analysis

Methods Tab

A comprehensive reference for the four primary evaluation approaches. Each method is presented as an expandable card with pros, cons, and ideal use cases.

Method Cards

Four evaluation methods are presented with expandable detail panels:

[Screenshot: Methods tab showing four expandable method cards with complexity/scalability badges; the LLM-as-Judge card is expanded to show pros and cons.]

There are several approaches to evaluating AI systems, each with different tradeoffs:

LLM-as-Judge — Use a large language model to evaluate responses with scores and explanations. (Medium complexity · High scalability)

Advantages:

  • Scalable to thousands of evaluations
  • Can capture nuance and context
  • Provides detailed explanations

Limitations:

  • Requires calibration against humans
  • May have position/verbosity bias
  • API costs can add up at scale

Human Evaluation — Expert human annotators review and score AI outputs. The gold standard. (Low complexity · Low scalability)

Automated Metrics — BLEU, ROUGE, exact match, or embedding similarity for programmatic evaluation. (Low complexity · High scalability)

Hybrid Approach — Combine multiple evaluation methods for balanced cost, quality, and scale. (High complexity · High scalability)

Comparison Table

Below the method cards, a comparison matrix lets you evaluate methods side-by-side across six key attributes:
The matrix rates each method (LLM Judge, Human, Automated, Hybrid) with ✓ (yes), ~ (partial), or ✗ (no) on: scalable to large datasets, captures nuance, consistent results, low cost per evaluation, no reference needed, and explainable scores.
💡 Tip
For most teams, AXIS recommends starting with LLM-as-Judge for bulk evaluation, calibrating against human evaluation monthly, and using automated metrics as CI/CD quality gates.
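An automated-metric quality gate of the kind mentioned above can be sketched in pure Python. This is an illustrative example, not an AXIS feature: `exact_match` and `token_f1` (a SQuAD-style token-overlap F1) are invented helper names, and the threshold is arbitrary:

```python
from collections import Counter

def exact_match(pred: str, ref: str) -> float:
    """1.0 if prediction and reference match after whitespace/case normalization."""
    norm = lambda s: " ".join(s.lower().split())
    return 1.0 if norm(pred) == norm(ref) else 0.0

def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1 between prediction and reference (SQuAD-style)."""
    p, r = pred.lower().split(), ref.lower().split()
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

# CI gate: fail the build if mean F1 over the eval set drops below a threshold.
pairs = [
    ("Returns accepted within 30 days", "Returns are accepted within 30 days"),
]
mean_f1 = sum(token_f1(p, r) for p, r in pairs) / len(pairs)
assert mean_f1 >= 0.8, f"quality gate failed: F1={mean_f1:.2f}"
```

Because these metrics are cheap and deterministic, they suit per-commit gates, while LLM or human judging runs on a slower cadence.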

Best Practices Tab

Actionable guidance organized into do's/don'ts, collapsible deep-dive sections, and a common pitfalls gallery.

Do's and Don'ts Panel

A quick-reference panel at the top splits into two color-coded columns:

✓ Do

  • Use diverse test cases covering edge cases
  • Document evaluation criteria clearly
  • Calibrate LLM judges regularly
  • Track metrics over time for trends
  • Include positive and negative examples
  • Version datasets alongside models

✗ Don't

  • Rely solely on automated metrics
  • Use same data for training and eval
  • Ignore edge cases for common ones
  • Assume LLM judges are always correct
  • Evaluate only at release time
  • Discard results without analysis

The Do's and Don'ts quick-reference panel with color-coded columns for easy scanning.

Detailed Best Practices

Below the quick reference, five collapsible sections provide in-depth guidance:

Section | Icon | Content
Define Clear Criteria | 🎯 | Good vs. vague criteria examples, measurability tips
Use Representative Data | 📦 | Data coverage checklist (8 items), sampling strategies
Calibrate Judges Regularly | | 5-step calibration process, agreement metrics (Cohen's Kappa)
Iterate Continuously | 🔄 | Weekly/monthly/quarterly cadence recommendations
Document Everything | 📝 | Five documentation categories with checklists
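The agreement metric named in the calibration section, Cohen's Kappa, can be computed in a few lines of pure Python. The label sequences below are invented for illustration:

```python
def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' label sequences."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)

llm_judge = ["pass", "pass", "fail", "pass", "fail", "pass"]
human     = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(llm_judge, human), 2))  # 0.67
```

A kappa near 0 means the judge agrees with humans no better than chance; values above roughly 0.6 are commonly read as substantial agreement, though the right bar depends on your use case.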

Common Pitfalls

A gallery of four pitfall cards, each showing the mistake, its consequence, and the recommended solution:

⚠ Overfitting to Benchmarks
Mistake: Optimizing your model specifically for evaluation benchmarks
✓ Use held-out test sets and periodically refresh evaluation data

⚠ Position Bias
Mistake: Not accounting for LLMs preferring responses in certain positions
✓ Randomize response order and average across orderings

⚠ Verbosity Bias
Mistake: Not controlling for response length in evaluations
✓ Normalize for length or instruct judges to ignore length

⚠ Self-Preference Bias
Mistake: Using the same model family for generation and evaluation
✓ Use judges from different model families

Common pitfall cards showing mistake, consequence, and solution. Each card has a yellow left-accent border for visual distinction.

⚠️ Warning
Position bias and verbosity bias are the two most common issues when using LLM-as-Judge. Always randomize response order and control for length before trusting comparison results.
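The order-randomization remedy for position bias can be sketched as follows. This is an illustrative harness, not an AXIS API: `judge` stands in for any pairwise comparator that returns which presented position won ("first" or "second"):

```python
import random

def debiased_compare(judge, resp_a, resp_b, trials=10, rng=None):
    """Fraction of trials resp_a wins, averaged over randomized presentation order."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    wins_a = 0
    for _ in range(trials):
        # Randomly decide which response is shown first.
        if rng.random() < 0.5:
            first, second, a_is_first = resp_a, resp_b, True
        else:
            first, second, a_is_first = resp_b, resp_a, False
        verdict = judge(first, second)  # "first" or "second"
        if (verdict == "first") == a_is_first:
            wins_a += 1
    return wins_a / trials

# With a judge exhibiting pure position bias (always prefers whatever is shown
# first), randomization drives resp_a's win rate toward 0.5 — i.e. a tie,
# exposing that the judge carries no real signal.
biased_judge = lambda first, second: "first"
rate = debiased_compare(biased_judge, "A", "B", trials=1000)
```

A genuinely better response still wins most orderings, so a win rate near 0.5 after randomization is a red flag that the verdicts were driven by position rather than quality.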

Call to Action

At the bottom of every tab, a gradient call-to-action card encourages the user to take the next step. On the Overview tab, it links to the Interactive Walkthrough. On the Best Practices tab, it summarizes key takeaways with a shield icon.

ℹ️ Info
The CTA card at the very bottom of the page (below all tabs) links to the Evaluate page with a "Get Started" button, encouraging hands-on learning immediately after reading.

Next Steps

AXIS Documentation · Built with MkDocs Material