Evaluate

A guided five-step workflow for uploading evaluation data, reviewing scorecards, exploring deep-dive analytics, and comparing experiments side by side.

Why Use Evaluate?

Every LLM application needs a structured way to measure quality. The Evaluate module provides a complete pipeline — from importing raw results to interactive visualizations — so you can understand what your model gets right, where it struggles, and how changes compare across experiments.

🎯 Hierarchical Scorecards

Weighted, expandable metric trees with sparkline distributions and drill-down modals for every score.

🔍 Deep Dive Analysis

Seven sub-tabs covering overview KPIs, distributions, tradeoffs, tree views, response analysis, conversation turns, and metadata breakdowns.

🔄 Experiment Comparison

Head-to-head experiment analysis with win/loss charts, radar overlays, and per-case diff views.

📋 Report Generation

Auto-generate evaluation reports from the scorecard — scoped to the full hierarchy or any individual sub-tree.

Quick Start

Get your first evaluation scorecard in under two minutes:

1. Navigate to Evaluate — Click Evaluate in the left sidebar. The step navigation bar appears at the top, starting at Runner (or Upload if the Runner is disabled).
2. Upload Your CSV — Click the Upload step. Drag a CSV into the upload zone or click to browse. You can also load an example dataset to explore the UI immediately.
3. Review the Scorecard — After upload, click Continue to Scorecard (or click the Scorecard step). The KPI cards, hierarchical table, and report generator appear instantly.
4. Explore & Compare — Use Deep Dive for per-metric charts and analysis, or Compare to pit two experiments against each other.

💡 Tip
If you don't have evaluation data yet, click Example Model or Model Comparison on the upload page to load a pre-built dataset and explore every feature.

Step Navigation

The Evaluate section uses a horizontal step indicator that guides you through the workflow. Each step is a separate route under /evaluate/*.

[Screenshot: localhost:3500/evaluate/scorecard, "Evaluate: run evaluations, import results, and visualize with interactive charts." Step bar: 🔒 Runner (Run batch evaluations) · Upload (Import evaluation results) · 🏆 Scorecard (Executive summary view) · 🔎 Deep Dive (Detailed analysis) · 🔀 Compare (Compare experiments).]

The five-step navigation bar. Scorecard is active (primary color), Upload is completed (pale), Runner is disabled (lock icon), and Deep Dive / Compare are ahead (gray).
1. Runner (Disabled) — Batch evaluation runner. When disabled by an admin via the eval_runner_enabled feature flag, it shows a lock icon and a "Disabled by Admin" tooltip on hover.
2. Upload — Import CSV evaluation data via drag-and-drop or load an example dataset. Once data is loaded, this step turns pale to indicate completion.
3. Scorecard — Executive summary: KPI cards, hierarchical scorecard table with weights and sparklines, drill-down modals, and a report generator.
4. Deep Dive — Seven interactive sub-tabs for granular analysis: Overview, Distribution, Tradeoffs, Tree, Response, Conversation, and Metadata.
5. Compare — Multi-experiment comparison with two view modes (Test Cases and Experiments), head-to-head summaries, and four chart types.

Step state logic:

  • Active step — primary background with white text and icon
  • Past step — pale primary background with primary text (indicates you've been further ahead)
  • Future step — gray background with muted text
  • Connectors — the short lines between steps turn primary when a step is completed
  • Disabled — grayed out at 50% opacity with a lock icon; non-clickable

Upload

The Upload step (/evaluate/upload) accepts CSV evaluation data via a drag-and-drop zone. AXIS auto-detects the data format and validates the schema on the backend.

[Screenshot: localhost:3500/evaluate/upload, "Upload Evaluation Data: import your CSV file or try an example dataset." File Upload / Database tabs; drag-and-drop zone ("Drop your CSV file here, or click to browse · CSV files up to 50 MB"); example dataset buttons (📊 Example Model, 🔄 Model Comparison); success banner ("Data Loaded Successfully · Format: tree_format · 2,450 records") with a Continue to Scorecard button and a preview table:]

| dataset_id | query | actual_output | metric_name | metric_score |
| --- | --- | --- | --- | --- |
| case_001 | What is the return policy... | Our return policy allows... | Faithfulness | 0.91 |
| case_001 | What is the return policy... | Our return policy allows... | Relevance | 0.87 |
| case_002 | Explain the pricing tiers... | We offer three tiers... | Completeness | 0.54 |

Upload page showing the drag-and-drop zone, example datasets, and the success state with format badge, row count, and data preview table.
1. Drag-and-Drop Zone — Drop a CSV or click to browse. Highlights on hover. Supports files up to 50 MB.
2. Example Datasets — Pre-built sample data. "Example Model" loads a single-experiment dataset; "Model Comparison" loads multi-experiment data for the Compare step.
3. Success State — After upload, shows the detected format (tree_format, flat_format, or simple_judgment), the total record count, and a preview table of the first rows.
ℹ️ Info
AXIS auto-detects the data format. Tree format and flat format use columns metric_name and metric_score (one row per metric observation). Simple judgment uses columns ending in _score (one row per test case with all scores as columns). The Scorecard and Report Generator require tree/flat format for hierarchical features.
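The column rules in the note above are enough to sketch a format detector. The heuristic below is hypothetical (in particular, using the presence of metric_type to tell tree from flat format is an assumption); AXIS's real detector may differ:

```python
def detect_format(columns):
    """Guess the evaluation CSV format from its column names.

    Hypothetical heuristic following the column rules described
    above; AXIS's actual detection logic may differ.
    """
    cols = set(columns)
    if {"metric_name", "metric_score"} <= cols:
        # One row per metric observation. Assume a metric_type
        # column marks a hierarchical (tree) layout.
        return "tree_format" if "metric_type" in cols else "flat_format"
    if any(c.endswith("_score") for c in cols):
        # One row per test case, scores spread across *_score columns.
        return "simple_judgment"
    return "unknown"
```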

Scorecard

The Scorecard step (/evaluate/scorecard) provides an executive summary of your evaluation results. It combines KPI cards, a hierarchical metric table with weights and sparklines, drill-down modals, and a report generator.

KPI Cards

Five cards at the top give you an at-a-glance summary:

[Screenshot: Scorecard, "Executive summary with hierarchical metrics." KPI cards: Overall Weighted Score 0.782 (weighted avg across hierarchy), Score Variance 0.034 (consistency measure), Test Cases 245 (unique evaluations), Metrics 8 (metric_type = metric), Components 3 (metric_type = component). Report Generator panel with a Generate Full Report button; Expand All / Collapse All controls; "8 metrics · 3 components". Hierarchical table:]

| Metric | Type | Weight | Avg Score | Range |
| --- | --- | --- | --- | --- |
| Overall Quality | component | 100% | 0.782 | 0.34 – 0.97 |
| Faithfulness | metric | 35% | 0.847 | 0.42 – 0.97 |
| Relevance | metric | 30% | 0.815 | 0.38 – 0.95 |
| Completeness | metric | 20% | 0.623 | 0.21 – 0.88 |
| Safety | metric | 15% | 0.934 | 0.71 – 1.00 |

Scorecard view showing five KPI cards, the report generator panel, and the hierarchical metric table with expandable tree rows, weights, score badges, and distribution sparklines.
1. KPI Cards — Five metrics: Overall Weighted Score (weighted average across the full hierarchy), Score Variance (consistency measure), Test Cases, Metrics count, and Components count.
2. Report Generator — Click to generate a comprehensive evaluation report. Reports can also be scoped to individual sub-trees via the report icon on each table row.
3. Hierarchical Table — Expandable tree structure showing metrics nested under components. Each row shows a type badge, normalized weight, color-coded average score, score range, and a sparkline distribution histogram. Click the eye icon to open the drill-down modal.
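The Overall Weighted Score KPI is a weighted average of the per-metric averages. A minimal sketch of that aggregation, using the example weights and averages shown on this page (illustrative only, not the module's actual code):

```python
def weighted_score(metrics):
    """Weighted average of per-metric average scores.

    `metrics` maps metric name -> (avg_score, weight). Weights need
    not sum to 1; they are normalized here. Illustrative sketch only.
    """
    total = sum(w for _, w in metrics.values())
    return sum(score * w for score, w in metrics.values()) / total

example = {
    "Faithfulness": (0.847, 35),
    "Relevance":    (0.815, 30),
    "Completeness": (0.623, 20),
    "Safety":       (0.934, 15),
}
overall = weighted_score(example)  # ~0.806 for these four metrics
```

Note the result differs from the 0.782 shown in the screenshot, which aggregates all 8 metrics and 3 components, not just the four metrics listed here.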

Hierarchical Table Columns

| Column | Description |
| --- | --- |
| Metric | Tree-indented name with expand/collapse toggles. Components show at the top level; metrics nest underneath. |
| Type | component (parent aggregate) or metric (leaf evaluation metric). |
| Weight | Normalized percentage weight. Auto-computed so all sibling weights sum to 100% within each parent. |
| Avg Score | Color-coded badge by threshold: ≥ 0.7, 0.5–0.7, < 0.5. |
| Range | Min–max scores observed across all test cases for this metric. |
| Distribution | Inline sparkline histogram showing how scores are distributed. |
📝 Note
The table supports sorting by name, score, or weight via the column header arrows. Click Expand All / Collapse All to toggle the full tree at once.
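The sibling-weight normalization described in the Weight column can be sketched as follows. This is an assumed reimplementation of the documented behavior, not AXIS source code:

```python
def normalize_weights(raw):
    """Scale sibling weights so they sum to 100% within one parent.

    `raw` maps child name -> raw weight on any positive scale.
    Assumed behavior matching the Weight column description above.
    """
    total = sum(raw.values())
    return {name: round(100 * w / total, 1) for name, w in raw.items()}

# Raw weights 7/6/4/3 normalize to the 35/30/20/15 split shown above.
split = normalize_weights({"Faithfulness": 7, "Relevance": 6,
                           "Completeness": 4, "Safety": 3})
```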

Drill-Down Modal

Click the eye icon on any scorecard row to open a detailed view of that metric — showing the full score distribution chart, per-test-case breakdown, and the failing cases that need attention.

Deep Dive

The Deep Dive step (/evaluate/deep-dive) provides seven interactive sub-tabs for granular analysis. Each tab focuses on a different aspect of your evaluation data.

[Screenshot: localhost:3500/evaluate/deep-dive?tab=overview, "Deep Dive: detailed analysis and interactive visualizations." Sub-tab pills: Overview · Distribution · Tradeoffs · Tree · Response · Conversation · Metadata. KPI cards: Average Score 0.782 (above threshold), Pass Rate 76.3% (at ≥ 0.5 threshold), Test Cases 245 (unique evaluations), Score Variance 0.034 (cross-metric consistency). Charts: Mean Scores by Metric (Faithfulness 0.85, Relevance 0.82, Completeness 0.62, Safety 0.93, with reference lines at 0.5 and 0.7) and a Metric Radar (Faithfulness, Relevance, Completeness, Safety, Tone).]

Deep Dive overview showing the seven sub-tabs, KPI summary cards, mean scores bar chart, and a radar chart for multi-metric comparison.
1. Sub-Tab Navigation — Seven pills in a rounded bar. The active tab gets the primary background. Tab state persists in Zustand and syncs to the URL query parameter (?tab=overview).

Sub-Tab Reference

| Tab | Icon | Shows |
| --- | --- | --- |
| Overview | BarChart3 | KPI cards (avg score, pass rate, test cases, variance), a bar chart of mean scores, and a radar chart |
| Distribution | PieChart | Violin or box charts per metric. Toggle between violin and box mode. Multi-select metrics to compare. |
| Tradeoffs | Workflow | Scatter plot of metric A vs metric B. Color-by grouping (experiment, metadata fields). Correlation coefficients. |
| Tree | GitBranch | D3-powered hierarchical tree visualization of the metric hierarchy, matching the scorecard structure. |
| Response | FileText | Response length analysis — how output length correlates with evaluation scores. |
| Conversation | MessageSquare | Turn count analysis — how multi-turn conversation depth affects evaluation quality. |
| Metadata | Tag | Score breakdowns by metadata fields (experiment name, model, category, or any custom metadata key). |
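The Overview tab's KPIs (average score, pass rate at a threshold, score variance) are straightforward to reproduce from a list of raw scores. A minimal sketch, assuming scores as plain floats:

```python
from statistics import fmean, pvariance

def pass_rate(scores, threshold=0.5):
    """Fraction of scores at or above the pass threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

scores = [0.9, 0.6, 0.4, 0.8]      # made-up sample scores
avg = fmean(scores)                # average score KPI
rate = pass_rate(scores)           # pass rate at the default 0.5 threshold
var = pvariance(scores)            # score variance (consistency measure)
```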

Distribution Tab

Shows violin or box plots for each selected metric. Use the metric selector to include or exclude specific metrics from the chart.

[Screenshot: Deep Dive Distribution tab with a Violin / Box mode toggle and metric pills for Faithfulness, Relevance, Completeness, and Safety; the y-axis runs 0.0 to 1.0.]

Distribution tab showing violin plots for three selected metrics. The interquartile box and median line are visible within each violin shape.
  • Violin mode shows the full probability density — wider areas indicate more data points at that score level
  • Box mode switches to traditional box plots with quartiles, whiskers, and outlier dots
  • Metric pills at the top-right toggle which metrics appear in the chart

Tradeoffs Tab

Scatter plot with two metric axes — useful for understanding metric correlations. For example, plot Faithfulness vs Completeness to see if improving one hurts the other.

💡 Tip
Use the Color By dropdown to color scatter points by experiment name or metadata fields. This reveals whether different experiments cluster in distinct regions of the metric space.
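The correlation coefficient shown on this tab is the standard Pearson r between two metrics' scores. A self-contained sketch with made-up sample scores (not data from any real run):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

faithfulness = [0.91, 0.87, 0.54, 0.78, 0.95]  # hypothetical scores
completeness = [0.62, 0.58, 0.21, 0.49, 0.71]
r = pearson(faithfulness, completeness)  # strongly positive here
```

A value near +1 means the two metrics rise and fall together; a value near −1 would indicate a genuine tradeoff.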

Tree Tab

An interactive D3 force-directed tree visualization showing the full metric hierarchy from the scorecard. Nodes are sized by weight and colored by score.

Response & Conversation Tabs

These tabs analyze how structural properties of the input/output relate to evaluation quality:

  • Response tab — correlates response length (character/token count) with metric scores
  • Conversation tab — correlates conversation turn count with metric scores (useful for multi-turn chat evaluations)

Metadata Tab

Breaks down scores by any metadata field in your data. Select a field (e.g., evaluation_name, model, or custom fields from the metadata JSON column) to see grouped bar charts and summary tables.

Compare

The Compare step (/evaluate/compare) enables side-by-side analysis of multiple experiments. It supports two view modes: Test Cases (per-case comparison) and Experiments (head-to-head aggregate analysis).

[Screenshot: localhost:3500/evaluate/compare, "Compare: analyze and compare evaluation results," in Experiments mode. View-mode toggle: Test Cases / Experiments. Experiment selector: Baseline gpt-4o-experiment (245 cases) vs Challenger claude-3.5-experiment (245 cases). Head-to-head summary: Baseline Wins 127 · Ties 42 · Challenger Wins 76. Comparison Charts section with Win/Loss, Distribution, Radar, and Bar tabs; the Win/Loss chart shows stacked baseline-wins / tie / challenger-wins bars for Faithfulness, Relevance, Completeness, and Safety.]

Compare view in Experiments mode showing the experiment selector, head-to-head win summary, and a Win/Loss stacked bar chart with four chart type tabs.
1. View Mode Toggle — Switch between Test Cases (per-case side-by-side table with diff view) and Experiments (aggregate head-to-head comparison with charts).
2. Experiment Selector — Pick a baseline and a challenger from your experiments. The swap button (⇄) reverses them. Auto-selects when exactly two experiments exist.
3. Head-to-Head Summary — Aggregate win/tie/loss counts across all test cases and metrics. Shows which experiment outperforms overall.
4. Comparison Charts — Collapsible section with four chart types: Win/Loss (stacked horizontal bars per metric), Distribution (score density comparison), Radar (multi-metric overlay), and Bar (grouped mean scores).
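The head-to-head counts can be derived by pairing baseline and challenger scores per (test case, metric) and comparing them. A hedged sketch: the tie rule here (exact equality, with an optional tolerance) is an assumption, not AXIS's documented behavior:

```python
def head_to_head(baseline, challenger, tie_eps=0.0):
    """Count baseline wins / ties / challenger wins.

    `baseline` and `challenger` map (case_id, metric) -> score.
    `tie_eps` is an assumed tolerance: score gaps within it count
    as ties. Illustrative only.
    """
    wins = ties = losses = 0
    for key, b in baseline.items():
        c = challenger.get(key)
        if c is None:
            continue  # case/metric missing from the challenger run
        if abs(b - c) <= tie_eps:
            ties += 1
        elif b > c:
            wins += 1
        else:
            losses += 1
    return {"baseline_wins": wins, "ties": ties, "challenger_wins": losses}
```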

Chart Types

| Chart | What It Shows | Best For |
| --- | --- | --- |
| Win/Loss | Stacked horizontal bars per metric: baseline wins (left), ties (center), challenger wins (right) | Quick per-metric dominance comparison |
| Distribution | Overlaid score density curves for baseline vs challenger on a single metric | Understanding where score distributions differ |
| Radar | Multi-metric spider chart with one polygon per experiment | Holistic multi-dimensional comparison |
| Bar | Grouped bar chart of mean scores per metric per experiment | Direct numeric comparison across metrics |

Test Cases View

In Test Cases mode, the Compare step shows a per-case table with side-by-side outputs and per-metric score deltas. Features include:

  • Side-by-side table with baseline and challenger columns
  • Score delta badges — green when challenger improves, red when it regresses
  • Case diff view — inline text diff highlighting output differences
  • Test case detail modal — full view of a single case with all fields and metric scores
  • Export — download the full comparison as a CSV report
⚠️ Warning
The Compare step requires your data to contain at least two distinct values in the evaluation_name column. If your data has only one experiment, the Experiments view will show a prompt to upload multi-experiment data.

Data Formats

AXIS auto-detects your CSV format on upload. Three formats are supported for the Evaluate module:

Tree / Flat Format (Recommended)

Each row is a single metric observation. Required for the Scorecard's hierarchical features and the Report Generator.

dataset_id,query,actual_output,metric_name,metric_score,metric_type,evaluation_name,metadata
case_001,What is...,The answer...,Faithfulness,0.85,metric,gpt-4o-run,"{""model"":""gpt-4o""}"
case_001,What is...,The answer...,Overall Quality,0.82,component,gpt-4o-run,"{}"

Simple Judgment Format

Each row contains all metrics as columns (column names ending in _score). Simpler but does not support hierarchical scorecard features.

dataset_id,query,actual_output,faithfulness_score,relevance_score
case_001,What is...,The answer...,0.85,0.92
ℹ️ Info
The metric_type column distinguishes component (parent aggregate) from metric (leaf evaluation score). The scorecard uses this to build the hierarchy. If metric_type is missing, all metrics are treated as flat leaves.
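If you have simple-judgment data but need the flat layout, the wide *_score columns can be melted into one row per metric. A pure-Python sketch following the column conventions above (illustrative; metric names are derived by stripping the _score suffix, so their casing may differ from your hierarchy's display names):

```python
import csv
import io

def melt_simple_judgment(csv_text):
    """Expand simple-judgment rows (one per case, *_score columns)
    into flat-format rows (one per metric observation).

    Illustrative sketch only, following the examples above.
    """
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        base = {k: v for k, v in rec.items() if not k.endswith("_score")}
        for col, val in rec.items():
            if col.endswith("_score"):
                rows.append({**base,
                             "metric_name": col[:-len("_score")],
                             "metric_score": float(val)})
    return rows

sample = """dataset_id,query,actual_output,faithfulness_score,relevance_score
case_001,What is...,The answer...,0.85,0.92
"""
flat = melt_simple_judgment(sample)  # two flat rows from one wide row
```

With pandas available, pd.melt achieves the same reshaping in one call.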

Next Steps

AXIS Documentation · Built with MkDocs Material