Evaluate
A guided five-step workflow for uploading evaluation data, reviewing scorecards, exploring deep-dive analytics, and comparing experiments side by side.
Why Use Evaluate?
Every LLM application needs a structured way to measure quality. The Evaluate module provides a complete pipeline — from importing raw results to interactive visualizations — so you can understand what your model gets right, where it struggles, and how changes compare across experiments.
Hierarchical Scorecards
Weighted, expandable metric trees with sparkline distributions and drill-down modals for every score.
Deep Dive Analysis
Seven sub-tabs covering overview KPIs, distributions, tradeoffs, tree views, response analysis, conversation turns, and metadata breakdowns.
Experiment Comparison
Head-to-head experiment analysis with win/loss charts, radar overlays, and per-case diff views.
Report Generation
Auto-generate evaluation reports from the scorecard — scoped to the full hierarchy or any individual sub-tree.
Quick Start
Get your first evaluation scorecard in under two minutes:
Navigate to Evaluate
Click Evaluate in the left sidebar. The step navigation bar appears at the top, starting at the Runner (or Upload if Runner is disabled).
Upload Your CSV
Click the Upload step. Drag a CSV into the upload zone or click to browse. You can also load an example dataset to explore the UI immediately.
Review the Scorecard
After upload, click Continue to Scorecard (or click the Scorecard step). The KPI cards, hierarchical table, and report generator appear instantly.
Explore & Compare
Use Deep Dive for per-metric charts and analysis, or Compare to pit two experiments against each other.
Step Navigation
The Evaluate section uses a horizontal step indicator that guides you through the workflow. Each step is a separate route under /evaluate/*.
When the Runner step is disabled via the eval_runner_enabled feature flag, it shows a lock icon and a "Disabled by Admin" tooltip on hover.
Step state logic:
- Active step — primary background with white text and icon
- Past step — pale primary background with primary text (indicates you've been further ahead)
- Future step — gray background with muted text
- Connectors — the short lines between steps turn primary when a step is completed
- Disabled — grayed out at 50% opacity with a lock icon; non-clickable
Upload
The Upload step (/evaluate/upload) accepts CSV evaluation data via a drag-and-drop zone. AXIS auto-detects the data format and validates the schema on the backend.
After upload, AXIS reports the detected format (tree_format, flat_format, or simple_judgment), the total record count, and a preview table of the first rows.
Tree/flat format uses metric_name and metric_score columns (one row per metric observation). Simple judgment uses columns ending in _score (one row per test case with all scores as columns). The Scorecard and Report Generator require tree/flat format for hierarchical features.
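The detection rules above can be sketched as a small function. This is an illustrative sketch only, not AXIS's actual implementation; the format labels match those reported in the UI.

```python
# Illustrative sketch of CSV format auto-detection (NOT AXIS's real code).
# Tree/flat data carries metric_name + metric_score columns; simple
# judgment data carries one or more columns ending in _score.
def detect_format(columns: list[str]) -> str:
    cols = {c.lower() for c in columns}
    if {"metric_name", "metric_score"} <= cols:
        # One row per metric observation; metric_type enables the hierarchy.
        return "tree_format" if "metric_type" in cols else "flat_format"
    if any(c.endswith("_score") for c in cols):
        # One row per test case, scores spread across columns.
        return "simple_judgment"
    return "unknown"

print(detect_format(["dataset_id", "query", "metric_name",
                     "metric_score", "metric_type"]))  # → tree_format
```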
Scorecard
The Scorecard step (/evaluate/scorecard) provides an executive summary of your evaluation results. It combines KPI cards, a hierarchical metric table with weights and sparklines, drill-down modals, and a report generator.
KPI Cards
Five cards at the top give you an at-a-glance summary of the results.
Hierarchical Table Columns
| Column | Description |
|---|---|
| Metric | Tree-indented name with expand/collapse toggles. Components show at the top level; metrics nest underneath. |
| Type | component (parent aggregate) or metric (leaf evaluation metric). |
| Weight | Normalized percentage weight. Auto-computed so all sibling weights sum to 100% within each parent. |
| Avg Score | Color-coded badge: ≥ 0.7, 0.5–0.7, < 0.5. |
| Range | Min–Max scores observed across all test cases for this metric. |
| Distribution | Inline sparkline histogram showing how scores are distributed. |
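The weight normalization described in the table (sibling weights summing to 100% within each parent) can be sketched as follows. The function name and the zero-total fallback are illustrative assumptions, not AXIS's documented behavior.

```python
# Hypothetical sketch of the weight rule: within each parent, raw sibling
# weights are rescaled so they sum to 100%.
def normalize_weights(raw: dict[str, float]) -> dict[str, float]:
    total = sum(raw.values())
    if total == 0:
        # Assumed degenerate-case handling: spread weight evenly.
        return {name: 100.0 / len(raw) for name in raw}
    return {name: 100.0 * w / total for name, w in raw.items()}

# Three sibling metrics under one parent component:
print(normalize_weights({"Faithfulness": 2.0, "Relevance": 1.0, "Fluency": 1.0}))
# → {'Faithfulness': 50.0, 'Relevance': 25.0, 'Fluency': 25.0}
```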
Drill-Down Modal
Click the eye icon on any scorecard row to open a detailed view of that metric — showing the full score distribution chart, per-test-case breakdown, and the failing cases that need attention.
Deep Dive
The Deep Dive step (/evaluate/deep-dive) provides seven interactive sub-tabs for granular analysis. Each tab focuses on a different aspect of your evaluation data.
The Overview tab renders two charts: a Mean Scores by Metric bar chart and a Metric Radar. The active sub-tab is tracked in the URL query string (?tab=overview).
Sub-Tab Reference
| Tab | Icon | Shows |
|---|---|---|
| Overview | BarChart3 | KPI cards (avg score, pass rate, test cases, variance) + bar chart of mean scores + radar chart |
| Distribution | PieChart | Violin or box charts per metric. Toggle between violin and box mode. Multi-select metrics to compare. |
| Tradeoffs | Workflow | Scatter plot of metric A vs metric B. Color-by grouping (experiment, metadata fields). Correlation coefficients. |
| Tree | GitBranch | D3-powered hierarchical tree visualization of the metric hierarchy, matching the scorecard structure. |
| Response | FileText | Response length analysis — how output length correlates with evaluation scores. |
| Conversation | MessageSquare | Turn count analysis — how multi-turn conversation depth affects evaluation quality. |
| Metadata | Tag | Score breakdowns by metadata fields (experiment name, model, category, or any custom metadata key). |
Distribution Tab
Shows violin or box plots for each selected metric. Use the metric selector to include or exclude specific metrics from the chart.
- Violin mode shows the full probability density — wider areas indicate more data points at that score level
- Box mode switches to traditional box plots with quartiles, whiskers, and outlier dots
- Metric pills at the top-right toggle which metrics appear in the chart
Tradeoffs Tab
Scatter plot with two metric axes — useful for understanding metric correlations. For example, plot Faithfulness vs Completeness to see if improving one hurts the other.
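The correlation coefficient shown on this tab is standard Pearson correlation between the two metrics' per-case scores. A minimal sketch (the metric names and scores below are toy data, not from a real run):

```python
import math

# Pearson correlation between two metrics' per-case scores — the statistic
# the Tradeoffs tab reports alongside the scatter plot.
def pearson(xs: list[float], ys: list[float]) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

faithfulness = [0.9, 0.8, 0.7, 0.95]
completeness = [0.6, 0.7, 0.8, 0.55]
print(round(pearson(faithfulness, completeness), 2))
# → -1.0 (a perfect tradeoff in this toy data)
```

A strongly negative coefficient, as here, means improving one metric tends to hurt the other.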
Tree Tab
An interactive D3 force-directed tree visualization showing the full metric hierarchy from the scorecard. Nodes are sized by weight and colored by score.
Response & Conversation Tabs
These tabs analyze how structural properties of the input/output relate to evaluation quality:
- Response tab — correlates response length (character/token count) with metric scores
- Conversation tab — correlates conversation turn count with metric scores (useful for multi-turn chat evaluations)
Metadata Tab
Breaks down scores by any metadata field in your data. Select a field (e.g., evaluation_name, model, or custom fields from the metadata JSON column) to see grouped bar charts and summary tables.
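The grouping behind these breakdowns amounts to a group-by-and-average over the chosen field. A stdlib sketch with made-up rows (field names and values are examples, not a prescribed schema):

```python
from collections import defaultdict

# Sketch of the Metadata tab's grouping: average a metric's scores by any
# metadata field. The rows below are illustrative.
rows = [
    {"model": "gpt-4o", "metric_score": 0.85},
    {"model": "gpt-4o", "metric_score": 0.91},
    {"model": "gpt-4o-mini", "metric_score": 0.72},
]

def breakdown_by(rows, field):
    groups = defaultdict(list)
    for row in rows:
        groups[row[field]].append(row["metric_score"])
    # Mean score per group, ready to feed a grouped bar chart.
    return {value: sum(scores) / len(scores) for value, scores in groups.items()}

print(breakdown_by(rows, "model"))
```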
Compare
The Compare step (/evaluate/compare) enables side-by-side analysis of multiple experiments. It supports two view modes: Test Cases (per-case comparison) and Experiments (head-to-head aggregate analysis).
Chart Types
| Chart | What It Shows | Best For |
|---|---|---|
| Win/Loss | Stacked horizontal bars per metric: baseline wins (left), ties (center), challenger wins (right) | Quick per-metric dominance comparison |
| Distribution | Overlaid score density curves for baseline vs challenger on a single metric | Understanding where score distributions differ |
| Radar | Multi-metric spider chart with one polygon per experiment | Holistic multi-dimensional comparison |
| Bar | Grouped bar chart of mean scores per metric per experiment | Direct numeric comparison across metrics |
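The Win/Loss tally for a single metric reduces to a per-case comparison. A hedged sketch (the tie tolerance `epsilon` is an assumption; AXIS's actual tie rule is not documented here):

```python
# Toy sketch of a Win/Loss tally for one metric: per-case scores for a
# baseline and a challenger, with near-equal scores counted as ties.
def win_loss(baseline: list[float], challenger: list[float],
             epsilon: float = 1e-6) -> dict[str, int]:
    wins = losses = ties = 0
    for b, c in zip(baseline, challenger):
        if abs(b - c) <= epsilon:
            ties += 1
        elif c > b:
            wins += 1    # challenger wins this case
        else:
            losses += 1  # baseline wins this case
    return {"challenger_wins": wins, "baseline_wins": losses, "ties": ties}

print(win_loss([0.7, 0.8, 0.9], [0.9, 0.8, 0.6]))
# → {'challenger_wins': 1, 'baseline_wins': 1, 'ties': 1}
```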
Test Cases View
In Test Cases mode, the Compare step shows a per-case table with side-by-side outputs and per-metric score deltas. Features include:
- Side-by-side table with baseline and challenger columns
- Score delta badges — green when challenger improves, red when it regresses
- Case diff view — inline text diff highlighting output differences
- Test case detail modal — full view of a single case with all fields and metric scores
- Export — download the full comparison as a CSV report
Experiments are identified by the evaluation_name column. If your data has only one experiment, the Experiments view shows a prompt to upload multi-experiment data.
Data Formats
AXIS auto-detects your CSV format on upload. Three formats are supported for the Evaluate module:
Tree / Flat Format (Recommended)
Each row is a single metric observation. Required for the Scorecard's hierarchical features and the Report Generator.
dataset_id,query,actual_output,metric_name,metric_score,metric_type,evaluation_name,metadata
case_001,What is...,The answer...,Faithfulness,0.85,metric,gpt-4o-run,"{""model"":""gpt-4o""}"
case_001,What is...,The answer...,Overall Quality,0.82,component,gpt-4o-run,"{}"
Simple Judgment Format
Each row contains all metrics as columns (column names ending in _score). Simpler but does not support hierarchical scorecard features.
dataset_id,query,actual_output,faithfulness_score,relevance_score
case_001,What is...,The answer...,0.85,0.92
The metric_type column distinguishes component (parent aggregate) from metric (leaf evaluation score). The scorecard uses this to build the hierarchy; if metric_type is missing, all metrics are treated as flat leaves.
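If your data was collected in simple judgment shape, it can be melted into tree/flat shape to unlock the hierarchical features. A minimal sketch, assuming a title-cased metric name derived from the column prefix (AXIS does not prescribe this naming):

```python
# Sketch: convert simple judgment rows (one row per case, *_score columns)
# into tree/flat rows (one row per metric observation).
wide = [
    {"dataset_id": "case_001", "query": "What is...",
     "actual_output": "The answer...",
     "faithfulness_score": "0.85", "relevance_score": "0.92"},
]

long_rows = []
for row in wide:
    # Shared case fields, copied onto every metric observation.
    base = {k: v for k, v in row.items() if not k.endswith("_score")}
    for col, val in row.items():
        if col.endswith("_score"):
            long_rows.append({**base,
                              "metric_name": col[:-len("_score")].title(),
                              "metric_score": float(val),
                              "metric_type": "metric"})

print(len(long_rows))  # 2 metric observations from 1 case
print(long_rows[0]["metric_name"], long_rows[0]["metric_score"])
```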