Evaluate

A guided five-step workflow for uploading evaluation data, reviewing scorecards, exploring deep-dive analytics, and comparing experiments side by side.

Why Use Evaluate?

Every LLM application needs a structured way to measure quality. The Evaluate module provides a complete pipeline — from importing raw results to interactive visualizations — so you can understand what your model gets right, where it struggles, and how changes compare across experiments.

🎯 Hierarchical Scorecards

Weighted, expandable metric trees with sparkline distributions and drill-down modals for every score.

🔍 Deep Dive Analysis

Seven sub-tabs covering overview KPIs, distributions, tradeoffs, tree views, response analysis, conversation turns, and metadata breakdowns.

🔄 Experiment Comparison

Head-to-head experiment analysis with win/loss charts, radar overlays, and per-case diff views.

📋 Report Generation

Auto-generate evaluation reports from the scorecard — scoped to the full hierarchy or any individual sub-tree.

Quick Start

Get your first evaluation scorecard in under two minutes:

1. Navigate to Evaluate — Click Evaluate in the left sidebar. The step navigation bar appears at the top, starting at Runner (or Upload if the Runner is disabled).
2. Upload Your CSV — Click the Upload step. Drag a CSV into the upload zone or click to browse. You can also load an example dataset to explore the UI immediately.
3. Review the Scorecard — After upload, click Continue to Scorecard (or click the Scorecard step). The KPI cards, hierarchical table, and report generator appear instantly.
4. Explore & Compare — Use Deep Dive for per-metric charts and analysis, or Compare to pit two experiments against each other.

💡 Tip
If you don't have evaluation data yet, click Example Model or Model Comparison on the upload page to load a pre-built dataset and explore every feature.

Step Navigation

The Evaluate section uses a horizontal step indicator that guides you through the workflow. Each step is a separate route under /evaluate/*.

[Screenshot: localhost:3500/evaluate/scorecard, "Evaluate: run evaluations, import results, and visualize with interactive charts." Step bar: 🔒 Runner (Run batch evaluations) · Upload (Import evaluation results) · 🏆 Scorecard (Executive summary view) · 🔎 Deep Dive (Detailed analysis) · 🔀 Compare (Compare experiments).]

The five-step navigation bar. Scorecard is active (primary color), Upload is completed (pale), Runner is disabled (lock icon), and Deep Dive / Compare are ahead (gray).
1. Runner (Disabled) — Batch evaluation runner. When disabled by an admin via the eval_runner_enabled feature flag, it shows a lock icon and a "Disabled by Admin" tooltip on hover.
2. Upload — Import CSV evaluation data via drag-and-drop or load an example dataset. Once data is loaded, this step turns pale to indicate completion.
3. Scorecard — Executive summary: KPI cards, hierarchical scorecard table with weights and sparklines, drill-down modals, and a report generator.
4. Deep Dive — Seven interactive sub-tabs for granular analysis: Overview, Distribution, Tradeoffs, Tree, Response, Conversation, and Metadata.
5. Compare — Multi-experiment comparison with two view modes (Test Cases and Experiments), head-to-head summaries, and four chart types.

Step state logic:

  • Active step — primary background with white text and icon
  • Past step — pale primary background with primary text (indicates you've been further ahead)
  • Future step — gray background with muted text
  • Connectors — the short lines between steps turn primary when a step is completed
  • Disabled — grayed out at 50% opacity with a lock icon; non-clickable

Upload

The Upload step (/evaluate/upload) accepts CSV evaluation data via a drag-and-drop zone. AXIS auto-detects the data format and validates the schema on the backend.

[Screenshot: localhost:3500/evaluate/upload, "Upload Evaluation Data: import your CSV file or try an example dataset." File Upload / Database tabs; drag-and-drop zone ("Drop your CSV file here, or click to browse · CSV files up to 50 MB"); example dataset buttons (📊 Example Model, 🔄 Model Comparison); success banner ("Data Loaded Successfully · Format: tree_format · 2,450 records") with a Continue to Scorecard button and a preview table:]

| dataset_id | query | actual_output | metric_name | metric_score |
| --- | --- | --- | --- | --- |
| case_001 | What is the return policy... | Our return policy allows... | Faithfulness | 0.91 |
| case_001 | What is the return policy... | Our return policy allows... | Relevance | 0.87 |
| case_002 | Explain the pricing tiers... | We offer three tiers... | Completeness | 0.54 |

Upload page showing the drag-and-drop zone, example datasets, and the success state with format badge, row count, and data preview table.
1. Drag-and-Drop Zone — Drop a CSV or click to browse. Highlights on hover. Supports files up to 50 MB.
2. Example Datasets — Pre-built sample data. "Example Model" loads a single-experiment dataset; "Model Comparison" loads multi-experiment data for the Compare step.
3. Success State — After upload, shows the detected format (tree_format, flat_format, or simple_judgment), the total record count, and a preview table of the first rows.
ℹ️ Info
AXIS auto-detects the data format. Tree format and flat format use columns metric_name and metric_score (one row per metric observation). Simple judgment uses columns ending in _score (one row per test case with all scores as columns). The Scorecard and Report Generator require tree/flat format for hierarchical features.
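The column rules in the note above are enough to sketch a format detector. The heuristic below is hypothetical (in particular, using the presence of metric_type to tell tree from flat format is an assumption); AXIS's real detector may differ:

```python
def detect_format(columns):
    """Guess the evaluation CSV format from its column names.

    Hypothetical heuristic following the column rules described
    above; AXIS's actual detection logic may differ.
    """
    cols = set(columns)
    if {"metric_name", "metric_score"} <= cols:
        # One row per metric observation. Assume a metric_type
        # column marks a hierarchical (tree) layout.
        return "tree_format" if "metric_type" in cols else "flat_format"
    if any(c.endswith("_score") for c in cols):
        # One row per test case, scores spread across *_score columns.
        return "simple_judgment"
    return "unknown"
```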

Scorecard

The Scorecard step (/evaluate/scorecard) provides an executive summary of your evaluation results. It combines KPI cards, a hierarchical metric table with weights and sparklines, drill-down modals, and a report generator.

KPI Cards

Five cards at the top give you an at-a-glance summary:

[Screenshot: Scorecard, "Executive summary with hierarchical metrics." KPI cards: Overall Weighted Score 0.782 (weighted avg across hierarchy), Score Variance 0.034 (consistency measure), Test Cases 245 (unique evaluations), Metrics 8 (metric_type = metric), Components 3 (metric_type = component). Report Generator panel with a Generate Full Report button; Expand All / Collapse All controls; "8 metrics · 3 components". Hierarchical table:]

| Metric | Type | Weight | Avg Score | Range |
| --- | --- | --- | --- | --- |
| Overall Quality | component | 100% | 0.782 | 0.34 – 0.97 |
| Faithfulness | metric | 35% | 0.847 | 0.42 – 0.97 |
| Relevance | metric | 30% | 0.815 | 0.38 – 0.95 |
| Completeness | metric | 20% | 0.623 | 0.21 – 0.88 |
| Safety | metric | 15% | 0.934 | 0.71 – 1.00 |

Scorecard view showing five KPI cards, the report generator panel, and the hierarchical metric table with expandable tree rows, weights, score badges, and distribution sparklines.
1. KPI Cards — Five metrics: Overall Weighted Score (weighted average across the full hierarchy), Score Variance (consistency measure), Test Cases, Metrics count, and Components count.
2. Report Generator — Click to generate a comprehensive evaluation report. Reports can also be scoped to individual sub-trees via the report icon on each table row.
3. Hierarchical Table — Expandable tree structure showing metrics nested under components. Each row shows a type badge, normalized weight, color-coded average score, score range, and a sparkline distribution histogram. Click the eye icon to open the drill-down modal.
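The Overall Weighted Score KPI is a weighted average of the per-metric averages. A minimal sketch of that aggregation, using the example weights and averages shown on this page (illustrative only, not the module's actual code):

```python
def weighted_score(metrics):
    """Weighted average of per-metric average scores.

    `metrics` maps metric name -> (avg_score, weight). Weights need
    not sum to 1; they are normalized here. Illustrative sketch only.
    """
    total = sum(w for _, w in metrics.values())
    return sum(score * w for score, w in metrics.values()) / total

example = {
    "Faithfulness": (0.847, 35),
    "Relevance":    (0.815, 30),
    "Completeness": (0.623, 20),
    "Safety":       (0.934, 15),
}
overall = weighted_score(example)  # ~0.806 for these four metrics
```

Note the result differs from the 0.782 shown in the screenshot, which aggregates all 8 metrics and 3 components, not just the four metrics listed here.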

Hierarchical Table Columns

| Column | Description |
| --- | --- |
| Metric | Tree-indented name with expand/collapse toggles. Components show at the top level; metrics nest underneath. |
| Type | component (parent aggregate) or metric (leaf evaluation metric). |
| Weight | Normalized percentage weight. Auto-computed so all sibling weights sum to 100% within each parent. |
| Avg Score | Color-coded badge by threshold: ≥ 0.7, 0.5–0.7, < 0.5. |
| Range | Min–max scores observed across all test cases for this metric. |
| Distribution | Inline sparkline histogram showing how scores are distributed. |
📝 Note
The table supports sorting by name, score, or weight via the column header arrows. Click Expand All / Collapse All to toggle the full tree at once.
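The sibling-weight normalization described in the Weight column can be sketched as follows. This is an assumed reimplementation of the documented behavior, not AXIS source code:

```python
def normalize_weights(raw):
    """Scale sibling weights so they sum to 100% within one parent.

    `raw` maps child name -> raw weight on any positive scale.
    Assumed behavior matching the Weight column description above.
    """
    total = sum(raw.values())
    return {name: round(100 * w / total, 1) for name, w in raw.items()}

# Raw weights 7/6/4/3 normalize to the 35/30/20/15 split shown above.
split = normalize_weights({"Faithfulness": 7, "Relevance": 6,
                           "Completeness": 4, "Safety": 3})
```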

Drill-Down Modal

Click the eye icon on any scorecard row to open a detailed view of that metric — showing the full score distribution chart, per-test-case breakdown, and the failing cases that need attention.

Deep Dive

The Deep Dive step (/evaluate/deep-dive) provides seven interactive sub-tabs for granular analysis. Each tab focuses on a different aspect of your evaluation data.

[Screenshot: localhost:3500/evaluate/deep-dive?tab=overview, "Deep Dive: detailed analysis and interactive visualizations." Sub-tab pills: Overview · Distribution · Tradeoffs · Tree · Response · Conversation · Metadata. KPI cards: Average Score 0.782 (above threshold), Pass Rate 76.3% (at ≥ 0.5 threshold), Test Cases 245 (unique evaluations), Score Variance 0.034 (cross-metric consistency). Charts: Mean Scores by Metric (Faithfulness 0.85, Relevance 0.82, Completeness 0.62, Safety 0.93, with reference lines at 0.5 and 0.7) and a Metric Radar (Faithfulness, Relevance, Completeness, Safety, Tone).]

Deep Dive overview showing the seven sub-tabs, KPI summary cards, mean scores bar chart, and a radar chart for multi-metric comparison.
1. Sub-Tab Navigation — Seven pills in a rounded bar. The active tab gets the primary background. Tab state persists in Zustand and syncs to the URL query parameter (?tab=overview).

Sub-Tab Reference

| Tab | Icon | Shows |
| --- | --- | --- |
| Overview | BarChart3 | KPI cards (avg score, pass rate, test cases, variance), a bar chart of mean scores, and a radar chart |
| Distribution | PieChart | Violin or box charts per metric. Toggle between violin and box mode. Multi-select metrics to compare. |
| Tradeoffs | Workflow | Scatter plot of metric A vs metric B. Color-by grouping (experiment, metadata fields). Correlation coefficients. |
| Tree | GitBranch | D3-powered hierarchical tree visualization of the metric hierarchy, matching the scorecard structure. |
| Response | FileText | Response length analysis — how output length correlates with evaluation scores. |
| Conversation | MessageSquare | Turn count analysis — how multi-turn conversation depth affects evaluation quality. |
| Metadata | Tag | Score breakdowns by metadata fields (experiment name, model, category, or any custom metadata key). |
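The Overview tab's KPIs (average score, pass rate at a threshold, score variance) are straightforward to reproduce from a list of raw scores. A minimal sketch, assuming scores as plain floats:

```python
from statistics import fmean, pvariance

def pass_rate(scores, threshold=0.5):
    """Fraction of scores at or above the pass threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

scores = [0.9, 0.6, 0.4, 0.8]      # made-up sample scores
avg = fmean(scores)                # average score KPI
rate = pass_rate(scores)           # pass rate at the default 0.5 threshold
var = pvariance(scores)            # score variance (consistency measure)
```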

Distribution Tab

Shows violin or box plots for each selected metric. Use the metric selector to include or exclude specific metrics from the chart.

[Screenshot: Deep Dive Distribution tab with a Violin / Box mode toggle and metric pills for Faithfulness, Relevance, Completeness, and Safety; the y-axis runs 0.0 to 1.0.]

Distribution tab showing violin plots for three selected metrics. The interquartile box and median line are visible within each violin shape.
  • Violin mode shows the full probability density — wider areas indicate more data points at that score level
  • Box mode switches to traditional box plots with quartiles, whiskers, and outlier dots
  • Metric pills at the top-right toggle which metrics appear in the chart

Tradeoffs Tab

Scatter plot with two metric axes — useful for understanding metric correlations. For example, plot Faithfulness vs Completeness to see if improving one hurts the other.

💡 Tip
Use the Color By dropdown to color scatter points by experiment name or metadata fields. This reveals whether different experiments cluster in distinct regions of the metric space.
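The correlation coefficient shown on this tab is the standard Pearson r between two metrics' scores. A self-contained sketch with made-up sample scores (not data from any real run):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

faithfulness = [0.91, 0.87, 0.54, 0.78, 0.95]  # hypothetical scores
completeness = [0.62, 0.58, 0.21, 0.49, 0.71]
r = pearson(faithfulness, completeness)  # strongly positive here
```

A value near +1 means the two metrics rise and fall together; a value near −1 would indicate a genuine tradeoff.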

Tree Tab

An interactive D3 force-directed tree visualization showing the full metric hierarchy from the scorecard. Nodes are sized by weight and colored by score.

Response & Conversation Tabs

These tabs analyze how structural properties of the input/output relate to evaluation quality:

  • Response tab — correlates response length (character/token count) with metric scores
  • Conversation tab — correlates conversation turn count with metric scores (useful for multi-turn chat evaluations)

Metadata Tab

Breaks down scores by any metadata field in your data. Select a field (e.g., evaluation_name, model, or custom fields from the metadata JSON column) to see grouped bar charts and summary tables.

Compare

The Compare step (/evaluate/compare) enables side-by-side analysis of multiple experiments. It supports two view modes: Test Cases (per-case comparison) and Experiments (head-to-head aggregate analysis).

[Screenshot: localhost:3500/evaluate/compare, "Compare: analyze and compare evaluation results," in Experiments mode. View-mode toggle: Test Cases / Experiments. Experiment selector: Baseline gpt-4o-experiment (245 cases) vs Challenger claude-3.5-experiment (245 cases). Head-to-head summary: Baseline Wins 127 · Ties 42 · Challenger Wins 76. Comparison Charts section with Win/Loss, Distribution, Radar, and Bar tabs; the Win/Loss chart shows stacked baseline-wins / tie / challenger-wins bars for Faithfulness, Relevance, Completeness, and Safety.]

Compare view in Experiments mode showing the experiment selector, head-to-head win summary, and a Win/Loss stacked bar chart with four chart type tabs.
1. View Mode Toggle — Switch between Test Cases (per-case side-by-side table with diff view) and Experiments (aggregate head-to-head comparison with charts).
2. Experiment Selector — Pick a baseline and a challenger from your experiments. The swap button (⇄) reverses them. Auto-selects when exactly two experiments exist.
3. Head-to-Head Summary — Aggregate win/tie/loss counts across all test cases and metrics. Shows which experiment outperforms overall.
4. Comparison Charts — Collapsible section with four chart types: Win/Loss (stacked horizontal bars per metric), Distribution (score density comparison), Radar (multi-metric overlay), and Bar (grouped mean scores).
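The head-to-head counts can be derived by pairing baseline and challenger scores per (test case, metric) and comparing them. A hedged sketch: the tie rule here (exact equality, with an optional tolerance) is an assumption, not AXIS's documented behavior:

```python
def head_to_head(baseline, challenger, tie_eps=0.0):
    """Count baseline wins / ties / challenger wins.

    `baseline` and `challenger` map (case_id, metric) -> score.
    `tie_eps` is an assumed tolerance: score gaps within it count
    as ties. Illustrative only.
    """
    wins = ties = losses = 0
    for key, b in baseline.items():
        c = challenger.get(key)
        if c is None:
            continue  # case/metric missing from the challenger run
        if abs(b - c) <= tie_eps:
            ties += 1
        elif b > c:
            wins += 1
        else:
            losses += 1
    return {"baseline_wins": wins, "ties": ties, "challenger_wins": losses}
```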

Chart Types

| Chart | What It Shows | Best For |
| --- | --- | --- |
| Win/Loss | Stacked horizontal bars per metric: baseline wins (left), ties (center), challenger wins (right) | Quick per-metric dominance comparison |
| Distribution | Overlaid score density curves for baseline vs challenger on a single metric | Understanding where score distributions differ |
| Radar | Multi-metric spider chart with one polygon per experiment | Holistic multi-dimensional comparison |
| Bar | Grouped bar chart of mean scores per metric per experiment | Direct numeric comparison across metrics |

Test Cases View

In Test Cases mode, the Compare step shows a per-case table with side-by-side outputs and per-metric score deltas. Features include:

  • Side-by-side table with baseline and challenger columns
  • Score delta badges — green when challenger improves, red when it regresses
  • Case diff view — inline text diff highlighting output differences
  • Test case detail modal — full view of a single case with all fields and metric scores
  • Export — download the full comparison as a CSV report
⚠️ Warning
The Compare step requires your data to contain at least two distinct values in the evaluation_name column. If your data has only one experiment, the Experiments view will show a prompt to upload multi-experiment data.

Data Formats

AXIS auto-detects your CSV format on upload. Three formats are supported for the Evaluate module:

Tree / Flat Format (Recommended)

Each row is a single metric observation. Required for the Scorecard's hierarchical features and the Report Generator.

dataset_id,query,actual_output,metric_name,metric_score,metric_type,evaluation_name,metadata
case_001,What is...,The answer...,Faithfulness,0.85,metric,gpt-4o-run,"{""model"":""gpt-4o""}"
case_001,What is...,The answer...,Overall Quality,0.82,component,gpt-4o-run,"{}"

Simple Judgment Format

Each row contains all metrics as columns (column names ending in _score). Simpler but does not support hierarchical scorecard features.

dataset_id,query,actual_output,faithfulness_score,relevance_score
case_001,What is...,The answer...,0.85,0.92
ℹ️ Info
The metric_type column distinguishes component (parent aggregate) from metric (leaf evaluation score). The scorecard uses this to build the hierarchy. If metric_type is missing, all metrics are treated as flat leaves.
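If you have simple-judgment data but need the flat layout, the wide *_score columns can be melted into one row per metric. A pure-Python sketch following the column conventions above (illustrative; metric names are derived by stripping the _score suffix, so their casing may differ from your hierarchy's display names):

```python
import csv
import io

def melt_simple_judgment(csv_text):
    """Expand simple-judgment rows (one per case, *_score columns)
    into flat-format rows (one per metric observation).

    Illustrative sketch only, following the examples above.
    """
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        base = {k: v for k, v in rec.items() if not k.endswith("_score")}
        for col, val in rec.items():
            if col.endswith("_score"):
                rows.append({**base,
                             "metric_name": col[:-len("_score")],
                             "metric_score": float(val)})
    return rows

sample = """dataset_id,query,actual_output,faithfulness_score,relevance_score
case_001,What is...,The answer...,0.85,0.92
"""
flat = melt_simple_judgment(sample)  # two flat rows from one wide row
```

With pandas available, pd.melt achieves the same reshaping in one call.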

Next Steps

AXIS Documentation · Built with MkDocs Material