Monitoring
Real-time observability for production AI systems — track scores, latency, classifications, and anomalies across every deployment.
Why Use Monitoring?
Your evaluation metrics don't stop mattering after deployment. The Monitoring module gives you continuous visibility into how your AI agents and models perform in production, so you can catch regressions before users do.
Score Trends
Track faithfulness, relevance, and custom metrics over time with configurable granularity.
Latency Tracking
Monitor P50/P95/P99 response times grouped by environment, source, or component.
Classification Breakdowns
See how categorical metrics shift over time — topic drift, sentiment changes, intent distribution.
Anomaly Alerts
Automatic z-score, moving average, and threshold-based alerting with zero configuration.
Quick Start
Get monitoring data flowing in under two minutes:
Navigate to Monitor
Click Monitor in the left sidebar. If no data is loaded, you'll see the upload screen.
Upload CSV or Connect Database
Drag a CSV file into the upload zone, or configure a DuckDB/Postgres connection in Settings → Database for automatic sync.
Explore Your Metrics
Once data loads, the dashboard populates with KPI cards, trend charts, and trace tables. Use the filter bar to slice by environment, source, or time range.
Page Anatomy
Here's how the Monitoring dashboard is organized, with every major section labeled:
At the top sits the page title ("Monitor") and subtitle ("Real-time performance monitoring and alerts"), followed by the filter bar. The filter dropdowns map to the environment, source_component, and source_type fields and are only shown when the data contains those fields.
Score Metrics Tab
The default tab. Shows numeric evaluation scores (0–1) for metrics like Faithfulness, Relevance, Completeness, and any custom score metrics in your data.
Score Trend Chart
A full-width time-series chart plotting each metric as a separate line with P50–P95 band shading. Two dashed reference lines mark your configured thresholds:
- Good threshold (default 0.7) — green dashed line
- Pass threshold (default 0.5) — yellow dashed line
Controls:
- Granularity toggle (top-right of chart) — switch between Hourly, Daily, and Weekly aggregation
- Hover any data point to see metric name, timestamp, avg/P50/P95/P99 scores, and sample count
- The legend at the bottom is interactive — click a metric name to show/hide its line
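The granularity toggle boils down to truncating each timestamp to its bucket and aggregating the scores that land there. A minimal sketch of that bucketing, using illustrative sample data (the function names and exact percentile method are assumptions, not the AXIS implementation):

```python
from datetime import datetime
from statistics import mean, quantiles

# Hypothetical observations: (timestamp, score) pairs for one metric.
points = [
    (datetime(2024, 1, 15, 10, 5), 0.91),
    (datetime(2024, 1, 15, 10, 40), 0.74),
    (datetime(2024, 1, 15, 11, 10), 0.62),
    (datetime(2024, 1, 15, 11, 55), 0.88),
]

def bucket_key(ts, granularity):
    """Truncate a timestamp to its Hourly/Daily/Weekly bucket start."""
    if granularity == "hourly":
        return ts.replace(minute=0, second=0, microsecond=0)
    if granularity == "daily":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    # weekly: snap to the Monday of the ISO week
    return datetime.fromordinal(ts.toordinal() - ts.weekday())

def aggregate(points, granularity):
    """Group scores by bucket and compute avg/P50/P95 per bucket."""
    buckets = {}
    for ts, score in points:
        buckets.setdefault(bucket_key(ts, granularity), []).append(score)
    out = {}
    for key, scores in sorted(buckets.items()):
        # quantiles() needs >= 2 points; degenerate buckets repeat the lone score
        qs = quantiles(scores, n=100) if len(scores) > 1 else [scores[0]] * 99
        out[key] = {"avg": mean(scores), "p50": qs[49], "p95": qs[94],
                    "count": len(scores)}
    return out

hourly = aggregate(points, "hourly")
```

Switching the toggle re-runs the same aggregation with a coarser `bucket_key`, which is why hourly charts show more points than daily ones over the same range.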
Latency & Metric Breakdown Charts
Below the trend chart, two charts sit side by side:
Latency Distribution
- Histogram showing how response times are distributed
- Vertical dashed lines mark P50, P95, and P99 percentiles
- Use the Group By dropdown (top-right) to split by Environment, Source, Component, or Type
Metric Pass Rates
- Horizontal bar chart — one bar per metric
- Color-coded: ≥ 70% green, 50–70% yellow, < 50% red
- Reference lines at the 50% and 70% marks
- Hover any bar for the exact pass rate, average score, and sample count
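The pass-rate coloring follows directly from the two reference thresholds. A small sketch, assuming "pass" means score >= 0.5 (sample scores are illustrative):

```python
def pass_rate(scores, pass_threshold=0.5):
    """Fraction of scores at or above the pass threshold."""
    return sum(s >= pass_threshold for s in scores) / len(scores)

def band(rate):
    """Color band used by the Metric Pass Rates chart."""
    if rate >= 0.70:
        return "green"
    if rate >= 0.50:
        return "yellow"
    return "red"

scores = {"Faithfulness": [0.92, 0.61, 0.88, 0.34],
          "Relevance": [0.88, 0.18, 0.92, 0.95]}
rates = {metric: pass_rate(vals) for metric, vals in scores.items()}
```

A metric with a 0.75 pass rate would render as a green bar, with reference lines still drawn at 0.50 and 0.70 for comparison.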
Top Failing Outputs
Below the charts, a table surfaces the 15 lowest-scoring traces across all metrics — your worst performers at a glance.
| Metric | Trace ID | Timestamp | Input | LLM Output | |
|---|---|---|---|---|---|
| Faithfulness 0.12 | 01KFX...a3b | Feb 14, 14:23 | … | … | 👁 |
| Relevance 0.18 | 01KFX...d7e | Feb 14, 13:41 | … | … | 👁 |
| Completeness 0.31 | 01KFX...f2c | Feb 14, 12:09 | … | … | 👁 |
- The Threshold dropdown (top-right) controls the cutoff — options: 0.3, 0.5, 0.7, 0.8, 0.9
- Each row shows a colored indicator: red for scores below half the threshold, yellow otherwise
- Click the eye icon on any row to open the Trace Detail Modal
Evaluation Traces Table
The full paginated trace table appears below, separated by an "ALL TRACES" divider.
Evaluation Traces (12,847 records)
| Trace ID | Timestamp ↕ | Model | Env | Latency | Metric | Score | |
|---|---|---|---|---|---|---|---|
| 01KFX...c9a | Feb 14, 15:02 | gpt-4o | production | 0.84s | Faithfulness | 0.92 | 👁 |
| 01KFX...c9a | Feb 14, 15:02 | gpt-4o | production | 0.84s | Relevance | 0.88 | 👁 |
| 01KFX...b4f | Feb 14, 14:58 | claude-3.5 | staging | 1.23s | Faithfulness | 0.61 | 👁 |
| 01KFX...a1d | Feb 14, 14:55 | gpt-4o | production | 2.41s | Completeness | 0.34 | 👁 |
Controls in the header bar:
- Search — filters by trace ID, query text, or output text (debounced 300ms)
- Metric filter — show only a specific metric (e.g., just "Faithfulness")
- Sort — toggle between "Recent" (timestamp) and "Score"; click the arrow to flip ascending/descending
- Pagination — 10 records per page, server-side when using DuckDB
Classification Tab
Visible when your data includes metrics with metric_category: CLASSIFICATION (e.g., intent detection, topic classification, sentiment). This tab shows how categorical values are distributed and how they shift over time.
The tab is laid out in three panels: Classification Summary, Category Distribution, and Category Trends.
Key interactions:
- Source toggle (top) — switch between grouping by the Explanation or Output field
- Metric selector — choose which classification metric to analyze
- Trends toggle — switch the Y-axis between percentage (relative) and count (absolute)
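The percentage/count toggle is just two views of the same tally: absolute counts per category, or each count divided by the bucket total. A sketch with hypothetical classified outputs:

```python
from collections import Counter

# Hypothetical classification outputs for one time bucket.
labels = ["billing", "billing", "support", "refund", "support", "billing"]

counts = Counter(labels)                                       # absolute view
total = sum(counts.values())
percentages = {k: 100 * v / total for k, v in counts.items()}  # relative view
```

The relative view is the one that reveals drift: a category can hold a steady count while its share of traffic collapses as volume grows.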
Analysis Insights Tab
For metrics with metric_category: ANALYSIS — structured JSON evaluations like detailed reasoning, multi-signal assessments, or chain-of-thought outputs. Instead of charts, this tab shows an expandable card feed.
Recent Analysis Records
- Cards are paginated (10 per page) with controls in the header
- The first card auto-expands on load
- Click any card header to expand it and see the full query, output, explanation, and signals
- Signals render as colored badges: ✓ pass or ✗ fail
Alerts Tab
AXIS automatically detects anomalies using multiple detection methods — no configuration required. When issues are found, the Alerts tab shows a badge count on the tab header.
Alert types and their value blocks:
| Detection Method | Shows | Description |
|---|---|---|
| threshold | Current, Target, Gap | Value crossed a configured threshold |
| z-score | Current, Z-Score, Threshold | Statistical outlier (default: |z| > 2.0) |
| moving-average | Current, MA, Deviation | Deviated from rolling average |
| rate-of-change | Current, Previous, Change | Sudden metric shift between periods |
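To make the detection methods above concrete, here is a minimal sketch of the z-score and moving-average checks. This is illustrative logic matching the table's defaults, not the actual AXIS detector; function names and the deviation cutoff are assumptions:

```python
from statistics import mean, stdev

def zscore_alert(history, current, z_threshold=2.0):
    """Flag `current` as a statistical outlier when |z| > z_threshold."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return False, 0.0
    z = (current - mu) / sigma
    return abs(z) > z_threshold, z

def moving_average_alert(series, window=5, max_deviation=0.15):
    """Flag the newest point if it strays from the rolling average
    of the preceding `window` points by more than max_deviation."""
    ma = mean(series[-window - 1:-1])
    deviation = series[-1] - ma
    return abs(deviation) > max_deviation, ma, deviation

# A stable metric history, then a sudden drop.
history = [0.82, 0.85, 0.80, 0.84, 0.83]
alert, z = zscore_alert(history, 0.41)
```

A drop from a tight 0.80-0.85 band to 0.41 produces a large negative z-score, so the point fires a z-score alert; the Current/Z-Score/Threshold values shown in the alert card correspond to the inputs and output of this check.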
Trace Detail Modal
Click the eye icon on any trace row to open a full detail view. The modal shows everything about a single evaluation record in one place.
Modal sections (top to bottom):
- Score header — large colored score circle with progress bar and metric name
- Trace ID banner — full ID with a copy button
- Input / Output — side-by-side panels showing the query and LLM response
- Expected Output — the ground truth, if present in your data
- Explanation — blue-highlighted evaluator reasoning
- Critique — yellow-highlighted critique notes, if present
- Signals — colored pass/fail badges from structured evaluation outputs
- Additional Metadata — any extra fields, rendered as key-value pairs or collapsible JSON
- Footer — model, environment, latency, timestamp at a glance
Filtering & Time Ranges
Filters apply globally across all tabs. When you change a filter, the KPI cards, charts, tables, and alerts all update together.
Metadata Filters
The filter dropdowns at the top appear conditionally — only if your data contains the corresponding field:
| Filter | Data Field | Example Values |
|---|---|---|
| Environment | environment | production, staging, development |
| Component | source_component | retriever, generator, router |
| Type | source_type | langfuse, custom, opentelemetry |
There is no dedicated model filter; to slice by model, add a source_component or environment column to your data that identifies the model.
Time Range Presets
The time range picker offers quick presets plus a custom date range mode:
- Presets: Last hour · Last 6 hours · Last 24 hours (default) · Last 7 days · Last 30 days
- Custom range — opens two date inputs; click Apply to confirm
The reset button (↻) clears all metadata filters and resets the time range back to "Last 24 hours".
Data Formats
AXIS auto-detects your CSV format on upload. Two formats are supported:
Long Format (Recommended)
Each row is a single metric observation. Best for production pipelines that emit normalized metric data.
dataset_id,query,actual_output,metric_name,metric_score,timestamp,environment,source_name
01KFX...,What is...,The answer...,Faithfulness,0.85,2024-01-15T10:30:00,production,alpha_bot
01KFX...,What is...,The answer...,Relevance,0.92,2024-01-15T10:30:00,production,alpha_bot
Wide Format
Each row contains all metrics as columns. Columns ending in _score are auto-detected as metrics.
dataset_id,query,actual_output,faithfulness_score,relevance_score,timestamp
01KFX...,What is...,The answer...,0.85,0.92,2024-01-15T10:30:00
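Since columns ending in `_score` are treated as metrics, converting wide format to the recommended long format is mechanical: explode each `_score` column into its own row. A stdlib sketch of that conversion (the helper name and the `.title()` metric naming are illustrative assumptions):

```python
import csv
import io

wide_csv = """dataset_id,query,actual_output,faithfulness_score,relevance_score,timestamp
t1,What is...,The answer...,0.85,0.92,2024-01-15T10:30:00
"""

def wide_to_long(text):
    """Turn each *_score column of a wide-format CSV into a long-format row."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        base = {k: v for k, v in record.items() if not k.endswith("_score")}
        for col in (c for c in record if c.endswith("_score")):
            rows.append({**base,
                         "metric_name": col[: -len("_score")].title(),
                         "metric_score": float(record[col])})
    return rows

long_rows = wide_to_long(wide_csv)
```

Each wide row with N `_score` columns becomes N long rows sharing the same trace metadata, which matches the repeated `dataset_id`/`timestamp` values in the long-format sample above.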
Data Pipeline (DuckDB)
For production use, connect AXIS to your PostgreSQL database. The sync engine handles everything automatically:
Concurrent Read
The sync engine runs dataset_query and results_query concurrently against your Postgres database.
Write to DuckDB
Data is written to internal DuckDB tables (monitoring_dataset and monitoring_results).
Join & Serve
A JOIN view (monitoring_data) combines both tables. All /api/monitoring/analytics/* endpoints query DuckDB directly for fast aggregation.
Incremental Sync
Periodic sync keeps DuckDB up-to-date without restarts. KPIs use a metadata cache fast path — pre-aggregated during sync and served instantly.