Monitoring

Real-time observability for production AI systems — track scores, latency, classifications, and anomalies across every deployment.

Why Use Monitoring?

Your evaluation metrics don't stop mattering after deployment. The Monitoring module gives you continuous visibility into how your AI agents and models perform in production, so you can catch regressions before users do.

📈 Score Trends

Track faithfulness, relevance, and custom metrics over time with configurable granularity.

⏱️ Latency Tracking

Monitor P50/P95/P99 response times grouped by environment, source, or component.

📊 Classification Breakdowns

See how categorical metrics shift over time — topic drift, sentiment changes, intent distribution.

🚨 Anomaly Alerts

Automatic z-score, moving average, and threshold-based alerting with zero configuration.

Quick Start

Get monitoring data flowing in under two minutes:

1. Navigate to Monitor — Click Monitor in the left sidebar. If no data is loaded, you'll see the upload screen.

2. Upload CSV or Connect Database — Drag a CSV file into the upload zone, or configure a DuckDB/Postgres connection in Settings → Database for automatic sync.

3. Explore Your Metrics — Once data loads, the dashboard populates with KPI cards, trend charts, and trace tables. Use the filter bar to slice by environment, source, or time range.

💡 Tip
For production deployments, use the PostgreSQL auto-sync. It runs incremental updates automatically so your dashboard always shows the latest data without manual uploads.

Page Anatomy

Here's how the Monitoring dashboard is organized, with every major section labeled:

The Monitoring dashboard showing the full page anatomy — source selector, filters, KPIs, metric tabs, and score trend chart.
1. Source Selector — Choose which data source to analyze. Shows all connected sources (CSV uploads, database tables).
2. Metadata Filters — Narrow data by environment, source_component, and source_type. Only visible when the data contains those fields.
3. Time Range Picker — Presets (1h, 6h, 24h, 7d, 30d) or a custom date range. The reset button next to it clears all active filters.
4. KPI Cards — Pre-aggregated metrics that load instantly: average score, pass rate, latency percentiles, and active alert count.
5. Metric Category Tabs — Switch between Score Metrics, Classification, Analysis Insights, and Alerts. Tabs appear dynamically based on what metric types exist in your data.

Score Metrics Tab

The default tab. Shows numeric evaluation scores (0–1) for metrics like Faithfulness, Relevance, Completeness, and any custom score metrics in your data.

Score Trend Chart

A full-width time-series chart plotting each metric as a separate line with P50–P95 band shading. Two dashed reference lines mark your configured thresholds:

  • Good threshold (default 0.7) — green dashed line
  • Pass threshold (default 0.5) — yellow dashed line
ℹ️ Info
Thresholds are configurable per source in Settings → Database → Thresholds. Per-source overrides take priority over global defaults.
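A minimal sketch of how per-source overrides can layer on top of global defaults. The config shape and function name here are hypothetical illustrations, not AXIS's actual internals:

```python
# Hypothetical threshold resolution: per-source values win over defaults.
GLOBAL_DEFAULTS = {"good": 0.7, "pass": 0.5}

def resolve_thresholds(config: dict, source: str) -> dict:
    """Merge global defaults, config-level defaults, then source overrides."""
    merged = dict(GLOBAL_DEFAULTS)
    merged.update(config.get("defaults", {}))
    merged.update(config.get("sources", {}).get(source, {}))
    return merged

config = {"defaults": {"good": 0.7, "pass": 0.5},
          "sources": {"alpha_bot": {"good": 0.8}}}
print(resolve_thresholds(config, "alpha_bot"))  # {'good': 0.8, 'pass': 0.5}
```

A source with no override simply falls through to the defaults.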

Controls:

  • Granularity toggle (top-right of chart) — switch between Hourly, Daily, and Weekly aggregation
  • Hover any data point to see metric name, timestamp, avg/P50/P95/P99 scores, and sample count
  • The legend at the bottom is interactive — click a metric name to show/hide its line
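The granularity toggle boils down to bucketing observations by a truncated timestamp and averaging each bucket per metric. This is an illustrative stdlib sketch of that idea, not AXIS's implementation:

```python
from datetime import datetime
from collections import defaultdict
from statistics import mean

# Bucket (timestamp, metric, score) rows by granularity, then average.
def aggregate(rows, granularity="hourly"):
    fmt = {"hourly": "%Y-%m-%d %H:00", "daily": "%Y-%m-%d", "weekly": "%G-W%V"}[granularity]
    buckets = defaultdict(list)
    for ts, metric, score in rows:
        buckets[(ts.strftime(fmt), metric)].append(score)
    return {key: round(mean(vals), 3) for key, vals in buckets.items()}

rows = [
    (datetime(2024, 1, 15, 10, 12), "Faithfulness", 0.90),
    (datetime(2024, 1, 15, 10, 47), "Faithfulness", 0.80),
    (datetime(2024, 1, 15, 11, 5),  "Faithfulness", 0.70),
]
print(aggregate(rows, "hourly"))
# {('2024-01-15 10:00', 'Faithfulness'): 0.85, ('2024-01-15 11:00', 'Faithfulness'): 0.7}
```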

Latency & Metric Breakdown Charts

Below the trend chart, two charts sit side by side:

Left: latency histogram with P50/P95/P99 percentile markers. Right: horizontal bar chart showing per-metric pass rates with color-coded thresholds.

Latency Distribution

  • Histogram showing how response times are distributed
  • Vertical dashed lines mark P50, P95, and P99 percentiles
  • Use the Group By dropdown (top-right) to split by Environment, Source, Component, or Type
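For intuition, the percentile markers can be computed with a simple nearest-rank rule. This is a back-of-envelope sketch with made-up latencies; AXIS's exact interpolation method is not specified here:

```python
# Nearest-rank percentile: sort, then index by rank.
def percentile(values, p):
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

latencies = [0.4, 0.6, 0.8, 0.9, 1.1, 1.4, 2.0, 2.6, 3.8, 4.9]  # seconds
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies, p):.2f}s")
```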

Metric Pass Rates

  • Horizontal bar chart — one bar per metric
  • Color-coded: ≥ 70% green, 50–70% yellow, < 50% red
  • Reference lines at the 50% and 70% marks
  • Hover any bar for the exact pass rate, average score, and sample count
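The pass-rate roll-up and the color buckets described above can be sketched in a few lines (metric names and scores below are invented for illustration):

```python
# Fraction of observations at or above the pass threshold.
def pass_rate(scores, threshold=0.5):
    return sum(s >= threshold for s in scores) / len(scores)

# Color buckets matching the chart: >= 70% green, 50-70% yellow, < 50% red.
def color(rate):
    return "green" if rate >= 0.70 else "yellow" if rate >= 0.50 else "red"

scores = {"Faithfulness": [0.9, 0.8, 0.6, 0.4], "Toxicity": [0.3, 0.4, 0.9, 0.2]}
for metric, vals in scores.items():
    r = pass_rate(vals)
    print(f"{metric}: {r:.0%} ({color(r)})")
# Faithfulness: 75% (green)
# Toxicity: 25% (red)
```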

Top Failing Outputs

Below the charts, a table surfaces the 15 lowest-scoring traces across all metrics — your worst performers at a glance.

Top failing outputs table. Adjust the threshold dropdown to change which records surface. Click the eye icon to open the detail modal.
  • The Threshold dropdown (top-right) controls the cutoff — options: 0.3, 0.5, 0.7, 0.8, 0.9
  • Each row shows a colored indicator: red for scores below half the threshold, yellow otherwise
  • Click the eye icon on any row to open the Trace Detail Modal
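A hypothetical sketch of how the "worst performers" list and the row indicator can be derived — keep rows below the cutoff, sort ascending, take the first 15, and flag anything under half the threshold as red:

```python
# Select the lowest-scoring traces below a threshold (worst first).
def top_failing(rows, threshold=0.5, limit=15):
    failing = [r for r in rows if r["score"] < threshold]
    return sorted(failing, key=lambda r: r["score"])[:limit]

# Row indicator: red below half the threshold, yellow otherwise.
def severity(score, threshold=0.5):
    return "red" if score < threshold / 2 else "yellow"

rows = [
    {"trace": "01KFX...a3b", "metric": "Faithfulness", "score": 0.12},
    {"trace": "01KFX...c9a", "metric": "Relevance", "score": 0.88},
    {"trace": "01KFX...f2c", "metric": "Completeness", "score": 0.31},
]
print([(r["trace"], severity(r["score"])) for r in top_failing(rows)])
# [('01KFX...a3b', 'red'), ('01KFX...f2c', 'yellow')]
```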

Evaluation Traces Table

The full paginated trace table appears below, separated by an "ALL TRACES" divider.

Full trace table with search, metric filter, sort controls, and pagination. Each row is one metric observation on one trace.

Controls in the header bar:

  • Search — filters by trace ID, query text, or output text (debounced 300ms)
  • Metric filter — show only a specific metric (e.g., just "Faithfulness")
  • Sort — toggle between "Recent" (timestamp) and "Score"; click the arrow to flip ascending/descending
  • Pagination — 10 records per page, server-side when using DuckDB
📝 Note
In long format data, each row is one metric observation (a trace appears multiple times, once per metric). In wide format, each row is one trace with all scores as columns.

Classification Tab

Visible when your data includes metrics with metric_category: CLASSIFICATION (e.g., intent detection, topic classification, sentiment). This tab shows how categorical values are distributed and how they shift over time.

Classification tab showing category summary, distribution chart, and stacked area trends with percentage/count toggle.

Key interactions:

  • Source toggle (top) — switch between grouping by Explanation or Output field
  • Metric selector — choose which classification metric to analyze
  • Trends toggle — switch the Y-axis between percentage (relative) and count (absolute)
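The percentage/count toggle is just two renderings of the same category counts. A minimal stdlib sketch, with invented labels:

```python
from collections import Counter

# Same counts, rendered as absolute counts or as shares of the total.
def distribution(labels, as_percentage=False):
    counts = Counter(labels)
    total = sum(counts.values())
    if as_percentage:
        return {k: round(100 * v / total, 1) for k, v in counts.items()}
    return dict(counts)

labels = ["billing_inquiry"] * 34 + ["tech_support"] * 24 + ["other"] * 42
print(distribution(labels, as_percentage=True))
# {'billing_inquiry': 34.0, 'tech_support': 24.0, 'other': 42.0}
```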

Analysis Insights Tab

For metrics with metric_category: ANALYSIS — structured JSON evaluations like detailed reasoning, multi-signal assessments, or chain-of-thought outputs. Instead of charts, this tab shows an expandable card feed.

Analysis Insights showing expandable cards with query, explanation, and structured signals. Click a card header to expand/collapse.
  • Cards are paginated (10 per page) with controls in the header
  • The first card auto-expands on load
  • Click any card header to expand it and see the full query, output, explanation, and signals
  • Signals render as colored badges: ✓ pass or ✗ fail

Alerts Tab

AXIS automatically detects anomalies using multiple detection methods — no configuration required. When issues are found, the Alerts tab shows a badge count on the tab header.

Alerts tab showing threshold violations and anomaly detection results, grouped by type with severity indicators and structured value blocks.

Alert types and their value blocks:

| Detection Method | Shows | Description |
| --- | --- | --- |
| threshold | Current, Target, Gap | Value crossed a configured threshold |
| z-score | Current, Z-Score, Threshold | Statistical outlier (default: \|z\| > 2.0) |
| moving-average | Current, MA, Deviation | Deviated from rolling average |
| rate-of-change | Current, Previous, Change | Sudden metric shift between periods |
💡 Tip
When there are no alerts, the tab displays a green "All systems nominal" confirmation. The badge count disappears from the tab header.
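For intuition, here is an illustrative z-score check matching the default |z| > 2.0 above; the moving-average and rate-of-change variants follow the same flag-the-outlier pattern. The function and sample series are not AXIS's implementation:

```python
from statistics import mean, stdev

# Flag points whose z-score exceeds the limit (default |z| > 2.0).
def zscore_alerts(series, limit=2.0):
    mu, sigma = mean(series), stdev(series)
    return [(i, x, round((x - mu) / sigma, 2))
            for i, x in enumerate(series)
            if sigma and abs((x - mu) / sigma) > limit]

scores = [0.81, 0.83, 0.80, 0.82, 0.79, 0.81, 0.42, 0.80]
print(zscore_alerts(scores))  # the 0.42 dip is the only outlier
```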

Trace Detail Modal

Click the eye icon on any trace row to open a full detail view. The modal shows everything about a single evaluation record in one place.

Trace detail modal showing score, input/output side by side, explanation, signal verdicts, and metadata footer.

Modal sections (top to bottom):

  1. Score header — large colored score circle with progress bar and metric name
  2. Trace ID banner — full ID with a copy button
  3. Input / Output — side-by-side panels showing the query and LLM response
  4. Expected Output — the ground truth, if present in your data
  5. Explanation — blue-highlighted evaluator reasoning
  6. Critique — yellow-highlighted critique notes, if present
  7. Signals — colored pass/fail badges from structured evaluation outputs
  8. Additional Metadata — any extra fields, rendered as key-value pairs or collapsible JSON
  9. Footer — model, environment, latency, timestamp at a glance

Filtering & Time Ranges

Filters apply globally across all tabs. When you change a filter, the KPI cards, charts, tables, and alerts all update together.

Metadata Filters

The filter dropdowns at the top appear conditionally — only if your data contains the corresponding field:

| Filter | Data Field | Example Values |
| --- | --- | --- |
| Environment | environment | production, staging, development |
| Component | source_component | retriever, generator, router |
| Type | source_type | langfuse, custom, opentelemetry |
⚠️ Warning
Filtering is by source metadata, not by model name. To compare models, add a source_component or environment column to your data that identifies the model.

Time Range Presets

The time range picker offers quick presets plus a custom date range mode:

  • Last hour · Last 6 hours · Last 24 hours (default) · Last 7 days · Last 30 days
  • Custom range — opens two date inputs; click Apply to confirm

The reset button (↻) clears all metadata filters and resets the time range back to "Last 24 hours".
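Mapping the preset labels above to concrete datetime bounds is straightforward; this is a hypothetical sketch, with a fixed `now` so the output is deterministic:

```python
from datetime import datetime, timedelta

# Preset label -> lookback window.
PRESETS = {"1h": timedelta(hours=1), "6h": timedelta(hours=6),
           "24h": timedelta(hours=24), "7d": timedelta(days=7),
           "30d": timedelta(days=30)}

def time_range(preset="24h", now=None):
    """Return (start, end) bounds for a preset, ending at `now`."""
    now = now or datetime.now()
    return now - PRESETS[preset], now

start, end = time_range("7d", now=datetime(2024, 1, 15, 12, 0))
print(start, "->", end)  # 2024-01-08 12:00:00 -> 2024-01-15 12:00:00
```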

Data Formats

AXIS auto-detects your CSV format on upload. Two formats are supported:

Long Format (Recommended)

Each row is a single metric observation. Best for production pipelines that emit normalized metric data.

dataset_id,query,actual_output,metric_name,metric_score,timestamp,environment,source_name
01KFX...,What is...,The answer...,Faithfulness,0.85,2024-01-15T10:30:00,production,alpha_bot
01KFX...,What is...,The answer...,Relevance,0.92,2024-01-15T10:30:00,production,alpha_bot

Wide Format

Each row contains all metrics as columns. Columns ending in _score are auto-detected as metrics.

dataset_id,query,actual_output,faithfulness_score,relevance_score,timestamp
01KFX...,What is...,The answer...,0.85,0.92,2024-01-15T10:30:00
💡 Tip
Long format gives you richer per-metric metadata and works better with the Classification and Analysis tabs. Use wide format only for quick ad-hoc uploads.
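If you already have wide-format data, converting it to long format is mechanical: every column ending in `_score` becomes its own (metric_name, metric_score) row, mirroring the auto-detection rule above. A stdlib sketch (the conversion itself is hypothetical; AXIS does not require it):

```python
import csv, io

# Melt wide-format CSV rows into one row per metric observation.
def wide_to_long(wide_csv):
    out = []
    for row in csv.DictReader(io.StringIO(wide_csv)):
        base = {k: v for k, v in row.items() if not k.endswith("_score")}
        for k, v in row.items():
            if k.endswith("_score"):
                out.append({**base,
                            "metric_name": k[:-len("_score")].capitalize(),
                            "metric_score": float(v)})
    return out

wide = """dataset_id,query,faithfulness_score,relevance_score
abc,What is...,0.85,0.92
"""
for r in wide_to_long(wide):
    print(r["metric_name"], r["metric_score"])
# Faithfulness 0.85
# Relevance 0.92
```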

Data Pipeline (DuckDB)

For production use, connect AXIS to your PostgreSQL database. The sync engine handles everything automatically:

1. Concurrent Read — The sync engine runs dataset_query and results_query concurrently against your Postgres database.

2. Write to DuckDB — Data is written to internal DuckDB tables (monitoring_dataset and monitoring_results).

3. Join & Serve — A JOIN view (monitoring_data) combines both tables. All /api/monitoring/analytics/* endpoints query DuckDB directly for fast aggregation.

4. Incremental Sync — Periodic sync keeps DuckDB up-to-date without restarts. KPIs use a metadata cache fast path — pre-aggregated during sync and served instantly.
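The two-tables-plus-join-view shape can be sketched structurally. This example uses in-memory SQLite purely as a stand-in for DuckDB so it runs with the standard library; table and view names follow the ones named above, but the schema and data are invented:

```python
import sqlite3

# Stand-in for DuckDB: two synced tables joined by a serving view.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE monitoring_dataset (dataset_id TEXT, query TEXT)")
con.execute("CREATE TABLE monitoring_results "
            "(dataset_id TEXT, metric_name TEXT, metric_score REAL)")
con.execute("INSERT INTO monitoring_dataset VALUES ('abc', 'What is...')")
con.execute("INSERT INTO monitoring_results VALUES ('abc', 'Faithfulness', 0.85)")

# The JOIN view the analytics endpoints would query.
con.execute("""CREATE VIEW monitoring_data AS
               SELECT d.dataset_id, d.query, r.metric_name, r.metric_score
               FROM monitoring_dataset d
               JOIN monitoring_results r USING (dataset_id)""")
print(con.execute("SELECT * FROM monitoring_data").fetchall())
# [('abc', 'What is...', 'Faithfulness', 0.85)]
```

In the real pipeline, incremental sync would append to the two base tables; the view needs no changes.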

Next Steps
