Monitoring
Real-time observability for production AI systems — track scores, latency, classifications, and anomalies across every deployment.
Why Use Monitoring?
Your evaluation metrics don't stop mattering after deployment. The Monitoring module gives you continuous visibility into how your AI agents and models perform in production, so you can catch regressions before users do.
Score Trends
Track faithfulness, relevance, and custom metrics over time with configurable granularity.
Latency Tracking
Monitor P50/P95/P99 response times grouped by environment, source, or component.
Classification Breakdowns
See how categorical metrics shift over time — topic drift, sentiment changes, intent distribution.
Anomaly Alerts
Automatic z-score, moving average, and threshold-based alerting with zero configuration.
Quick Start
Get monitoring data flowing in under two minutes:
Navigate to Monitor
Click Monitor in the left sidebar. If no data is loaded, you'll see the upload screen.
Upload CSV or Connect Database
Drag a CSV file into the upload zone, or configure a DuckDB/Postgres connection in Settings → Database for automatic sync.
Explore Your Metrics
Once data loads, the dashboard populates with KPI cards, trend charts, and trace tables. Use the filter bar to slice by environment, source, or time range.
Page Anatomy
Here's how the Monitoring dashboard is organized, with every major section labeled:
At the top sits the page title ("Monitor") and subtitle ("Real-time performance monitoring and alerts"), followed by the filter bar. The filter dropdowns map to the environment, source_component, and source_type fields and are only shown when the data contains those fields.
Score Metrics Tab
The default tab. Shows numeric evaluation scores (0–1) for metrics like Faithfulness, Relevance, Completeness, and any custom score metrics in your data.
Score Trend Chart
A full-width time-series chart plotting each metric as a separate line with P50–P95 band shading. Two dashed reference lines mark your configured thresholds:
- Good threshold (default 0.7) — green dashed line
- Pass threshold (default 0.5) — yellow dashed line
Controls:
- Granularity toggle (top-right of chart) — switch between Hourly, Daily, and Weekly aggregation
- Hover any data point to see metric name, timestamp, avg/P50/P95/P99 scores, and sample count
- The legend at the bottom is interactive — click a metric name to show/hide its line
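The granularity toggle boils down to truncating each timestamp to its bucket and aggregating the scores that land there. A minimal sketch of that bucketing, using illustrative sample data (the function names and exact percentile method are assumptions, not the AXIS implementation):

```python
from datetime import datetime
from statistics import mean, quantiles

# Hypothetical observations: (timestamp, score) pairs for one metric.
points = [
    (datetime(2024, 1, 15, 10, 5), 0.91),
    (datetime(2024, 1, 15, 10, 40), 0.74),
    (datetime(2024, 1, 15, 11, 10), 0.62),
    (datetime(2024, 1, 15, 11, 55), 0.88),
]

def bucket_key(ts, granularity):
    """Truncate a timestamp to its Hourly/Daily/Weekly bucket start."""
    if granularity == "hourly":
        return ts.replace(minute=0, second=0, microsecond=0)
    if granularity == "daily":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    # weekly: snap to the Monday of the ISO week
    return datetime.fromordinal(ts.toordinal() - ts.weekday())

def aggregate(points, granularity):
    """Group scores by bucket and compute avg/P50/P95 per bucket."""
    buckets = {}
    for ts, score in points:
        buckets.setdefault(bucket_key(ts, granularity), []).append(score)
    out = {}
    for key, scores in sorted(buckets.items()):
        # quantiles() needs >= 2 points; degenerate buckets repeat the lone score
        qs = quantiles(scores, n=100) if len(scores) > 1 else [scores[0]] * 99
        out[key] = {"avg": mean(scores), "p50": qs[49], "p95": qs[94],
                    "count": len(scores)}
    return out

hourly = aggregate(points, "hourly")
```

Switching the toggle re-runs the same aggregation with a coarser `bucket_key`, which is why hourly charts show more points than daily ones over the same range.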
Latency & Metric Breakdown Charts
Below the trend chart, two charts sit side by side:
Latency Distribution
- Histogram showing how response times are distributed
- Vertical dashed lines mark P50, P95, and P99 percentiles
- Use the Group By dropdown (top-right) to split by Environment, Source, Component, or Type
Metric Pass Rates
- Horizontal bar chart — one bar per metric
- Color-coded: ≥ 70% green, 50–70% yellow, < 50% red
- Reference lines at the 50% and 70% marks
- Hover any bar for the exact pass rate, average score, and sample count
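The pass-rate coloring follows directly from the two reference thresholds. A small sketch, assuming "pass" means score >= 0.5 (sample scores are illustrative):

```python
def pass_rate(scores, pass_threshold=0.5):
    """Fraction of scores at or above the pass threshold."""
    return sum(s >= pass_threshold for s in scores) / len(scores)

def band(rate):
    """Color band used by the Metric Pass Rates chart."""
    if rate >= 0.70:
        return "green"
    if rate >= 0.50:
        return "yellow"
    return "red"

scores = {"Faithfulness": [0.92, 0.61, 0.88, 0.34],
          "Relevance": [0.88, 0.18, 0.92, 0.95]}
rates = {metric: pass_rate(vals) for metric, vals in scores.items()}
```

A metric with a 0.75 pass rate would render as a green bar, with reference lines still drawn at 0.50 and 0.70 for comparison.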
Top Failing Outputs
Below the charts, a table surfaces the 15 lowest-scoring traces across all metrics — your worst performers at a glance.
| Metric | Trace ID | Timestamp | Input | LLM Output | |
|---|---|---|---|---|---|
| Faithfulness 0.12 | 01KFX...a3b | Feb 14, 14:23 | … | … | 👁 |
| Relevance 0.18 | 01KFX...d7e | Feb 14, 13:41 | … | … | 👁 |
| Completeness 0.31 | 01KFX...f2c | Feb 14, 12:09 | … | … | 👁 |
- The Threshold dropdown (top-right) controls the cutoff — options: 0.3, 0.5, 0.7, 0.8, 0.9
- Each row shows a colored indicator: red for scores below half the threshold, yellow otherwise
- Click the eye icon on any row to open the Trace Detail Modal
Evaluation Traces Table
The full paginated trace table appears below, separated by an "ALL TRACES" divider.
Evaluation Traces (12,847 records)
| Trace ID | Timestamp ↕ | Model | Env | Latency | Metric | Score | |
|---|---|---|---|---|---|---|---|
| 01KFX...c9a | Feb 14, 15:02 | gpt-4o | production | 0.84s | Faithfulness | 0.92 | 👁 |
| 01KFX...c9a | Feb 14, 15:02 | gpt-4o | production | 0.84s | Relevance | 0.88 | 👁 |
| 01KFX...b4f | Feb 14, 14:58 | claude-3.5 | staging | 1.23s | Faithfulness | 0.61 | 👁 |
| 01KFX...a1d | Feb 14, 14:55 | gpt-4o | production | 2.41s | Completeness | 0.34 | 👁 |
Controls in the header bar:
- Search — filters by trace ID, query text, or output text (debounced 300ms)
- Metric filter — show only a specific metric (e.g., just "Faithfulness")
- Sort — toggle between "Recent" (timestamp) and "Score"; click the arrow to flip ascending/descending
- Pagination — 10 records per page, server-side when using DuckDB
Classification Tab
Visible when your data includes metrics with metric_category: CLASSIFICATION (e.g., intent detection, topic classification, sentiment). This tab shows how categorical values are distributed and how they shift over time.
The tab is laid out in three panels: Classification Summary, Category Distribution, and Category Trends.
Key interactions:
- Source toggle (top) — switch between grouping by the Explanation or Output field
- Metric selector — choose which classification metric to analyze
- Trends toggle — switch the Y-axis between percentage (relative) and count (absolute)
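The percentage/count toggle is just two views of the same tally: absolute counts per category, or each count divided by the bucket total. A sketch with hypothetical classified outputs:

```python
from collections import Counter

# Hypothetical classification outputs for one time bucket.
labels = ["billing", "billing", "support", "refund", "support", "billing"]

counts = Counter(labels)                                       # absolute view
total = sum(counts.values())
percentages = {k: 100 * v / total for k, v in counts.items()}  # relative view
```

The relative view is the one that reveals drift: a category can hold a steady count while its share of traffic collapses as volume grows.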
Analysis Insights Tab
For metrics with metric_category: ANALYSIS — structured JSON evaluations like detailed reasoning, multi-signal assessments, or chain-of-thought outputs. Instead of charts, this tab shows an expandable card feed.
Recent Analysis Records
- Cards are paginated (10 per page) with controls in the header
- The first card auto-expands on load
- Click any card header to expand it and see the full query, output, explanation, and signals
- Signals render as colored badges: ✓ pass or ✗ fail
Alerts Tab
AXIS automatically detects anomalies using multiple detection methods — no configuration required. When issues are found, the Alerts tab shows a badge count on the tab header.
Alert types and their value blocks:
| Detection Method | Shows | Description |
|---|---|---|
| threshold | Current, Target, Gap | Value crossed a configured threshold |
| z-score | Current, Z-Score, Threshold | Statistical outlier (default: |z| > 2.0) |
| moving-average | Current, MA, Deviation | Deviated from rolling average |
| rate-of-change | Current, Previous, Change | Sudden metric shift between periods |
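To make the detection methods above concrete, here is a minimal sketch of the z-score and moving-average checks. This is illustrative logic matching the table's defaults, not the actual AXIS detector; function names and the deviation cutoff are assumptions:

```python
from statistics import mean, stdev

def zscore_alert(history, current, z_threshold=2.0):
    """Flag `current` as a statistical outlier when |z| > z_threshold."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return False, 0.0
    z = (current - mu) / sigma
    return abs(z) > z_threshold, z

def moving_average_alert(series, window=5, max_deviation=0.15):
    """Flag the newest point if it strays from the rolling average
    of the preceding `window` points by more than max_deviation."""
    ma = mean(series[-window - 1:-1])
    deviation = series[-1] - ma
    return abs(deviation) > max_deviation, ma, deviation

# A stable metric history, then a sudden drop.
history = [0.82, 0.85, 0.80, 0.84, 0.83]
alert, z = zscore_alert(history, 0.41)
```

A drop from a tight 0.80-0.85 band to 0.41 produces a large negative z-score, so the point fires a z-score alert; the Current/Z-Score/Threshold values shown in the alert card correspond to the inputs and output of this check.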
Trace Detail Modal
Click the eye icon on any trace row to open a full detail view. The modal shows everything about a single evaluation record in one place.
Modal sections (top to bottom):
- Score header — large colored score circle with progress bar and metric name
- Trace ID banner — full ID with a copy button
- Input / Output — side-by-side panels showing the query and LLM response
- Expected Output — the ground truth, if present in your data
- Explanation — blue-highlighted evaluator reasoning
- Critique — yellow-highlighted critique notes, if present
- Signals — colored pass/fail badges from structured evaluation outputs
- Additional Metadata — any extra fields, rendered as key-value pairs or collapsible JSON
- Footer — model, environment, latency, timestamp at a glance
Filtering & Time Ranges
Filters apply globally across all tabs. When you change a filter, the KPI cards, charts, tables, and alerts all update together.
Metadata Filters
The filter dropdowns at the top appear conditionally — only if your data contains the corresponding field:
| Filter | Data Field | Example Values |
|---|---|---|
| Environment | environment | production, staging, development |
| Component | source_component | retriever, generator, router |
| Type | source_type | langfuse, custom, opentelemetry |
There is no dedicated model filter; to slice by model, add a source_component or environment column to your data that identifies the model.
Time Range Presets
The time range picker offers quick presets plus a custom date range mode:
- Presets: Last hour · Last 6 hours · Last 24 hours (default) · Last 7 days · Last 30 days
- Custom range — opens two date inputs; click Apply to confirm
The reset button (↻) clears all metadata filters and resets the time range back to "Last 24 hours".
Data Formats
AXIS auto-detects your CSV format on upload. Two formats are supported:
Long Format (Recommended)
Each row is a single metric observation. Best for production pipelines that emit normalized metric data.
dataset_id,query,actual_output,metric_name,metric_score,timestamp,environment,source_name
01KFX...,What is...,The answer...,Faithfulness,0.85,2024-01-15T10:30:00,production,alpha_bot
01KFX...,What is...,The answer...,Relevance,0.92,2024-01-15T10:30:00,production,alpha_bot
Wide Format
Each row contains all metrics as columns. Columns ending in _score are auto-detected as metrics.
dataset_id,query,actual_output,faithfulness_score,relevance_score,timestamp
01KFX...,What is...,The answer...,0.85,0.92,2024-01-15T10:30:00
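Since columns ending in `_score` are treated as metrics, converting wide format to the recommended long format is mechanical: explode each `_score` column into its own row. A stdlib sketch of that conversion (the helper name and the `.title()` metric naming are illustrative assumptions):

```python
import csv
import io

wide_csv = """dataset_id,query,actual_output,faithfulness_score,relevance_score,timestamp
t1,What is...,The answer...,0.85,0.92,2024-01-15T10:30:00
"""

def wide_to_long(text):
    """Turn each *_score column of a wide-format CSV into a long-format row."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        base = {k: v for k, v in record.items() if not k.endswith("_score")}
        for col in (c for c in record if c.endswith("_score")):
            rows.append({**base,
                         "metric_name": col[: -len("_score")].title(),
                         "metric_score": float(record[col])})
    return rows

long_rows = wide_to_long(wide_csv)
```

Each wide row with N `_score` columns becomes N long rows sharing the same trace metadata, which matches the repeated `dataset_id`/`timestamp` values in the long-format sample above.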
Data Pipeline (DuckDB)
For production use, connect AXIS to your PostgreSQL database. The sync engine handles everything automatically:
Concurrent Read
The sync engine runs dataset_query and results_query concurrently against your Postgres database.
Write to DuckDB
Data is written to internal DuckDB tables (monitoring_dataset and monitoring_results).
Join & Serve
A JOIN view (monitoring_data) combines both tables. All /api/monitoring/analytics/* endpoints query DuckDB directly for fast aggregation.
Incremental Sync
Periodic sync keeps DuckDB up-to-date without restarts. KPIs use a metadata cache fast path — pre-aggregated during sync and served instantly.