Data Formats¶
AXIS supports multiple data formats for evaluation results and production monitoring. The backend automatically detects the format on upload and normalizes column names to a standard schema.
Evaluation Formats¶
Evaluation data represents the results of running LLM evaluations. AXIS detects five formats, checked in priority order:
| Format | Key Columns | Use Case |
|---|---|---|
| `eval_runner` | `run_id`, `dataset_id`, `passed` | Output from the AXIS evaluation runner |
| `tree_format` | `metric_name`, `parent`, `metric_type`, `metric_score` | Hierarchical metrics with parent relationships |
| `flat_format` | `metric_name`, `metric_score` | Simple metric scores in long format |
| `simple_judgment` | `judgment` | Binary pass/fail judgments |
| `fresh_annotation` | `dataset_id`, `evaluation_name`, `query`, `actual_output` | Raw outputs ready for human annotation |
Format Detection Flow¶
The backend's `detect_data_format()` function checks for formats in a strict priority order. The first match wins:
```mermaid
flowchart TD
    START([Upload CSV]) --> A{Has run_id +<br/>dataset_id + passed?}
    A -- Yes --> R[eval_runner]
    A -- No --> B{Has metric_name +<br/>parent + metric_type +<br/>metric_score?}
    B -- Yes --> T[tree_format]
    B -- No --> C{Has metric_name +<br/>metric_score?}
    C -- Yes --> F[flat_format]
    C -- No --> D{Has judgment<br/>column?}
    D -- Yes --> J[simple_judgment]
    D -- No --> E{Has dataset_id +<br/>evaluation_name +<br/>query + actual_output?}
    E -- Yes --> A2[fresh_annotation]
    E -- No --> U[unknown]
```
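As a minimal sketch of that chain (assuming a pandas `DataFrame` and that column presence is the only criterion; the real backend may apply further validation):

```python
# Sketch of the priority chain in detect_data_format(). Column sets are
# taken from the format table above; the first matching set wins.
import pandas as pd

def detect_data_format(df: pd.DataFrame) -> str:
    cols = set(df.columns)
    checks = [  # checked in order; first match wins
        ("eval_runner", {"run_id", "dataset_id", "passed"}),
        ("tree_format", {"metric_name", "parent", "metric_type", "metric_score"}),
        ("flat_format", {"metric_name", "metric_score"}),
        ("simple_judgment", {"judgment"}),
        ("fresh_annotation", {"dataset_id", "evaluation_name", "query", "actual_output"}),
    ]
    for name, required in checks:
        if required <= cols:
            return name
    return "unknown"
```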
Tree Format Example¶
Hierarchical metrics where each row is a single metric observation with parent-child relationships:
```csv
dataset_id,query,actual_output,metric_name,metric_score,metric_type,parent,weight,explanation
REC-001,What is AI?,AI is...,Overall Quality,0.82,metric,,1.0,
REC-001,What is AI?,AI is...,Faithfulness,0.90,component,Overall Quality,0.5,Accurate statements
REC-001,What is AI?,AI is...,Relevance,0.74,component,Overall Quality,0.5,Mostly relevant
```
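To make the parent-child relationship concrete, here is an illustrative helper (the name `build_metric_tree` is hypothetical, not part of AXIS) that groups rows into a parent-to-children map:

```python
# Illustrative only: group tree_format rows into parent -> children lists.
# A missing/empty parent (NaN after CSV parsing) marks a root metric.
from collections import defaultdict
import pandas as pd

def build_metric_tree(df: pd.DataFrame) -> dict:
    children = defaultdict(list)
    for _, row in df.iterrows():
        parent = row["parent"] if isinstance(row["parent"], str) and row["parent"] else None
        children[parent].append(row["metric_name"])
    return dict(children)

# For the CSV above:
# {None: ["Overall Quality"], "Overall Quality": ["Faithfulness", "Relevance"]}
```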
Flat Format Example¶
Simple long-format scores without hierarchy:
```csv
dataset_id,query,actual_output,metric_name,metric_score,explanation
REC-001,What is AI?,AI is...,Faithfulness,0.90,Accurate statements
REC-001,What is AI?,AI is...,Relevance,0.74,Mostly relevant
REC-002,Explain ML,ML is...,Faithfulness,0.85,Good accuracy
```
Simple Judgment Example¶
Binary pass/fail evaluations:
```csv
dataset_id,query,actual_output,judgment
REC-001,What is AI?,AI is...,pass
REC-002,Explain ML,ML is...,fail
```
Monitoring Formats¶
Monitoring data represents time-series observations from production systems. Two formats are supported:
Long Format (Recommended)¶
Each row represents a single metric observation. This format is detected when the data has both `metric_name` and `metric_score` columns.
```csv
dataset_id,query,actual_output,metric_name,metric_score,metric_category,timestamp,environment,source_name
01KFX,What is...,The answer,Faithfulness,0.85,SCORE,2024-01-15T10:30:00,production,my_agent
01KFX,What is...,The answer,Relevance,0.92,SCORE,2024-01-15T10:30:00,production,my_agent
01KFX,What is...,The answer,Topic,RELEVANT,CLASSIFICATION,2024-01-15T10:30:00,production,my_agent
```
!!! tip "When to use long format"
    Long format is the recommended format for production monitoring. It maps naturally to normalized database schemas and supports mixed metric categories per record.
Wide Format¶
Each row contains all metrics as separate columns. Columns ending in `_score` are auto-detected as metrics.
```csv
dataset_id,query,actual_output,faithfulness_score,relevance_score,timestamp
01KFX,What is...,The answer,0.85,0.92,2024-01-15T10:30:00
```
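If your data starts out wide, here is a hedged pandas sketch for converting it to the recommended long format (the file name and the title-casing of metric names are assumptions):

```python
# Sketch: convert a wide-format frame to long format with pandas.melt,
# stripping the '_score' suffix to recover metric names.
import pandas as pd

wide = pd.read_csv("wide.csv")  # hypothetical input file
score_cols = [c for c in wide.columns if c.endswith("_score")]
long = wide.melt(
    id_vars=[c for c in wide.columns if c not in score_cols],
    value_vars=score_cols,
    var_name="metric_name",
    value_name="metric_score",
)
# "faithfulness_score" -> "Faithfulness" (casing convention assumed)
long["metric_name"] = long["metric_name"].str.removesuffix("_score").str.title()
```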
Metric Detection Logic¶
```python
def detect_metric_columns(df):
    # Long format: metric_name + metric_score columns present
    if "metric_name" in df.columns and "metric_score" in df.columns:
        return ["metric_score"]
    # Wide format: columns ending with _score
    return [col for col in df.columns if col.endswith("_score")]
```
Column Normalization¶
The monitoring upload automatically normalizes common column name variations. The normalization is case-insensitive and replaces spaces/hyphens with underscores before lookup.
Normalization Map¶
| Input Alias | Normalized To |
|---|---|
| `id` | `dataset_id` |
| `record_id` | `dataset_id` |
| `dataset_id` | `dataset_id` |

| Input Alias | Normalized To |
|---|---|
| `time` | `timestamp` |
| `created_at` | `timestamp` |
| `dataset_created_at` | `timestamp` |

| Input Alias | Normalized To |
|---|---|
| `input` | `query` |
| `prompt` | `query` |
| `user_input` | `query` |
| `output` | `actual_output` |
| `response` | `actual_output` |
| `model_output` | `actual_output` |
| `completion` | `actual_output` |

| Input Alias | Normalized To |
|---|---|
| `model` | `model_name` |
| `agent` | `model_name` |
| `agent_name` | `model_name` |

| Input Alias | Normalized To |
|---|---|
| `env` | `environment` |
| `stage` | `environment` |

| Input Alias | Normalized To |
|---|---|
| `latency_ms` | `latency` |
| `response_time` | `latency` |
| `error` | `has_errors` |
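A minimal sketch of that lookup, using the alias map above (the function name is illustrative):

```python
# Illustrative normalization: lowercase, replace spaces/hyphens with
# underscores, then look the result up in the alias map.
ALIASES = {
    "id": "dataset_id", "record_id": "dataset_id",
    "time": "timestamp", "created_at": "timestamp", "dataset_created_at": "timestamp",
    "input": "query", "prompt": "query", "user_input": "query",
    "output": "actual_output", "response": "actual_output",
    "model_output": "actual_output", "completion": "actual_output",
    "model": "model_name", "agent": "model_name", "agent_name": "model_name",
    "env": "environment", "stage": "environment",
    "latency_ms": "latency", "response_time": "latency",
    "error": "has_errors",
}

def normalize_column(name: str) -> str:
    key = name.strip().lower().replace(" ", "_").replace("-", "_")
    return ALIASES.get(key, key)

# normalize_column("User Input") -> "query"
```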
!!! note "Custom column mappings"
    When loading from a database via YAML config, you can specify a `columns` map that takes precedence over the default normalization, as illustrated below. This allows mapping arbitrary column names to the standard schema.
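Purely as an illustration (the key structure and column names are hypothetical; consult your actual config schema), such a map might look like:

```yaml
# Hypothetical example: map arbitrary source columns to the standard schema.
columns:
  request_text: query          # source column -> standard column
  llm_answer: actual_output
  score_value: metric_score
```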
Metric Categories¶
Each metric in the monitoring long format can have a `metric_category` that determines how the metric output is interpreted and displayed:
| Category | Value Type | Display | Example Output |
|---|---|---|---|
| `SCORE` | Numeric (0-1) | Charts, thresholds, pass/fail, trend lines | `0.85`, `0.92` |
| `CLASSIFICATION` | String | Category breakdowns, distribution bar charts | `"RELEVANT"`, `"POSITIVE"` |
| `ANALYSIS` | JSON / Text | Structured detail views, signal inspectors | `{"issues": [...], "summary": "..."}` |
```mermaid
graph LR
    M[metric_category] --> S[SCORE]
    M --> C[CLASSIFICATION]
    M --> A[ANALYSIS]
    S --> S1[Trend charts]
    S --> S2[Pass/fail rates]
    S --> S3[Thresholds]
    C --> C1[Category bar charts]
    C --> C2[Distribution breakdowns]
    C --> C3[Trend stacked areas]
    A --> A1[Signal inspector]
    A --> A2[Detail view cards]
    A --> A3[Paginated records]
```
If `metric_category` is not provided, the metric defaults to `SCORE`.
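A minimal sketch of that fallback (assuming records arrive as dicts; the helper name is illustrative):

```python
def resolve_category(record: dict) -> str:
    """Fall back to SCORE when metric_category is absent or blank."""
    value = record.get("metric_category")
    if isinstance(value, str) and value.strip():
        return value.strip().upper()
    return "SCORE"
```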
Shared Constants¶
Column names, thresholds, and colors are defined in both the Python backend and the TypeScript frontend to keep them in sync.
```python
from dataclasses import dataclass

class Columns:
    DATASET_ID = "dataset_id"
    QUERY = "query"
    ACTUAL_OUTPUT = "actual_output"
    METRIC_NAME = "metric_name"
    METRIC_SCORE = "metric_score"
    METRIC_CATEGORY = "metric_category"
    TIMESTAMP = "timestamp"
    # ... 40+ constants

@dataclass(frozen=True)
class Thresholds:
    PASSING_RATE: float = 0.5
    GREEN_THRESHOLD: float = 0.7
    RED_THRESHOLD: float = 0.3

@dataclass(frozen=True)
class Colors:
    PALETTE = {
        "primary": "#8B9F4F",
        "success": "#27AE60",
        "warning": "#F39C12",
        "error": "#E74C3C",
    }
```
```typescript
export const Columns = {
  DATASET_ID: 'dataset_id',
  QUERY: 'query',
  ACTUAL_OUTPUT: 'actual_output',
  METRIC_NAME: 'metric_name',
  METRIC_SCORE: 'metric_score',
  METRIC_CATEGORY: 'metric_category',
  TIMESTAMP: 'timestamp',
  // ... 40+ constants
} as const;

export const Thresholds = {
  PASSING_RATE: 0.5,
  GREEN_THRESHOLD: 0.7,
  RED_THRESHOLD: 0.3,
} as const;

export const DefaultColors = {
  primary: '#8B9F4F',
  success: '#27AE60',
  warning: '#F39C12',
  error: '#E74C3C',
} as const;
```
!!! warning "Keep in sync"
    When adding or renaming a column constant, update both `backend/app/config/constants.py` and `frontend/src/types/index.ts`. There is no code generation step; these are maintained manually.
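One way to guard against drift is a small parity test. The sketch below is an assumption about your layout (module path, regex, and test name are all illustrative), not an existing AXIS test:

```python
# Hypothetical parity check: extract KEY: 'value' pairs from the TypeScript
# constants file and verify the Python Columns class is a subset of them.
import re
from pathlib import Path

from app.config.constants import Columns  # backend module path assumed

def test_column_constants_match_frontend():
    ts_source = Path("frontend/src/types/index.ts").read_text()
    # Matches lines like: DATASET_ID: 'dataset_id',
    ts_pairs = dict(re.findall(r"(\b[A-Z_]+\b):\s*'([^']+)'", ts_source))
    py_pairs = {
        name: value
        for name, value in vars(Columns).items()
        if name.isupper() and isinstance(value, str)
    }
    assert py_pairs.items() <= ts_pairs.items(), "frontend constants drifted"
```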
Monitoring Column Reference¶
Complete reference of standard monitoring columns:
| Column | Required | Type | Description |
|---|---|---|---|
| `dataset_id` | Yes | string | Unique identifier for the record |
| `timestamp` | No* | datetime | ISO datetime; auto-generated if missing |
| `query` | No | string | Input/prompt text |
| `actual_output` | No | string | LLM response text |
| `expected_output` | No | string | Ground truth for comparison |
| `metric_name` | No | string | Metric name (long format) |
| `metric_score` | No | number | Metric value (long format) |
| `metric_type` | No | string | Type of metric (component, aggregate) |
| `metric_category` | No | string | Output type: SCORE, CLASSIFICATION, ANALYSIS |
| `environment` | No | string | Deployment environment (e.g., production) |
| `source_name` | No | string | Agent or source system name (e.g., alpha_bot) |
| `source_component` | No | string | Component within the source (e.g., retrieval) |
| `source_type` | No | string | Source type (e.g., langfuse, custom) |
| `evaluation_name` | No | string | Name of the evaluation/experiment |
| `model_name` | No | string | Model used for generation |
| `trace_id` | No | string | Observability trace ID |
| `latency` | No | number | Response time in milliseconds |
| `passed` | No | boolean | Whether the metric passed its threshold |
| `threshold` | No | number | Pass/fail threshold value |
| `explanation` | No | string | Metric explanation text |
| `signals` | No | JSON | Structured evaluation signals |
\*Timestamp is auto-generated with the current time if not present in the uploaded data.
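A sketch of that fallback (the function name is illustrative; the real upload path may differ):

```python
# Illustrative fallback: stamp rows with the current time when the upload
# has no timestamp column.
from datetime import datetime, timezone
import pandas as pd

def ensure_timestamp(df: pd.DataFrame) -> pd.DataFrame:
    if "timestamp" not in df.columns:
        df = df.copy()
        df["timestamp"] = datetime.now(timezone.utc).isoformat()
    return df
```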
Human Signals Formats¶
The Human Signals (HITL) dashboard supports two data versions:
V1 Format¶
Case-level records with fixed columns such as `Case_ID`, `Business`, `Has_Intervention`, `Intervention_Type`, `Sentiment`, and `Final_Outcome`. This is a hardcoded schema.
V2 Format (Data-Driven)¶
Generic signal records where metrics are flattened as `{metric_name}__{signal_key}` columns. The backend auto-discovers the schema and returns:

- `metric_schema`: describes the discovered metrics, their categories, and their signal types
- `display_config`: YAML-driven layout for KPIs, charts, filters, and table columns
The frontend auto-detects V1 vs. V2 based on the presence of `metric_schema` in the upload response.
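Purely as an illustration (the metric and signal names are hypothetical), a V2 record for a Faithfulness metric exposing `score` and `explanation` signals would carry flattened columns like:

```csv
dataset_id,query,Faithfulness__score,Faithfulness__explanation
REC-001,What is AI?,0.90,Accurate statements
```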