Metrics & Evaluation¶
Axion provides 30+ metrics for evaluating AI agents across multiple dimensions.
Quick Start¶
```python
from axion import Dataset
from axion.metrics import Faithfulness, AnswerRelevancy
from axion.runners import evaluation_runner

# Load your dataset
dataset = Dataset.from_csv("eval_data.csv")

# Select metrics
metrics = [Faithfulness(), AnswerRelevancy()]

# Run evaluation
results = await evaluation_runner(dataset, metrics)
```
Metric Output Types¶
**Score**
Numeric value (0–1) with a pass/fail threshold. Example: Faithfulness → 0.85

**Classification**
A single label from a fixed set. Example: SentimentClassification → "positive"

**Analysis**
Structured insights without scoring. Example: ReferralReasonAnalysis → {reasons[], citations[]}
See Creating Custom Metrics for details on choosing the right output type.
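To make the three output types concrete, here is a minimal sketch of what each result shape might look like as plain dataclasses. These are illustrative stand-ins, not Axion's actual result classes; the names `ScoreResult`, `ClassificationResult`, and `AnalysisResult` are assumptions for this example.

```python
from dataclasses import dataclass, field

@dataclass
class ScoreResult:
    # Score output: numeric value in [0, 1] compared against a threshold
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold

@dataclass
class ClassificationResult:
    # Classification output: a single label from a fixed set
    label: str

@dataclass
class AnalysisResult:
    # Analysis output: structured insights with no pass/fail semantics
    reasons: list[str] = field(default_factory=list)
    citations: list[str] = field(default_factory=list)

# A Faithfulness score of 0.85 against a 0.7 threshold passes
result = ScoreResult(score=0.85, threshold=0.7)
```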
Metric Categories¶
**Composite (LLM-based)**
Nuanced evaluation requiring reasoning. Faithfulness, AnswerRelevancy, FactualAccuracy, AnswerCompleteness, AnswerCriteria.

**Heuristic (Non-LLM)**
Fast, deterministic checks. ExactStringMatch, CitationPresence, Latency, ContainsMatch.

**Retrieval**
RAG pipeline evaluation. HitRateAtK, MeanReciprocalRank, ContextualRelevancy, ContextualSufficiency.

**Conversational**
Multi-turn agent evaluation. GoalCompletion, ConversationEfficiency, ConversationFlow.
Composite Metrics (LLM-based)¶
| Metric | What it Measures |
|---|---|
| Faithfulness | Is the answer grounded in retrieved context? |
| AnswerRelevancy | Does the answer address the question? |
| FactualAccuracy | Is the answer factually correct vs. expected? |
| AnswerCompleteness | Are all parts of the question answered? |
| AnswerCriteria | Does the answer meet specific business rules? |
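Composite metrics delegate judgment to an LLM: build a prompt from the inputs, ask the model for a rating, parse a numeric score, and compare it to a threshold. The sketch below shows the general pattern with a caller-supplied `llm` callable; the prompt wording and `parse_score` helper are illustrative assumptions, not Axion's internal implementation.

```python
import re

# Hypothetical judge prompt; real composite metrics use more elaborate rubrics
FAITHFULNESS_PROMPT = (
    "Rate from 0 to 1 how well every claim in the answer is supported "
    "by the context.\nContext: {context}\nAnswer: {answer}\nScore:"
)

def parse_score(raw: str) -> float:
    """Pull the first number out of a judge response and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", raw)
    if match is None:
        raise ValueError(f"no score found in {raw!r}")
    return max(0.0, min(1.0, float(match.group())))

def judge_faithfulness(context: str, answer: str, llm, threshold: float = 0.7):
    # llm is any callable mapping a prompt string to a model response string
    prompt = FAITHFULNESS_PROMPT.format(context=context, answer=answer)
    score = parse_score(llm(prompt))
    return score, score >= threshold
```

Because the judge is an LLM, these metrics are slower and non-deterministic compared with the heuristic metrics below, but they can evaluate qualities (groundedness, relevance) that string matching cannot.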
Heuristic Metrics (Non-LLM)¶
| Metric | What it Measures |
|---|---|
| ExactStringMatch | Exact match between actual and expected |
| CitationPresence | Are citations/references present? |
| Latency | Response time (pass/fail threshold) |
| ContainsMatch | Does output contain required phrases? |
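Heuristic metrics are plain deterministic functions, which is why they run fast and give the same result every time. A minimal sketch of the kind of checks involved (normalization choices here are assumptions, not Axion's exact behavior):

```python
def exact_string_match(actual: str, expected: str) -> bool:
    # Case- and whitespace-normalized exact comparison
    return actual.strip().lower() == expected.strip().lower()

def contains_match(output: str, required_phrases: list[str]) -> bool:
    # Pass only if every required phrase appears in the output
    lowered = output.lower()
    return all(phrase.lower() in lowered for phrase in required_phrases)

def latency_check(elapsed_seconds: float, max_seconds: float = 2.0) -> bool:
    # Pass/fail threshold on response time
    return elapsed_seconds <= max_seconds
```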
Retrieval Metrics¶
| Metric | What it Measures |
|---|---|
| HitRateAtK | Is the right doc in top K results? |
| MeanReciprocalRank | Position of first relevant result |
| ContextualRelevancy | Are retrieved chunks relevant? |
| ContextualSufficiency | Do chunks contain the answer? |
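Hit rate and MRR are standard ranking formulas, so the first two metrics above can be sketched directly in plain Python (this is the textbook math, not Axion's implementation):

```python
def hit_rate_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    # 1.0 if any relevant document appears in the top-k results, else 0.0
    return 1.0 if any(doc in relevant_ids for doc in retrieved_ids[:k]) else 0.0

def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    # 1 / rank of the first relevant result; 0.0 if none was retrieved
    for rank, doc in enumerate(retrieved_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries: list[tuple[list[str], set[str]]]) -> float:
    # Average reciprocal rank over (retrieved_ids, relevant_ids) pairs
    scores = [reciprocal_rank(retrieved, relevant) for retrieved, relevant in queries]
    return sum(scores) / len(scores)
```

ContextualRelevancy and ContextualSufficiency, by contrast, judge chunk *content* rather than rank position, so they belong with the LLM-based metrics in spirit even though they target retrieval.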
Conversational Metrics¶
| Metric | What it Measures |
|---|---|
| GoalCompletion | Did the user achieve their goal? |
| ConversationEfficiency | Were there unnecessary loops? |
| ConversationFlow | Is the dialogue logical? |
Using the Metric Registry¶
```python
from axion.metrics import metric_registry

# List all available metrics
print(metric_registry.list_metrics())

# Get a metric by name
metric = metric_registry.get("Faithfulness")

# Filter by category
composite_metrics = metric_registry.filter(category="composite")
```
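Under the hood, a metric registry of this kind is typically a name-to-class mapping with category tags. The sketch below shows the pattern with the same `list_metrics` / `get` / `filter` surface; it is an illustrative stand-in, not Axion's actual registry code.

```python
class MetricRegistry:
    """Minimal name -> metric-class registry with category filtering."""

    def __init__(self):
        self._metrics: dict[str, tuple[type, str]] = {}

    def register(self, name: str, cls: type, category: str) -> None:
        self._metrics[name] = (cls, category)

    def list_metrics(self) -> list[str]:
        return sorted(self._metrics)

    def get(self, name: str):
        # Instantiate the registered class on lookup
        cls, _ = self._metrics[name]
        return cls()

    def filter(self, category: str) -> list[str]:
        return [name for name, (_, cat) in self._metrics.items() if cat == category]

# Hypothetical usage with placeholder metric classes
class Faithfulness: ...
class Latency: ...

registry = MetricRegistry()
registry.register("Faithfulness", Faithfulness, "composite")
registry.register("Latency", Latency, "heuristic")
```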
Customizing Metrics¶
```python
from axion.metrics import AnswerCriteria, Faithfulness

# Adjust the pass/fail threshold
metric = Faithfulness(threshold=0.8)

# Configure criteria-based scoring
metric = AnswerCriteria(
    criteria_key="my_criteria",
    scoring_strategy="aspect",
)
```