Metrics & Evaluation

Axion provides 30+ metrics for evaluating AI agents across multiple dimensions.

Quick Start

from axion import Dataset
from axion.metrics import Faithfulness, AnswerRelevancy

# Load your dataset
dataset = Dataset.from_csv("eval_data.csv")

# Select metrics
metrics = [Faithfulness(), AnswerRelevancy()]

# Run the evaluation (evaluation_runner is asynchronous; see the script sketch below)
from axion.runners import evaluation_runner
results = await evaluation_runner(dataset, metrics)
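
Because evaluation_runner is awaited above, a standalone script needs an event loop. A minimal sketch of the same run driven by asyncio, assuming only the imports already shown in the Quick Start:

import asyncio

from axion import Dataset
from axion.metrics import Faithfulness, AnswerRelevancy
from axion.runners import evaluation_runner

async def main():
    # Same steps as the Quick Start, wrapped in a coroutine so the runner can be awaited
    dataset = Dataset.from_csv("eval_data.csv")
    metrics = [Faithfulness(), AnswerRelevancy()]
    return await evaluation_runner(dataset, metrics)

if __name__ == "__main__":
    results = asyncio.run(main())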

Metric Output Types

Score
Numeric value (0–1) with pass/fail threshold. Example: Faithfulness → 0.85

Classification
Single label from a fixed set. Example: SentimentClassification → "positive"

Analysis
Structured insights without scoring. Example: ReferralReasonAnalysis → {reasons[], citations[]}

See Creating Custom Metrics for details on choosing the right output type.
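
The three output types can be pictured as three result shapes. The dataclasses below are purely illustrative, not Axion's actual result classes:

from dataclasses import dataclass, field

# Illustrative shapes only; Axion's real result objects may differ.

@dataclass
class ScoreResult:
    value: float       # numeric value in [0, 1]
    threshold: float   # pass/fail cutoff
    passed: bool       # value >= threshold

@dataclass
class ClassificationResult:
    label: str         # one label from a fixed set, e.g. "positive"

@dataclass
class AnalysisResult:
    reasons: list[str] = field(default_factory=list)
    citations: list[str] = field(default_factory=list)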

Metric Categories

1. Composite (LLM-based): nuanced evaluation requiring reasoning. Faithfulness, AnswerRelevancy, FactualAccuracy, AnswerCompleteness, AnswerCriteria.

2. Heuristic (Non-LLM): fast, deterministic checks. ExactStringMatch, CitationPresence, Latency, ContainsMatch.

3. Retrieval: RAG pipeline evaluation. HitRateAtK, MeanReciprocalRank, ContextualRelevancy, ContextualSufficiency.

4. Conversational: multi-turn agent evaluation. GoalCompletion, ConversationEfficiency, ConversationFlow.
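
Metrics from different categories can be mixed in a single run. A short sketch, assuming these classes are importable from axion.metrics like the metrics in the Quick Start:

from axion import Dataset
from axion.metrics import Faithfulness, ExactStringMatch, ContextualRelevancy
from axion.runners import evaluation_runner

dataset = Dataset.from_csv("eval_data.csv")

# One metric from each of three categories
metrics = [
    Faithfulness(),         # composite (LLM-based)
    ExactStringMatch(),     # heuristic (non-LLM)
    ContextualRelevancy(),  # retrieval
]

results = await evaluation_runner(dataset, metrics)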

Composite Metrics (LLM-based)

Metric | What it Measures
Faithfulness | Is the answer grounded in retrieved context?
AnswerRelevancy | Does the answer address the question?
FactualAccuracy | Is the answer factually correct vs. expected?
AnswerCompleteness | Are all parts of the question answered?
AnswerCriteria | Does the answer meet specific business rules?

Heuristic Metrics (Non-LLM)

Metric | What it Measures
ExactStringMatch | Exact match between actual and expected output
CitationPresence | Are citations/references present?
Latency | Response time (pass/fail threshold)
ContainsMatch | Does the output contain required phrases?
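
Heuristic metrics run without an LLM, so they are cheap enough to include in every run. The constructor arguments below (a latency threshold and a list of required phrases) are assumptions for illustration; check the Metrics Reference for the real parameter names:

from axion.metrics import ExactStringMatch, CitationPresence, Latency, ContainsMatch

metrics = [
    ExactStringMatch(),
    CitationPresence(),
    Latency(threshold=2.0),             # assumed: maximum seconds before failing
    ContainsMatch(phrases=["refund"]),  # assumed: required-phrases parameter
]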

Retrieval Metrics

Metric | What it Measures
HitRateAtK | Is the right document in the top K results?
MeanReciprocalRank | Position of the first relevant result
ContextualRelevancy | Are the retrieved chunks relevant?
ContextualSufficiency | Do the chunks contain the answer?
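
Retrieval metrics evaluate the ranked chunks your RAG pipeline returns. A sketch, with the k argument to HitRateAtK assumed for illustration:

from axion.metrics import HitRateAtK, MeanReciprocalRank, ContextualSufficiency

metrics = [
    HitRateAtK(k=5),   # assumed: top-K cutoff parameter
    MeanReciprocalRank(),
    ContextualSufficiency(),
]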

Conversational Metrics

Metric | What it Measures
GoalCompletion | Did the user achieve their goal?
ConversationEfficiency | Were there unnecessary loops?
ConversationFlow | Is the dialogue logical?

Using the Metric Registry

from axion.metrics import metric_registry

# List all available metrics
print(metric_registry.list_metrics())

# Get metric by name
metric = metric_registry.get("Faithfulness")

# Filter by category
composite_metrics = metric_registry.filter(category="composite")
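
The registry pairs naturally with the runner when you want a whole category at once. A sketch, assuming filter() returns ready-to-use metric objects:

from axion import Dataset
from axion.metrics import metric_registry
from axion.runners import evaluation_runner

dataset = Dataset.from_csv("eval_data.csv")

# Assumes filter() returns metric instances; if it returns classes instead,
# instantiate each one before passing it to the runner.
composite_metrics = metric_registry.filter(category="composite")

results = await evaluation_runner(dataset, composite_metrics)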

Customizing Metrics

from axion.metrics import Faithfulness, AnswerCriteria

# Adjust threshold
metric = Faithfulness(threshold=0.8)

# Custom evaluation criteria and scoring strategy
metric = AnswerCriteria(
    criteria_key="my_criteria",
    scoring_strategy="aspect"
)
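
Customized metrics are passed to the runner like any other metric. A short sketch combining the two examples above:

from axion import Dataset
from axion.metrics import Faithfulness, AnswerCriteria
from axion.runners import evaluation_runner

dataset = Dataset.from_csv("eval_data.csv")

metrics = [
    Faithfulness(threshold=0.8),   # stricter pass/fail cutoff
    AnswerCriteria(criteria_key="my_criteria", scoring_strategy="aspect"),
]

results = await evaluation_runner(dataset, metrics)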

See also: Running Evaluations, Creating Custom Metrics, Metrics Reference