# Answer Relevancy

*LLM-Powered · Knowledge · Single Turn · Multi-Turn*
## At a Glance

| | |
|---|---|
| **Score Range** | 0.0 – 1.0 (ratio of relevant statements) |
| **Default Threshold** | 0.5 (pass/fail cutoff) |
| **Required Inputs** | `query`, `actual_output` (optional: `conversation`) |
## What It Measures

Answer Relevancy evaluates whether each statement in the AI's response directly addresses the user's query. Unlike Faithfulness (which checks factual grounding), this metric measures topical alignment: did the AI stay on topic, or go off on tangents?
| Score | Interpretation |
|---|---|
| 1.0 | Every statement directly addresses the query |
| 0.7+ | Mostly relevant with minor tangents |
| 0.5 | Threshold: mix of relevant and off-topic content |
| < 0.5 | Significant off-topic or irrelevant content |
**Good fit:**

- Q&A systems & chatbots
- Customer support agents
- Search result evaluation
- Any query-response system

**Poor fit:**

- Open-ended conversations
- Exploratory discussions
- No clear query/question
- Tasks where tangents are valuable
> **See Also: Faithfulness**
> Answer Relevancy checks if statements address the user's query (topical alignment). Faithfulness checks if claims are grounded in the source context (factual accuracy). Use both together for comprehensive RAG evaluation.
## How It Works

The metric uses an evaluator LLM to decompose the response into atomic statements, then judges each statement's relevance to the query.
### Step-by-Step Process
```mermaid
flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[AI Response]
    end
    subgraph EXTRACT["🔍 Step 1: Statement Extraction"]
        C[StatementExtractor LLM]
        D["Atomic Statements<br/><small>Self-contained facts</small>"]
    end
    subgraph JUDGE["⚖️ Step 2: Relevancy Judgment"]
        E[RelevancyJudge LLM]
        F["Verdict per Statement<br/><small>yes / no / idk</small>"]
    end
    subgraph SCORE["📊 Step 3: Scoring"]
        G["Count Relevant"]
        H["Calculate Ratio"]
        I["Final Score"]
    end
    A & B --> C
    C --> D
    D --> E
    A --> E
    E --> F
    F --> G
    G --> H
    H --> I
    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EXTRACT stroke:#3b82f6,stroke-width:2px
    style JUDGE stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style I fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff
```
Each extracted statement receives a verdict indicating its relevance to the query:

| Verdict | Meaning | Score |
|---|---|---|
| `yes` | Statement directly addresses the query. Clearly relevant. | 1.0 |
| `idk` | Ambiguous relevance. Scores 1.0 by default, or 0.0 with `penalize_ambiguity=True`. | Configurable |
| `no` | Statement is off-topic or doesn't address the query at all. | 0.0 |
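The scoring step reduces to a simple ratio over these verdicts. A plain-Python sketch of the arithmetic (illustrative only, not the library's actual implementation):

```python
def relevancy_score(verdicts, penalize_ambiguity=False):
    """Ratio of relevant statements to total statements.

    'yes' counts as relevant, 'no' as irrelevant; 'idk' counts as
    relevant by default and as irrelevant when penalize_ambiguity=True.
    """
    if not verdicts:
        return 0.0
    relevant = {'yes'} if penalize_ambiguity else {'yes', 'idk'}
    return sum(v in relevant for v in verdicts) / len(verdicts)

print(relevancy_score(['yes', 'idk', 'no']))        # lenient: 2/3
print(relevancy_score(['yes', 'idk', 'no'], True))  # strict: 1/3
```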
## Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `relevancy_mode` | `'strict' \| 'task'` | `'task'` | `strict`: only direct answers count. `task`: helpful related info also counts |
| `penalize_ambiguity` | `bool` | `False` | When `True`, ambiguous (`idk`) verdicts score 0.0 instead of 1.0 |
| `multi_turn_strategy` | `'last_turn' \| 'all_turns'` | `'last_turn'` | How to evaluate conversations |
| `mode` | `EvaluationMode` | `GRANULAR` | Evaluation detail level |
### Relevancy Modes

- `task` mode (default): more lenient; counts closely related, helpful information as relevant
- `strict` mode: only statements that directly answer the question count as relevant
Use `strict` mode for high-precision evaluation where tangential information should be penalized.
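A minimal sketch of such a configuration, using the constructor parameters documented for this metric (treat it as illustrative, not canonical):

```python
from axion.metrics import AnswerRelevancy

# Strict mode: only direct answers count as relevant,
# and ambiguous ('idk') verdicts score 0.0 instead of 1.0
metric = AnswerRelevancy(
    relevancy_mode='strict',
    penalize_ambiguity=True,
)
```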
For conversational AI evaluation:

```python
from axion.metrics import AnswerRelevancy

# Evaluate all turns in a conversation
metric = AnswerRelevancy(
    multi_turn_strategy='all_turns'  # or 'last_turn' (default)
)
```
- `last_turn`: only evaluates the final Human→AI exchange
- `all_turns`: evaluates every turn and aggregates via micro-averaging
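Micro-averaging pools statements across all turns before taking the ratio, rather than averaging per-turn scores. A plain-Python sketch of that aggregation, independent of the library:

```python
def micro_average(turn_verdicts):
    """Pool all statements across turns, then take the relevant ratio.

    turn_verdicts: one list of 'yes'/'idk'/'no' verdicts per AI turn.
    'idk' is treated as relevant here (the default, lenient behavior).
    """
    all_verdicts = [v for turn in turn_verdicts for v in turn]
    if not all_verdicts:
        return 0.0
    return sum(v in ('yes', 'idk') for v in all_verdicts) / len(all_verdicts)

# Turn 1: 2/2 relevant; turn 2: 1/2 relevant -> pooled: 3/4
print(micro_average([['yes', 'yes'], ['yes', 'no']]))  # 0.75
```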
## Code Examples

```python
from axion.metrics import AnswerRelevancy
from axion.dataset import DatasetItem

# Initialize with defaults (task mode, lenient)
metric = AnswerRelevancy()

item = DatasetItem(
    query="What features does this laptop have?",
    actual_output=(
        "The laptop has a 15-inch Retina display and 16GB of RAM. "
        "It also comes with a 1-year warranty. "
        "Our company was founded in 2010."
    ),
)

result = await metric.execute(item)
print(result.pretty())
# Score ~0.67: warranty is borderline, founding year is irrelevant
```
```python
from axion.metrics import AnswerRelevancy
from axion.dataset import DatasetItem, MultiTurnConversation
from axion.schema import HumanMessage, AIMessage

conversation = MultiTurnConversation(messages=[
    HumanMessage(content="What is Python?"),
    AIMessage(content="Python is a programming language known for readability."),
    HumanMessage(content="What are its main uses?"),
    AIMessage(content="Python is used for web dev, data science, and automation."),
])

metric = AnswerRelevancy(multi_turn_strategy='all_turns')
item = DatasetItem(conversation=conversation)

result = await metric.execute(item)
print(f"Evaluated {result.signals.evaluated_turns_count} turns")
```
## Metric Diagnostics

Every evaluation is fully interpretable. Access detailed diagnostic results via `result.signals` to understand exactly why a score was given, with no black boxes.

```python
result = await metric.execute(item)
print(result.pretty())  # Human-readable summary
result.signals          # Full diagnostic breakdown
```
**📊 `AnswerRelevancyResult` Structure**

```
AnswerRelevancyResult(
    {
        "overall_score": 1.0,
        "explanation": "The score is 1.0 because the response fully and accurately explains...",
        "relevant_statements_count": 2,
        "irrelevant_statements_count": 0,
        "ambiguous_statements_count": 0,
        "total_statements_count": 2,
        "statement_breakdown": [
            {
                "statement": "The infield fly rule prevents the defense from dropping a fly ball.",
                "verdict": "yes",
                "is_relevant": true,
                "turn_index": 0
            },
            {
                "statement": "The rule prevents an easy double play when runners are on base.",
                "verdict": "yes",
                "is_relevant": true,
                "turn_index": 0
            }
        ],
        "evaluated_turns_count": 1
    }
)
```
### Signal Fields

| Field | Type | Description |
|---|---|---|
| `overall_score` | `float` | The 0–1 relevancy score |
| `explanation` | `str` | Human-readable summary of why the score was given |
| `relevant_statements_count` | `int` | Count of `yes` verdicts |
| `irrelevant_statements_count` | `int` | Count of `no` verdicts |
| `ambiguous_statements_count` | `int` | Count of `idk` verdicts |
| `total_statements_count` | `int` | Total statements extracted |
| `statement_breakdown` | `List` | Per-statement verdict details |
| `evaluated_turns_count` | `int` | Number of conversation turns evaluated |
## Example Scenarios
### ✅ Scenario 1: Perfect Relevancy (Score: 1.0)

*All statements relevant.*

**Query:** "What are the health benefits of green tea?"

**AI Response:** "Green tea contains antioxidants that may reduce inflammation. It also has caffeine which can improve alertness."

**Analysis:**
| Statement | Verdict | Score |
|---|---|---|
| Green tea contains antioxidants that may reduce inflammation | yes | 1.0 |
| Green tea has caffeine which can improve alertness | yes | 1.0 |
Final Score: 2 / 2 = 1.0
### ⚠️ Scenario 2: Partial Relevancy (Score: 0.67)

*Mixed verdicts.*

**Query:** "What features does this laptop have?"

**AI Response:** "The laptop has a 15-inch display. It has 16GB RAM. Our company has excellent customer service."

**Analysis:**
| Statement | Verdict | Score |
|---|---|---|
| The laptop has a 15-inch display | yes | 1.0 |
| The laptop has 16GB RAM | yes | 1.0 |
| Our company has excellent customer service | no | 0.0 |
Final Score: 2 / 3 = 0.67
The customer service statement doesn't address laptop features.
### ❌ Scenario 3: Poor Relevancy (Score: 0.25)

*Mostly off-topic.*

**Query:** "How do I reset my password?"

**AI Response:** "Our platform uses industry-standard encryption. We were founded in 2015. Password resets can be done via email. We have offices in 3 countries."

**Analysis:**
| Statement | Verdict | Score |
|---|---|---|
| Our platform uses industry-standard encryption | no | 0.0 |
| We were founded in 2015 | no | 0.0 |
| Password resets can be done via email | yes | 1.0 |
| We have offices in 3 countries | no | 0.0 |
Final Score: 1 / 4 = 0.25
Only one statement actually answers the question.
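All three scenario scores are just relevant-to-total ratios; a quick plain-Python check (no library required):

```python
def score(verdicts):
    # relevant ('yes') statements divided by total statements
    return sum(v == 'yes' for v in verdicts) / len(verdicts)

print(score(['yes', 'yes']))                  # Scenario 1 -> 1.0
print(round(score(['yes', 'yes', 'no']), 2))  # Scenario 2 -> 0.67
print(score(['no', 'no', 'yes', 'no']))       # Scenario 3 -> 0.25
```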
## Why It Matters

- Users expect direct answers. Off-topic responses frustrate users and reduce trust in your AI system.
- For chatbots and assistants, staying on topic is crucial. Tangential responses break conversational flow.
- The metric identifies when your model goes off-topic, separate from retrieval issues (Faithfulness) or factual errors.
## Quick Reference

**TL;DR:** Answer Relevancy = does the AI's response actually address what the user asked?

- **Use it when:** you need to ensure responses stay on topic
- **Score interpretation:** higher = more statements directly address the query
- **Key config:** use `relevancy_mode='strict'` for precision, `'task'` for lenient evaluation

**See also:**

- API Reference
- Related metrics: Faithfulness · Answer Completeness · Context Precision