# Answer Relevancy

*LLM-Powered · Knowledge · Single Turn · Multi-Turn*
## At a Glance

| | |
|---|---|
| **Score Range** | 0.0 – 1.0 (ratio of relevant statements) |
| **Default Threshold** | 0.5 (pass/fail cutoff) |
| **Required Inputs** | `query`, `actual_output` (optional: `conversation`) |
## What It Measures

Answer Relevancy evaluates whether each statement in the AI's response directly addresses the user's query. Unlike Faithfulness (which checks factual grounding), this metric measures topical alignment: did the AI stay on topic, or go off on tangents?
| Score | Interpretation |
|---|---|
| 1.0 | Every statement directly addresses the query |
| 0.7+ | Mostly relevant with minor tangents |
| 0.5 | Threshold: mix of relevant and off-topic content |
| < 0.5 | Significant off-topic or irrelevant content |
**Good fit:**

- Q&A systems & chatbots
- Customer support agents
- Search result evaluation
- Any query-response system

**Poor fit:**

- Open-ended conversations
- Exploratory discussions
- No clear query/question
- Tasks where tangents are valuable
> **See Also: Faithfulness**
> Answer Relevancy checks if statements address the user's query (topical alignment). Faithfulness checks if claims are grounded in the source context (factual accuracy). Use both together for comprehensive RAG evaluation.
## How It Works

The metric uses an evaluator LLM to decompose the response into atomic statements, then judges each statement's relevance to the query.
### Step-by-Step Process
```mermaid
flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[AI Response]
    end
    subgraph EXTRACT["🔍 Step 1: Statement Extraction"]
        C[StatementExtractor LLM]
        D["Atomic Statements<br/><small>Self-contained facts</small>"]
    end
    subgraph JUDGE["⚖️ Step 2: Relevancy Judgment"]
        E[RelevancyJudge LLM]
        F["Verdict per Statement<br/><small>yes / no / idk</small>"]
    end
    subgraph SCORE["📊 Step 3: Scoring"]
        G["Count Relevant"]
        H["Calculate Ratio"]
        I["Final Score"]
    end
    A & B --> C
    C --> D
    D --> E
    A --> E
    E --> F
    F --> G
    G --> H
    H --> I
    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EXTRACT stroke:#3b82f6,stroke-width:2px
    style JUDGE stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style I fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff
```
Each extracted statement receives a verdict indicating its relevance to the query:

| Verdict | Meaning | Score |
|---|---|---|
| `yes` | Statement directly addresses the query. Clearly relevant. | 1.0 |
| `idk` | Ambiguous relevance. Scores 1.0 by default, or 0.0 with `penalize_ambiguity=True`. | Configurable |
| `no` | Statement is off-topic or doesn't address the query at all. | 0.0 |
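The scoring step reduces to a simple ratio over these verdicts. A plain-Python sketch of the arithmetic (illustrative only, not the library's actual implementation):

```python
def relevancy_score(verdicts, penalize_ambiguity=False):
    """Ratio of relevant statements to total statements.

    'yes' counts as relevant, 'no' as irrelevant; 'idk' counts as
    relevant by default and as irrelevant when penalize_ambiguity=True.
    """
    if not verdicts:
        return 0.0
    relevant = {'yes'} if penalize_ambiguity else {'yes', 'idk'}
    return sum(v in relevant for v in verdicts) / len(verdicts)

print(relevancy_score(['yes', 'idk', 'no']))        # lenient: 2/3
print(relevancy_score(['yes', 'idk', 'no'], True))  # strict: 1/3
```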
## Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `relevancy_mode` | `'strict' \| 'task'` | `'task'` | `strict`: only direct answers count. `task`: helpful related info also counts |
| `penalize_ambiguity` | `bool` | `False` | When `True`, ambiguous (`idk`) verdicts score 0.0 instead of 1.0 |
| `multi_turn_strategy` | `'last_turn' \| 'all_turns'` | `'last_turn'` | How to evaluate conversations |
| `mode` | `EvaluationMode` | `GRANULAR` | Evaluation detail level |
### Relevancy Modes

- `task` mode (default): more lenient; counts closely related, helpful information as relevant
- `strict` mode: only statements that directly answer the question count as relevant
Use `strict` mode for high-precision evaluation where tangential information should be penalized.
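A minimal sketch of such a configuration, using the constructor parameters documented for this metric (treat it as illustrative, not canonical):

```python
from axion.metrics import AnswerRelevancy

# Strict mode: only direct answers count as relevant,
# and ambiguous ('idk') verdicts score 0.0 instead of 1.0
metric = AnswerRelevancy(
    relevancy_mode='strict',
    penalize_ambiguity=True,
)
```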
For conversational AI evaluation:

```python
from axion.metrics import AnswerRelevancy

# Evaluate all turns in a conversation
metric = AnswerRelevancy(
    multi_turn_strategy='all_turns'  # or 'last_turn' (default)
)
```
- `last_turn`: only evaluates the final Human→AI exchange
- `all_turns`: evaluates every turn and aggregates via micro-averaging
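Micro-averaging pools statements across all turns before taking the ratio, rather than averaging per-turn scores. A plain-Python sketch of that aggregation, independent of the library:

```python
def micro_average(turn_verdicts):
    """Pool all statements across turns, then take the relevant ratio.

    turn_verdicts: one list of 'yes'/'idk'/'no' verdicts per AI turn.
    'idk' is treated as relevant here (the default, lenient behavior).
    """
    all_verdicts = [v for turn in turn_verdicts for v in turn]
    if not all_verdicts:
        return 0.0
    return sum(v in ('yes', 'idk') for v in all_verdicts) / len(all_verdicts)

# Turn 1: 2/2 relevant; turn 2: 1/2 relevant -> pooled: 3/4
print(micro_average([['yes', 'yes'], ['yes', 'no']]))  # 0.75
```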
## Code Examples

```python
from axion.metrics import AnswerRelevancy
from axion.dataset import DatasetItem

# Initialize with defaults (task mode, lenient)
metric = AnswerRelevancy()

item = DatasetItem(
    query="What features does this laptop have?",
    actual_output=(
        "The laptop has a 15-inch Retina display and 16GB of RAM. "
        "It also comes with a 1-year warranty. "
        "Our company was founded in 2010."
    ),
)

result = await metric.execute(item)
print(result.pretty())
# Score ~0.67: warranty is borderline, founding year is irrelevant
```
```python
from axion.metrics import AnswerRelevancy
from axion.dataset import DatasetItem, MultiTurnConversation
from axion.schema import HumanMessage, AIMessage

conversation = MultiTurnConversation(messages=[
    HumanMessage(content="What is Python?"),
    AIMessage(content="Python is a programming language known for readability."),
    HumanMessage(content="What are its main uses?"),
    AIMessage(content="Python is used for web dev, data science, and automation."),
])

metric = AnswerRelevancy(multi_turn_strategy='all_turns')
item = DatasetItem(conversation=conversation)

result = await metric.execute(item)
print(f"Evaluated {result.signals.evaluated_turns_count} turns")
```
## Metric Diagnostics

Every evaluation is fully interpretable. Access detailed diagnostic results via `result.signals` to understand exactly why a score was given, with no black boxes.

```python
result = await metric.execute(item)
print(result.pretty())  # Human-readable summary
result.signals          # Full diagnostic breakdown
```
**📊 `AnswerRelevancyResult` Structure**

```
AnswerRelevancyResult(
    {
        "overall_score": 1.0,
        "explanation": "The score is 1.0 because the response fully and accurately explains...",
        "relevant_statements_count": 2,
        "irrelevant_statements_count": 0,
        "ambiguous_statements_count": 0,
        "total_statements_count": 2,
        "statement_breakdown": [
            {
                "statement": "The infield fly rule prevents the defense from dropping a fly ball.",
                "verdict": "yes",
                "is_relevant": true,
                "turn_index": 0
            },
            {
                "statement": "The rule prevents an easy double play when runners are on base.",
                "verdict": "yes",
                "is_relevant": true,
                "turn_index": 0
            }
        ],
        "evaluated_turns_count": 1
    }
)
```
### Signal Fields

| Field | Type | Description |
|---|---|---|
| `overall_score` | `float` | The 0–1 relevancy score |
| `explanation` | `str` | Human-readable summary of why the score was given |
| `relevant_statements_count` | `int` | Count of `yes` verdicts |
| `irrelevant_statements_count` | `int` | Count of `no` verdicts |
| `ambiguous_statements_count` | `int` | Count of `idk` verdicts |
| `total_statements_count` | `int` | Total statements extracted |
| `statement_breakdown` | `List` | Per-statement verdict details |
| `evaluated_turns_count` | `int` | Number of conversation turns evaluated |
## Example Scenarios
### ✅ Scenario 1: Perfect Relevancy (Score: 1.0)

*All statements relevant.*

**Query:** "What are the health benefits of green tea?"

**AI Response:** "Green tea contains antioxidants that may reduce inflammation. It also has caffeine which can improve alertness."

**Analysis:**
| Statement | Verdict | Score |
|---|---|---|
| Green tea contains antioxidants that may reduce inflammation | yes | 1.0 |
| Green tea has caffeine which can improve alertness | yes | 1.0 |
Final Score: 2 / 2 = 1.0
### ⚠️ Scenario 2: Partial Relevancy (Score: 0.67)

*Mixed verdicts.*

**Query:** "What features does this laptop have?"

**AI Response:** "The laptop has a 15-inch display. It has 16GB RAM. Our company has excellent customer service."

**Analysis:**
| Statement | Verdict | Score |
|---|---|---|
| The laptop has a 15-inch display | yes | 1.0 |
| The laptop has 16GB RAM | yes | 1.0 |
| Our company has excellent customer service | no | 0.0 |
Final Score: 2 / 3 = 0.67
The customer service statement doesn't address laptop features.
### ❌ Scenario 3: Poor Relevancy (Score: 0.25)

*Mostly off-topic.*

**Query:** "How do I reset my password?"

**AI Response:** "Our platform uses industry-standard encryption. We were founded in 2015. Password resets can be done via email. We have offices in 3 countries."

**Analysis:**
| Statement | Verdict | Score |
|---|---|---|
| Our platform uses industry-standard encryption | no | 0.0 |
| We were founded in 2015 | no | 0.0 |
| Password resets can be done via email | yes | 1.0 |
| We have offices in 3 countries | no | 0.0 |
Final Score: 1 / 4 = 0.25
Only one statement actually answers the question.
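All three scenario scores are just relevant-to-total ratios; a quick plain-Python check (no library required):

```python
def score(verdicts):
    # relevant ('yes') statements divided by total statements
    return sum(v == 'yes' for v in verdicts) / len(verdicts)

print(score(['yes', 'yes']))                  # Scenario 1 -> 1.0
print(round(score(['yes', 'yes', 'no']), 2))  # Scenario 2 -> 0.67
print(score(['no', 'no', 'yes', 'no']))       # Scenario 3 -> 0.25
```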
## Why It Matters

- Users expect direct answers. Off-topic responses frustrate users and reduce trust in your AI system.
- For chatbots and assistants, staying on topic is crucial. Tangential responses break conversational flow.
- The metric identifies when your model goes off-topic, separate from retrieval issues (Faithfulness) or factual errors.
## Quick Reference

**TL;DR:** Answer Relevancy = does the AI's response actually address what the user asked?

- **Use it when:** you need to ensure responses stay on topic
- **Score interpretation:** higher = more statements directly address the query
- **Key config:** use `relevancy_mode='strict'` for precision, `'task'` for lenient evaluation

**See also:**

- API Reference
- Related metrics: Faithfulness · Answer Completeness · Context Precision