Citation Relevancy¶
LLM-Powered · Knowledge · Multi-Turn · Citation
At a Glance¶
**Score Range:** 0.0 – 1.0 (ratio of relevant citations)
**Default Threshold:** 0.8 (high bar for citation quality)
**Required Inputs:** `query`, `actual_output` (optional: `conversation`)
What It Measures
Citation Relevancy evaluates whether the citations included in an AI response are actually relevant to the user's query. It extracts citations using pattern matching, then judges each citation's relevance using an LLM. Essential for research assistants and fact-checking systems.
| Score | Interpretation |
|---|---|
| 1.0 | All citations directly relevant to query |
| 0.8+ | Most citations relevant, minor tangents |
| 0.5 | Mixed relevance: some helpful, some off-topic |
| < 0.5 | Mostly irrelevant or unrelated citations |
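The bands in the table above can be read off programmatically, e.g. for dashboards or alerting. This is an illustrative helper, not part of the library; the band labels paraphrase the table:

```python
def interpret_citation_relevancy(score: float) -> str:
    """Map a citation relevancy score to the interpretation bands above."""
    if score == 1.0:
        return "all citations relevant"
    if score >= 0.8:
        return "mostly relevant, minor tangents"
    if score >= 0.5:
        return "mixed relevance"
    return "mostly irrelevant"
```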
**Use it for:**

- Building research assistants
- Fact-checking systems
- Academic writing tools
- Any system that generates citations

**Avoid it when:**

- Responses don't include citations
- Citation format is non-standard
- Links are internal (not external sources)
- The system is pure conversational AI
See Also: Faithfulness
Citation Relevancy checks if cited sources are relevant to the query. Faithfulness checks if claims are grounded in retrieved context.
Use Citation Relevancy for output validation; use Faithfulness for RAG grounding.
How It Works
The metric uses regex-based extraction followed by LLM-based relevance judgment.
Step-by-Step Process¶
```mermaid
flowchart TD
    subgraph INPUT["Inputs"]
        A[Query]
        B[AI Response with Citations]
    end
    subgraph EXTRACT["Step 1: Citation Extraction"]
        C[Regex Pattern Matching]
        D["Extracted Citations"]
    end
    subgraph JUDGE["Step 2: Relevance Judgment"]
        E[CitationRelevanceJudge LLM]
        F["Verdict per Citation"]
    end
    subgraph SCORE["Step 3: Scoring"]
        G["Count Relevant"]
        H["Calculate Ratio"]
        I["Final Score"]
    end
    A & B --> C
    C --> D
    D --> E
    A --> E
    E --> F
    F --> G
    G --> H
    H --> I
    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EXTRACT stroke:#3b82f6,stroke-width:2px
    style JUDGE stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style I fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff
```
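The three steps can be sketched as a pure function. Here `extract` and `judge` are injected stand-ins (the real metric uses regex extraction and an LLM judge), and the zero-citation behavior is an assumption, not documented library behavior:

```python
from typing import Callable

def citation_relevancy_score(
    query: str,
    response: str,
    extract: Callable[[str], list],       # Step 1: pull citations from the response
    judge: Callable[[str, str], bool],    # Step 2: stand-in for the LLM relevance judge
) -> float:
    """Step 3: score = relevant citations / total citations."""
    citations = extract(response)
    if not citations:
        return 0.0  # assumption: real handling of zero citations may differ
    relevant = sum(judge(query, c) for c in citations)
    return relevant / len(citations)
```

With a toy judge that rejects one of three citations, the ratio comes out to 2/3, matching the 0.67 score in the basic example below.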
The metric extracts citations using multiple regex patterns:

- Markdown links: `[Title](https://example.com/article)`
- Bare URLs: `https://example.com/article`, `www.example.com/page`
- DOIs: `doi:10.1234/example`
- Academic citations: `(Smith et al., 2023)` or `(Smith, 2023)`
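A minimal sketch of this kind of pattern-based extraction. The patterns below are illustrative, not the library's actual regexes; earlier patterns consume text first so a URL inside a Markdown link is not double-counted:

```python
import re

# Illustrative patterns only -- not the library's actual implementation.
CITATION_PATTERNS = [
    r"\[[^\]]+\]\([^)]+\)",                        # Markdown links: [Title](url)
    r"https?://\S+",                               # bare URLs
    r"\bwww\.\S+",                                 # www-prefixed URLs
    r"\bdoi:\S+",                                  # DOIs
    r"\([A-Z][a-z]+(?: et al\.)?, \d{4}\)",        # (Smith et al., 2023) / (Smith, 2023)
]

def extract_citations(text: str) -> list:
    """Apply patterns in priority order, removing each match before the next pass."""
    found, remaining = [], text
    for pattern in CITATION_PATTERNS:
        found.extend(re.findall(pattern, remaining))
        remaining = re.sub(pattern, " ", remaining)
    return found
```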
Configuration¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `multi_turn_strategy` | `'last_turn' \| 'all_turns'` | `'last_turn'` | How to evaluate conversations |
| `mode` | `EvaluationMode` | `GRANULAR` | Evaluation detail level |
Multi-Turn Support
In multi-turn conversations, citations are associated with their corresponding query context:
- `last_turn`: Only evaluates citations in the final response
- `all_turns`: Evaluates citations across all turns, matching each to its original query
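The `all_turns` pairing can be sketched as follows: each AI message's citations are judged against the human message that preceded it. This is a hypothetical helper over plain `(role, content)` tuples, not the library's internal representation:

```python
def pair_turns(messages: list) -> list:
    """Pair each AI response with the preceding human query.

    messages: list of (role, content) tuples, role in {"human", "ai"}.
    Returns (query, response) pairs, one per AI turn.
    """
    pairs, last_query = [], None
    for role, content in messages:
        if role == "human":
            last_query = content
        elif role == "ai" and last_query is not None:
            pairs.append((last_query, content))
    return pairs
```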
Code Examples¶
```python
from axion.metrics import CitationRelevancy
from axion.dataset import DatasetItem

metric = CitationRelevancy()

item = DatasetItem(
    query="What are the health benefits of green tea?",
    actual_output="""
    Green tea has numerous health benefits:
    1. Rich in antioxidants [Source](https://healthline.com/green-tea-benefits)
    2. May improve brain function (Smith et al., 2020)
    3. Great for parties! [Party Guide](https://party-planning.com)
    """,
)

result = await metric.execute(item)
print(result.pretty())
# Score: 0.67 (2 of 3 citations relevant)
```
```python
from axion.metrics import CitationRelevancy
from axion.dataset import DatasetItem, MultiTurnConversation
from axion.schema import HumanMessage, AIMessage

conversation = MultiTurnConversation(messages=[
    HumanMessage(content="What causes climate change?"),
    AIMessage(content="Climate change is primarily caused by greenhouse gases. [IPCC Report](https://ipcc.ch/report)"),
    HumanMessage(content="How can I reduce my carbon footprint?"),
    AIMessage(content="You can reduce emissions by using public transport. [EPA Guide](https://epa.gov/guide)"),
])

metric = CitationRelevancy(multi_turn_strategy='all_turns')
item = DatasetItem(conversation=conversation)

result = await metric.execute(item)
print(f"Evaluated {result.signals.total_citations} citations across turns")
```
```python
from axion.metrics import CitationRelevancy
from axion.runners import MetricRunner

metric = CitationRelevancy()
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)

for item_result in results:
    print(f"Score: {item_result.score}")
    print(f"Relevant: {item_result.signals.relevant_citations_count}/{item_result.signals.total_citations}")
    for citation in item_result.signals.citation_breakdown:
        status = "✅" if citation.relevance_verdict else "❌"
        print(f"  {status} {citation.citation_text[:50]}...")
```
Metric Diagnostics¶
Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given, with no black boxes.
```python
result = await metric.execute(item)
print(result.pretty())  # Human-readable summary
result.signals          # Full diagnostic breakdown
```
CitationRelevancyResult Structure
```
CitationRelevancyResult(
  {
    "relevance_score": 0.67,
    "total_citations": 3,
    "relevant_citations_count": 2,
    "irrelevant_citations_count": 1,
    "citation_breakdown": [
      {
        "citation_text": "[Source](https://healthline.com/green-tea-benefits)",
        "relevance_verdict": true,
        "relevance_reason": "Directly addresses health benefits of green tea",
        "turn_index": 0,
        "original_query": "What are the health benefits of green tea?"
      },
      {
        "citation_text": "(Smith et al., 2020)",
        "relevance_verdict": true,
        "relevance_reason": "Academic source on tea and brain function",
        "turn_index": 0,
        "original_query": "What are the health benefits of green tea?"
      },
      {
        "citation_text": "[Party Guide](https://party-planning.com)",
        "relevance_verdict": false,
        "relevance_reason": "Party planning is unrelated to health benefits",
        "turn_index": 0,
        "original_query": "What are the health benefits of green tea?"
      }
    ]
  }
)
```
Signal Fields¶
| Field | Type | Description |
|---|---|---|
| `relevance_score` | `float` | Ratio of relevant citations (0.0-1.0) |
| `total_citations` | `int` | Total citations extracted |
| `relevant_citations_count` | `int` | Count of relevant citations |
| `irrelevant_citations_count` | `int` | Count of irrelevant citations |
| `citation_breakdown` | `List` | Per-citation verdict details |
Citation Breakdown Fields¶
| Field | Type | Description |
|---|---|---|
| `citation_text` | `str` | The extracted citation |
| `relevance_verdict` | `bool` | Whether the citation is relevant |
| `relevance_reason` | `str` | Explanation for the verdict |
| `turn_index` | `int` | Conversation turn (for multi-turn) |
| `original_query` | `str` | Query that prompted this citation |
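For multi-turn runs, the breakdown can be grouped by `original_query` to see which turns attract weak citations. A sketch over plain dicts shaped like the signals above (the real `citation_breakdown` entries are objects, so attribute access may apply instead):

```python
from collections import defaultdict

def relevancy_by_query(citation_breakdown: list) -> dict:
    """Per-query relevant-citation ratio from a citation_breakdown-shaped list of dicts."""
    verdicts = defaultdict(list)
    for c in citation_breakdown:
        verdicts[c["original_query"]].append(c["relevance_verdict"])
    return {q: sum(v) / len(v) for q, v in verdicts.items()}
```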
Example Scenarios¶
✅ Scenario 1: All Citations Relevant (Score: 1.0)
High-Quality Citations
Query:
"What are effective treatments for type 2 diabetes?"
AI Response:
"Type 2 diabetes can be managed through:
- Lifestyle changes including diet and exercise ADA Guidelines
- Metformin as first-line medication (Johnson et al., 2021)
- Regular blood glucose monitoring CDC Recommendations"
Analysis:
| Citation | Verdict | Reason |
|---|---|---|
| ADA Guidelines | ✅ | Authoritative diabetes organization |
| Johnson et al., 2021 | ✅ | Academic source on medication |
| CDC Recommendations | ✅ | Government health authority |
Final Score: 3 / 3 = 1.0
⚠️ Scenario 2: Mixed Relevance (Score: 0.5)
Some Citations Off-Topic
Query:
"How does machine learning work?"
AI Response:
"Machine learning uses algorithms to learn from data:
- Neural networks mimic brain structure ML Textbook
- Training requires large datasets Data Science Blog
- My favorite coffee shop uses ML! Best Coffee
- Check out this unrelated video Cat Video"
Analysis:
| Citation | Verdict | Reason |
|---|---|---|
| ML Textbook | ✅ | Directly about machine learning |
| Data Science Blog | ✅ | Relevant to ML data requirements |
| Best Coffee | ❌ | Coffee reviews unrelated to ML |
| Cat Video | ❌ | Entertainment, not educational |
Final Score: 2 / 4 = 0.5
❌ Scenario 3: Mostly Irrelevant (Score: 0.25)
Citation Spam
Query:
"What is the capital of France?"
AI Response:
"Paris is the capital of France. Here are some links:
My Portfolio · Buy Cheap Flights · Wikipedia - France · Dating Site"
Analysis:
| Citation | Verdict | Reason |
|---|---|---|
| My Portfolio | ❌ | Self-promotion, irrelevant |
| Buy Cheap Flights | ❌ | Commercial, off-topic |
| Wikipedia - France | ✅ | Relevant geographic source |
| Dating Site | ❌ | Completely unrelated |
Final Score: 1 / 4 = 0.25
Why It Matters¶
Ensures AI-generated citations actually support the response, not random links or self-promotion.
Critical for academic and research tools where citations must be relevant and authoritative.
Users expect citations to be helpful. Irrelevant citations damage credibility and waste time.
Quick Reference¶
TL;DR
Citation Relevancy = Are the citations actually relevant to the user's question?
- Use it when: AI responses include citations that need quality validation
- Score interpretation: Higher = more citations are relevant
- Key feature: Supports multiple citation formats (URLs, DOIs, academic)
- API Reference
- Related Metrics: Faithfulness · Answer Relevancy · Factual Accuracy