
Citation Relevancy

Measure the quality and relevance of citations in AI responses
Tags: LLM-Powered · Knowledge · Multi-Turn · Citation

At a Glance

  • 🎯 Score Range: 0.0-1.0 (ratio of relevant citations)
  • ⚡ Default Threshold: 0.8 (a high bar for citation quality)
  • 📋 Required Inputs: query, actual_output (optional: conversation)

What It Measures

Citation Relevancy evaluates whether the citations included in an AI response are actually relevant to the user's query. It extracts citations using pattern matching, then judges each citation's relevance using an LLM. Essential for research assistants and fact-checking systems.

Score Interpretation

  • 1.0: All citations directly relevant to the query
  • 0.8+: Most citations relevant, minor tangents
  • 0.5: Mixed relevance; some helpful, some off-topic
  • < 0.5: Mostly irrelevant or unrelated citations
✅ Use When
  • Building research assistants
  • Fact-checking systems
  • Academic writing tools
  • Any system that generates citations
❌ Don't Use When
  • Responses don't include citations
  • Citation format is non-standard
  • Internal linking (not external sources)
  • Pure conversational AI

See Also: Faithfulness

Citation Relevancy checks if cited sources are relevant to the query. Faithfulness checks if claims are grounded in retrieved context.

Use Citation Relevancy for output validation; use Faithfulness for RAG grounding.


How It Works

The metric uses regex-based extraction followed by LLM-based relevance judgment.

Step-by-Step Process

flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[AI Response with Citations]
    end

    subgraph EXTRACT["🔍 Step 1: Citation Extraction"]
        C[Regex Pattern Matching]
        D["Extracted Citations"]
    end

    subgraph JUDGE["⚖️ Step 2: Relevance Judgment"]
        E[CitationRelevanceJudge LLM]
        F["Verdict per Citation"]
    end

    subgraph SCORE["📊 Step 3: Scoring"]
        G["Count Relevant"]
        H["Calculate Ratio"]
        I["Final Score"]
    end

    A & B --> C
    C --> D
    D --> E
    A --> E
    E --> F
    F --> G
    G --> H
    H --> I

    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EXTRACT stroke:#3b82f6,stroke-width:2px
    style JUDGE stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style I fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff

The metric extracts citations using multiple regex patterns:

πŸ“ Markdown Links
Title

πŸ”— HTTP/HTTPS URLs
https://example.com/article

🌐 WWW URLs
www.example.com/page

πŸ“š DOI Patterns
doi:10.1234/example

πŸŽ“ Academic Format
(Smith et al., 2023) or (Smith, 2023)
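The extraction step can be approximated with a handful of regexes. The patterns below are illustrative stand-ins for the formats listed above, not the library's actual implementation:

```python
import re

# Illustrative patterns only -- the library's actual regexes may differ.
CITATION_PATTERNS = [
    r"\[[^\]]+\]\(https?://[^)]+\)",         # markdown links: [Title](url)
    r"(?<!\()https?://\S+",                  # bare HTTP/HTTPS URLs (not inside a markdown link)
    r"(?<![\w/.])www\.\S+",                  # www URLs
    r"doi:10\.\d{4,}/\S+",                   # DOI patterns
    r"\([A-Z][a-z]+(?: et al\.)?, \d{4}\)",  # (Smith et al., 2023) or (Smith, 2023)
]

def extract_citations(text: str) -> list[str]:
    """Return every citation-like span found in text, in pattern order."""
    found: list[str] = []
    for pattern in CITATION_PATTERNS:
        found.extend(re.findall(pattern, text))
    return found
```

Note the negative lookbehind on the bare-URL pattern: without it, the URL inside a markdown link would be counted twice.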

Each citation receives a binary relevance verdict:

  • ✅ RELEVANT (1): The citation directly supports answering the user's query.
  • ❌ IRRELEVANT (0): The citation is off-topic or doesn't help answer the question.

Score Formula

score = relevant_citations / total_citations

Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| multi_turn_strategy | 'last_turn' \| 'all_turns' | 'last_turn' | How to evaluate conversations |
| mode | EvaluationMode | GRANULAR | Evaluation detail level |

Multi-Turn Support

In multi-turn conversations, citations are associated with their corresponding query context:

  • last_turn: Only evaluates citations in the final response
  • all_turns: Evaluates citations across all turns, matching each to its original query
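One way to picture the all_turns strategy: walk the conversation, remember the latest human query, and attach each AI turn's citations to it. This is an illustrative sketch (pair_citations_with_queries is not an axion API), using a markdown-link regex as a stand-in extractor:

```python
import re

MARKDOWN_LINK = re.compile(r"\[[^\]]+\]\(https?://[^)]+\)")

def pair_citations_with_queries(messages: list[tuple[str, str]]) -> list[dict]:
    """Attach each AI turn's citations to the most recent human query.

    messages: (role, text) tuples with role in {"human", "ai"}.
    """
    pairs: list[dict] = []
    current_query = None
    for turn_index, (role, text) in enumerate(messages):
        if role == "human":
            current_query = text  # remember the query this AI turn answers
        else:
            for citation in MARKDOWN_LINK.findall(text):
                pairs.append({
                    "turn_index": turn_index,
                    "original_query": current_query,
                    "citation_text": citation,
                })
    return pairs
```

Each citation can then be judged against its own original_query rather than only the final one.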

Code Examples

Single-turn evaluation:

from axion.metrics import CitationRelevancy
from axion.dataset import DatasetItem

metric = CitationRelevancy()

item = DatasetItem(
    query="What are the health benefits of green tea?",
    actual_output="""
    Green tea has numerous health benefits:

    1. Rich in antioxidants [Source](https://healthline.com/green-tea-benefits)
    2. May improve brain function (Smith et al., 2020)
    3. Great for parties! [Party Guide](https://party-planning.com)
    """,
)

result = await metric.execute(item)
print(result.pretty())
# Score: 0.67 (2 of 3 citations relevant)

Multi-turn evaluation:

from axion.metrics import CitationRelevancy
from axion.dataset import DatasetItem, MultiTurnConversation
from axion.schema import HumanMessage, AIMessage

conversation = MultiTurnConversation(messages=[
    HumanMessage(content="What causes climate change?"),
    AIMessage(content="Climate change is primarily caused by greenhouse gases. [IPCC Report](https://ipcc.ch/report)"),
    HumanMessage(content="How can I reduce my carbon footprint?"),
    AIMessage(content="You can reduce emissions by using public transport. [EPA Guide](https://epa.gov/guide)"),
])

metric = CitationRelevancy(multi_turn_strategy='all_turns')
item = DatasetItem(conversation=conversation)

result = await metric.execute(item)
print(f"Evaluated {result.signals.total_citations} citations across turns")

Batch evaluation with MetricRunner:

from axion.metrics import CitationRelevancy
from axion.runners import MetricRunner

metric = CitationRelevancy()
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)

for item_result in results:
    print(f"Score: {item_result.score}")
    print(f"Relevant: {item_result.signals.relevant_citations_count}/{item_result.signals.total_citations}")
    for citation in item_result.signals.citation_breakdown:
        status = "✅" if citation.relevance_verdict else "❌"
        print(f"  {status} {citation.citation_text[:50]}...")

Metric Diagnostics

Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given, with no black boxes.

result = await metric.execute(item)
print(result.pretty())      # Human-readable summary
result.signals              # Full diagnostic breakdown
📊 CitationRelevancyResult Structure
CitationRelevancyResult(
{
    "relevance_score": 0.67,
    "total_citations": 3,
    "relevant_citations_count": 2,
    "irrelevant_citations_count": 1,
    "citation_breakdown": [
        {
            "citation_text": "[Source](https://healthline.com/green-tea-benefits)",
            "relevance_verdict": true,
            "relevance_reason": "Directly addresses health benefits of green tea",
            "turn_index": 0,
            "original_query": "What are the health benefits of green tea?"
        },
        {
            "citation_text": "(Smith et al., 2020)",
            "relevance_verdict": true,
            "relevance_reason": "Academic source on tea and brain function",
            "turn_index": 0,
            "original_query": "What are the health benefits of green tea?"
        },
        {
            "citation_text": "[Party Guide](https://party-planning.com)",
            "relevance_verdict": false,
            "relevance_reason": "Party planning is unrelated to health benefits",
            "turn_index": 0,
            "original_query": "What are the health benefits of green tea?"
        }
    ]
}
)

Signal Fields

| Field | Type | Description |
|---|---|---|
| relevance_score | float | Ratio of relevant citations (0.0-1.0) |
| total_citations | int | Total citations extracted |
| relevant_citations_count | int | Count of relevant citations |
| irrelevant_citations_count | int | Count of irrelevant citations |
| citation_breakdown | List | Per-citation verdict details |

Citation Breakdown Fields

| Field | Type | Description |
|---|---|---|
| citation_text | str | The extracted citation |
| relevance_verdict | bool | Whether the citation is relevant |
| relevance_reason | str | Explanation for the verdict |
| turn_index | int | Conversation turn (for multi-turn) |
| original_query | str | Query that prompted this citation |

Example Scenarios

✅ Scenario 1: All Citations Relevant (Score: 1.0)

High-Quality Citations

Query:

"What are effective treatments for type 2 diabetes?"

AI Response:

"Type 2 diabetes can be managed through:

  1. Lifestyle changes including diet and exercise ADA Guidelines
  2. Metformin as first-line medication (Johnson et al., 2021)
  3. Regular blood glucose monitoring CDC Recommendations"

Analysis:

| Citation | Verdict | Reason |
|---|---|---|
| ADA Guidelines | ✅ | Authoritative diabetes organization |
| Johnson et al., 2021 | ✅ | Academic source on medication |
| CDC Recommendations | ✅ | Government health authority |

Final Score: 3 / 3 = 1.0

⚠️ Scenario 2: Mixed Relevance (Score: 0.5)

Some Citations Off-Topic

Query:

"How does machine learning work?"

AI Response:

"Machine learning uses algorithms to learn from data. Sources: ML Textbook, Data Science Blog, Best Coffee, Cat Video"

Analysis:

| Citation | Verdict | Reason |
|---|---|---|
| ML Textbook | ✅ | Directly about machine learning |
| Data Science Blog | ✅ | Relevant to ML data requirements |
| Best Coffee | ❌ | Coffee reviews unrelated to ML |
| Cat Video | ❌ | Entertainment, not educational |

Final Score: 2 / 4 = 0.5

❌ Scenario 3: Mostly Irrelevant (Score: 0.25)

Citation Spam

Query:

"What is the capital of France?"

AI Response:

"Paris is the capital of France. Here are some links: My Portfolio, Buy Cheap Flights, Wikipedia - France, Dating Site"

Analysis:

| Citation | Verdict | Reason |
|---|---|---|
| My Portfolio | ❌ | Self-promotion, irrelevant |
| Buy Cheap Flights | ❌ | Commercial, off-topic |
| Wikipedia - France | ✅ | Relevant geographic source |
| Dating Site | ❌ | Completely unrelated |

Final Score: 1 / 4 = 0.25


Why It Matters

πŸ” Source Quality

Ensures AI-generated citations actually support the response, not random links or self-promotion.

🎓 Research Integrity

Critical for academic and research tools where citations must be relevant and authoritative.

✅ User Trust

Users expect citations to be helpful. Irrelevant citations damage credibility and waste time.


Quick Reference

TL;DR

Citation Relevancy = Are the citations actually relevant to the user's question?

  • Use it when: AI responses include citations that need quality validation
  • Score interpretation: Higher = more citations are relevant
  • Key feature: Supports multiple citation formats (URLs, DOIs, academic)