
Citation Relevancy

Measure the quality and relevance of citations in AI responses
Tags: LLM-Powered · Knowledge · Multi-Turn · Citation

At a Glance

  • 🎯 Score Range: 0.0-1.0 (ratio of relevant citations)
  • ⚡ Default Threshold: 0.8 (a high bar for citation quality)
  • 📋 Required Inputs: query, actual_output (optional: conversation)

What It Measures

Citation Relevancy evaluates whether the citations included in an AI response are actually relevant to the user's query. It extracts citations using pattern matching, then judges each citation's relevance using an LLM. Essential for research assistants and fact-checking systems.

Score Interpretation

  • 1.0: All citations directly relevant to the query
  • 0.8+: Most citations relevant, minor tangents
  • 0.5: Mixed relevance; some helpful, some off-topic
  • < 0.5: Mostly irrelevant or unrelated citations
✅ Use When
  • Building research assistants
  • Fact-checking systems
  • Academic writing tools
  • Any system that generates citations
❌ Don't Use When
  • Responses don't include citations
  • Citation format is non-standard
  • Internal linking (not external sources)
  • Pure conversational AI

See Also: Faithfulness

Citation Relevancy checks if cited sources are relevant to the query. Faithfulness checks if claims are grounded in retrieved context.

Use Citation Relevancy for output validation; use Faithfulness for RAG grounding.


How It Works

The metric uses regex-based extraction followed by LLM-based relevance judgment.

Step-by-Step Process

flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[AI Response with Citations]
    end

    subgraph EXTRACT["🔍 Step 1: Citation Extraction"]
        C[Regex Pattern Matching]
        D["Extracted Citations"]
    end

    subgraph JUDGE["⚖️ Step 2: Relevance Judgment"]
        E[CitationRelevanceJudge LLM]
        F["Verdict per Citation"]
    end

    subgraph SCORE["📊 Step 3: Scoring"]
        G["Count Relevant"]
        H["Calculate Ratio"]
        I["Final Score"]
    end

    A & B --> C
    C --> D
    D --> E
    A --> E
    E --> F
    F --> G
    G --> H
    H --> I

    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EXTRACT stroke:#3b82f6,stroke-width:2px
    style JUDGE stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style I fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff

The metric extracts citations using multiple regex patterns:

πŸ“ Markdown Links
Title

πŸ”— HTTP/HTTPS URLs
https://example.com/article

🌐 WWW URLs
www.example.com/page

πŸ“š DOI Patterns
doi:10.1234/example

πŸŽ“ Academic Format
(Smith et al., 2023) or (Smith, 2023)
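The extraction step can be approximated with a handful of regexes. The patterns below are illustrative stand-ins for the formats listed above, not the library's actual implementation:

```python
import re

# Illustrative patterns only -- the library's actual regexes may differ.
CITATION_PATTERNS = [
    r"\[[^\]]+\]\(https?://[^)]+\)",         # markdown links: [Title](url)
    r"(?<!\()https?://\S+",                  # bare HTTP/HTTPS URLs (not inside a markdown link)
    r"(?<![\w/.])www\.\S+",                  # www URLs
    r"doi:10\.\d{4,}/\S+",                   # DOI patterns
    r"\([A-Z][a-z]+(?: et al\.)?, \d{4}\)",  # (Smith et al., 2023) or (Smith, 2023)
]

def extract_citations(text: str) -> list[str]:
    """Return every citation-like span found in text, in pattern order."""
    found: list[str] = []
    for pattern in CITATION_PATTERNS:
        found.extend(re.findall(pattern, text))
    return found
```

Note the negative lookbehind on the bare-URL pattern: without it, the URL inside a markdown link would be counted twice.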

Each citation receives a binary relevance verdict:

  • ✅ RELEVANT (1): The citation directly supports answering the user's query.
  • ❌ IRRELEVANT (0): The citation is off-topic or doesn't help answer the question.

Score Formula

score = relevant_citations / total_citations

Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| multi_turn_strategy | 'last_turn' \| 'all_turns' | 'last_turn' | How to evaluate conversations |
| mode | EvaluationMode | GRANULAR | Evaluation detail level |

Multi-Turn Support

In multi-turn conversations, citations are associated with their corresponding query context:

  • last_turn: Only evaluates citations in the final response
  • all_turns: Evaluates citations across all turns, matching each to its original query
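One way to picture the all_turns strategy: walk the conversation, remember the latest human query, and attach each AI turn's citations to it. This is an illustrative sketch (pair_citations_with_queries is not an axion API), using a markdown-link regex as a stand-in extractor:

```python
import re

MARKDOWN_LINK = re.compile(r"\[[^\]]+\]\(https?://[^)]+\)")

def pair_citations_with_queries(messages: list[tuple[str, str]]) -> list[dict]:
    """Attach each AI turn's citations to the most recent human query.

    messages: (role, text) tuples with role in {"human", "ai"}.
    """
    pairs: list[dict] = []
    current_query = None
    for turn_index, (role, text) in enumerate(messages):
        if role == "human":
            current_query = text  # remember the query this AI turn answers
        else:
            for citation in MARKDOWN_LINK.findall(text):
                pairs.append({
                    "turn_index": turn_index,
                    "original_query": current_query,
                    "citation_text": citation,
                })
    return pairs
```

Each citation can then be judged against its own original_query rather than only the final one.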

Code Examples

Single-turn evaluation:

from axion.metrics import CitationRelevancy
from axion.dataset import DatasetItem

metric = CitationRelevancy()

item = DatasetItem(
    query="What are the health benefits of green tea?",
    actual_output="""
    Green tea has numerous health benefits:

    1. Rich in antioxidants [Source](https://healthline.com/green-tea-benefits)
    2. May improve brain function (Smith et al., 2020)
    3. Great for parties! [Party Guide](https://party-planning.com)
    """,
)

result = await metric.execute(item)
print(result.pretty())
# Score: 0.67 (2 of 3 citations relevant)

Multi-turn evaluation:

from axion.metrics import CitationRelevancy
from axion.dataset import DatasetItem, MultiTurnConversation
from axion.schema import HumanMessage, AIMessage

conversation = MultiTurnConversation(messages=[
    HumanMessage(content="What causes climate change?"),
    AIMessage(content="Climate change is primarily caused by greenhouse gases. [IPCC Report](https://ipcc.ch/report)"),
    HumanMessage(content="How can I reduce my carbon footprint?"),
    AIMessage(content="You can reduce emissions by using public transport. [EPA Guide](https://epa.gov/guide)"),
])

metric = CitationRelevancy(multi_turn_strategy='all_turns')
item = DatasetItem(conversation=conversation)

result = await metric.execute(item)
print(f"Evaluated {result.signals.total_citations} citations across turns")

Batch evaluation with MetricRunner:

from axion.metrics import CitationRelevancy
from axion.runners import MetricRunner

metric = CitationRelevancy()
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)

for item_result in results:
    print(f"Score: {item_result.score}")
    print(f"Relevant: {item_result.signals.relevant_citations_count}/{item_result.signals.total_citations}")
    for citation in item_result.signals.citation_breakdown:
        status = "✅" if citation.relevance_verdict else "❌"
        print(f"  {status} {citation.citation_text[:50]}...")

Metric Diagnostics

Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given, with no black boxes.

result = await metric.execute(item)
print(result.pretty())      # Human-readable summary
result.signals              # Full diagnostic breakdown
📊 CitationRelevancyResult Structure
CitationRelevancyResult(
{
    "relevance_score": 0.67,
    "total_citations": 3,
    "relevant_citations_count": 2,
    "irrelevant_citations_count": 1,
    "citation_breakdown": [
        {
            "citation_text": "[Source](https://healthline.com/green-tea-benefits)",
            "relevance_verdict": true,
            "relevance_reason": "Directly addresses health benefits of green tea",
            "turn_index": 0,
            "original_query": "What are the health benefits of green tea?"
        },
        {
            "citation_text": "(Smith et al., 2020)",
            "relevance_verdict": true,
            "relevance_reason": "Academic source on tea and brain function",
            "turn_index": 0,
            "original_query": "What are the health benefits of green tea?"
        },
        {
            "citation_text": "[Party Guide](https://party-planning.com)",
            "relevance_verdict": false,
            "relevance_reason": "Party planning is unrelated to health benefits",
            "turn_index": 0,
            "original_query": "What are the health benefits of green tea?"
        }
    ]
}
)

Signal Fields

| Field | Type | Description |
|---|---|---|
| relevance_score | float | Ratio of relevant citations (0.0-1.0) |
| total_citations | int | Total citations extracted |
| relevant_citations_count | int | Count of relevant citations |
| irrelevant_citations_count | int | Count of irrelevant citations |
| citation_breakdown | List | Per-citation verdict details |

Citation Breakdown Fields

| Field | Type | Description |
|---|---|---|
| citation_text | str | The extracted citation |
| relevance_verdict | bool | Whether the citation is relevant |
| relevance_reason | str | Explanation for the verdict |
| turn_index | int | Conversation turn (for multi-turn) |
| original_query | str | Query that prompted this citation |

Example Scenarios

✅ Scenario 1: All Citations Relevant (Score: 1.0)

High-Quality Citations

Query:

"What are effective treatments for type 2 diabetes?"

AI Response:

"Type 2 diabetes can be managed through:

  1. Lifestyle changes including diet and exercise ADA Guidelines
  2. Metformin as first-line medication (Johnson et al., 2021)
  3. Regular blood glucose monitoring CDC Recommendations"

Analysis:

| Citation | Verdict | Reason |
|---|---|---|
| ADA Guidelines | ✅ | Authoritative diabetes organization |
| Johnson et al., 2021 | ✅ | Academic source on medication |
| CDC Recommendations | ✅ | Government health authority |

Final Score: 3 / 3 = 1.0

⚠️ Scenario 2: Mixed Relevance (Score: 0.5)

Some Citations Off-Topic

Query:

"How does machine learning work?"

AI Response:

"Machine learning uses algorithms to learn from data. Sources: ML Textbook, Data Science Blog, Best Coffee, Cat Video"

Analysis:

| Citation | Verdict | Reason |
|---|---|---|
| ML Textbook | ✅ | Directly about machine learning |
| Data Science Blog | ✅ | Relevant to ML data requirements |
| Best Coffee | ❌ | Coffee reviews unrelated to ML |
| Cat Video | ❌ | Entertainment, not educational |

Final Score: 2 / 4 = 0.5

❌ Scenario 3: Mostly Irrelevant (Score: 0.25)

Citation Spam

Query:

"What is the capital of France?"

AI Response:

"Paris is the capital of France. Here are some links: My Portfolio, Buy Cheap Flights, Wikipedia - France, Dating Site"

Analysis:

| Citation | Verdict | Reason |
|---|---|---|
| My Portfolio | ❌ | Self-promotion, irrelevant |
| Buy Cheap Flights | ❌ | Commercial, off-topic |
| Wikipedia - France | ✅ | Relevant geographic source |
| Dating Site | ❌ | Completely unrelated |

Final Score: 1 / 4 = 0.25


Why It Matters

πŸ” Source Quality

Ensures AI-generated citations actually support the response, not random links or self-promotion.

🎓 Research Integrity

Critical for academic and research tools where citations must be relevant and authoritative.

✅ User Trust

Users expect citations to be helpful. Irrelevant citations damage credibility and waste time.


Quick Reference

TL;DR

Citation Relevancy = Are the citations actually relevant to the user's question?

  • Use it when: AI responses include citations that need quality validation
  • Score interpretation: Higher = more citations are relevant
  • Key feature: Supports multiple citation formats (URLs, DOIs, academic)