Contextual Relevancy

Evaluate if retrieved context is relevant to the user's query
LLM-Powered · Knowledge · Single Turn · Retrieval

At a Glance

🎯
Score Range
0.0 ──────── 1.0
Ratio of relevant chunks
⚑
Default Threshold
0.5
Pass/fail cutoff
📋
Required Inputs
query · retrieved_content
No answer required

What It Measures

Contextual Relevancy evaluates whether the retrieved context chunks are relevant to the user's query. It measures retrieval quality independently of generation, answering: "Did we retrieve the right documents?"

Score Interpretation
1.0 All retrieved chunks are relevant
0.7+ Most chunks relevant, some noise
0.5 Mixed relevance: half the chunks are helpful
< 0.5 Mostly irrelevant retrieval
✅ Use When
  • Evaluating RAG retrieval quality
  • Tuning vector search parameters
  • Debugging poor answer quality
  • Comparing retrieval strategies
❌ Don't Use When
  • No retrieval component exists
  • Evaluating answer quality (use Faithfulness)
  • All chunks are from same document
  • Retrieval is keyword-based only
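The pass/fail cutoff above can be sketched in a few lines. The 0.5 default comes from the documented threshold; the helper name is illustrative, not part of axion's API:

```python
def passes_threshold(relevant_chunks: int, total_chunks: int, threshold: float = 0.5) -> bool:
    """Return True if the relevancy ratio meets the pass/fail cutoff."""
    if total_chunks == 0:
        return False  # nothing retrieved: treat as a failing retrieval
    return relevant_chunks / total_chunks >= threshold

print(passes_threshold(2, 3))  # True: 0.67 >= 0.5
print(passes_threshold(1, 4))  # False: 0.25 < 0.5
```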

RAG Evaluation Suite

Contextual Relevancy asks: "Are the retrieved chunks relevant to the query?"

How It Works

The metric evaluates each retrieved chunk's relevance to the query.

Step-by-Step Process

flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[Retrieved Chunks]
    end

    subgraph EVALUATE["⚖️ Step 1: Relevancy Check"]
        C[RAGAnalyzer Engine]
        D1["Chunk 1: ✓/✗"]
        D2["Chunk 2: ✓/✗"]
        D3["Chunk 3: ✓/✗"]
        DN["..."]
    end

    subgraph SCORE["📊 Step 2: Scoring"]
        E["Count Relevant Chunks"]
        F["Calculate Ratio"]
        G["Final Score"]
    end

    A & B --> C
    C --> D1 & D2 & D3 & DN
    D1 & D2 & D3 & DN --> E
    E --> F
    F --> G

    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EVALUATE stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style G fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff

Each chunk receives a binary relevance verdict.

✅ RELEVANT (1): Chunk contains information useful for answering the query.

❌ IRRELEVANT (0): Chunk is off-topic or doesn't help answer the query.

Score Formula

score = relevant_chunks / total_chunks
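The two-step process above (binary verdict per chunk, then the ratio) can be sketched directly. The `is_relevant` judge is a stand-in for the LLM-powered RAGAnalyzer call; the keyword check here is purely illustrative:

```python
from typing import Callable, List

def contextual_relevancy(
    query: str,
    chunks: List[str],
    is_relevant: Callable[[str, str], bool],
) -> float:
    """score = relevant_chunks / total_chunks, per the formula above."""
    if not chunks:
        return 0.0
    verdicts = [is_relevant(query, chunk) for chunk in chunks]  # Step 1: binary verdicts
    return sum(verdicts) / len(verdicts)                        # Step 2: ratio

# Toy judge: shares any non-trivial word with the query (NOT the real LLM check).
def keyword_judge(query: str, chunk: str) -> bool:
    query_words = {w.lower().strip("?.,") for w in query.split() if len(w) > 3}
    chunk_words = {w.lower().strip("?.,") for w in chunk.split()}
    return bool(query_words & chunk_words)

score = contextual_relevancy(
    "What is the capital of France?",
    ["Paris is the capital and largest city of France.",
     "The Eiffel Tower was built in 1889."],
    keyword_judge,
)
```

In production the judge is an LLM call, but the scoring arithmetic is exactly this ratio.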

Configuration

Parameter Type Default Description
mode EvaluationMode GRANULAR Evaluation detail level

Shared Cache

Contextual Relevancy shares an internal cache with other contextual metrics. Running multiple retrieval metrics together is efficient.
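One way to picture this cache (an illustrative sketch, not axion's actual implementation): per-chunk verdicts are memoized on the (query, chunk) pair, so a second contextual metric in the same run reuses them instead of re-calling the judge:

```python
from functools import lru_cache

CALLS = 0  # counts how many times the (expensive) judge actually runs

@lru_cache(maxsize=None)
def judge(query: str, chunk: str) -> bool:
    """Stand-in for an LLM relevance call; results cached on (query, chunk)."""
    global CALLS
    CALLS += 1
    return "capital" in chunk.lower()

chunks = ("Paris is the capital of France.", "France is known for wine.")
first = [judge("capital of France?", c) for c in chunks]   # 2 judge calls
second = [judge("capital of France?", c) for c in chunks]  # 0 new calls: cache hits
print(CALLS)  # 2
```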


Code Examples

Basic usage:

from axion.metrics import ContextualRelevancy
from axion.dataset import DatasetItem

metric = ContextualRelevancy()

item = DatasetItem(
    query="What is the capital of France?",
    retrieved_content=[
        "Paris is the capital and largest city of France.",
        "France is known for its wine and cuisine.",
        "The Eiffel Tower was built in 1889.",
    ],
)

result = await metric.execute(item)
print(result.pretty())
# Score: 0.67 (2 of 3 chunks relevant)

Compare retrieval strategies:

from axion.metrics import ContextualRelevancy
from axion.runners import MetricRunner

metric = ContextualRelevancy()
runner = MetricRunner(metrics=[metric])

# Compare two retrieval strategies
results_v1 = await runner.run(dataset_with_bm25)
results_v2 = await runner.run(dataset_with_embeddings)

avg_v1 = sum(r.score for r in results_v1) / len(results_v1)
avg_v2 = sum(r.score for r in results_v2) / len(results_v2)

print(f"BM25 Relevancy: {avg_v1:.2f}")
print(f"Embedding Relevancy: {avg_v2:.2f}")

Inspect per-chunk verdicts:

from axion.metrics import ContextualRelevancy
from axion.runners import MetricRunner

metric = ContextualRelevancy()
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)

for item_result in results:
    print(f"Score: {item_result.score}")
    print(f"Relevant: {item_result.signals.relevant_chunks}/{item_result.signals.total_chunks}")
    for i, chunk in enumerate(item_result.signals.chunk_breakdown):
        status = "✅" if chunk.is_relevant else "❌"
        print(f"  {status} Chunk {i+1}: {chunk.chunk_text[:50]}...")

Metric Diagnostics

Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given: no black boxes.

result = await metric.execute(item)
print(result.pretty())      # Human-readable summary
result.signals              # Full diagnostic breakdown
📊 ContextualRelevancyResult Structure
ContextualRelevancyResult(
{
    "relevancy_score": 0.67,
    "total_chunks": 3,
    "relevant_chunks": 2,
    "chunk_breakdown": [
        {
            "chunk_text": "Paris is the capital and largest city of France.",
            "is_relevant": true
        },
        {
            "chunk_text": "France is known for its wine and cuisine.",
            "is_relevant": false
        },
        {
            "chunk_text": "The Eiffel Tower was built in 1889.",
            "is_relevant": true
        }
    ]
}
)

Signal Fields

Field Type Description
relevancy_score float Ratio of relevant chunks (0.0-1.0)
total_chunks int Total chunks retrieved
relevant_chunks int Number of relevant chunks
chunk_breakdown List Per-chunk verdict details

Chunk Breakdown Fields

Field Type Description
chunk_text str The retrieved chunk content
is_relevant bool Whether chunk is relevant to query
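The signal structure documented above can be mirrored with plain dataclasses. Field names are taken from the tables; the classes themselves are illustrative, not axion's own types:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ChunkVerdict:
    chunk_text: str    # the retrieved chunk content
    is_relevant: bool  # whether the chunk is relevant to the query

@dataclass
class ContextualRelevancySignals:
    relevancy_score: float  # ratio of relevant chunks (0.0-1.0)
    total_chunks: int       # total chunks retrieved
    relevant_chunks: int    # number of relevant chunks
    chunk_breakdown: List[ChunkVerdict] = field(default_factory=list)

# Populated with the diagnostic example from the structure above.
signals = ContextualRelevancySignals(
    relevancy_score=2 / 3,
    total_chunks=3,
    relevant_chunks=2,
    chunk_breakdown=[
        ChunkVerdict("Paris is the capital and largest city of France.", True),
        ChunkVerdict("France is known for its wine and cuisine.", False),
        ChunkVerdict("The Eiffel Tower was built in 1889.", True),
    ],
)
```

Note the invariant worth asserting in tests: relevant_chunks equals the count of True verdicts in chunk_breakdown, and total_chunks equals its length.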

Example Scenarios

✅ Scenario 1: High Relevancy (Score: 1.0)

All Chunks Relevant

Query:

"How does photosynthesis work?"

Retrieved Chunks:

  1. "Photosynthesis converts light energy into chemical energy."
  2. "Plants use chlorophyll to absorb sunlight."
  3. "The process produces glucose and oxygen from CO2 and water."

Analysis:

Chunk Verdict
Light energy conversion ✅ Core concept
Chlorophyll absorption ✅ Key mechanism
Glucose/oxygen production ✅ Process outputs

Final Score: 3 / 3 = 1.0

⚠️ Scenario 2: Mixed Relevancy (Score: 0.5)

Retrieval Noise

Query:

"What are the symptoms of diabetes?"

Retrieved Chunks:

  1. "Diabetes symptoms include increased thirst and frequent urination."
  2. "Exercise is important for overall health."
  3. "Blurred vision and fatigue are common in diabetic patients."
  4. "Healthy eating includes fruits and vegetables."

Analysis:

Chunk Verdict
Thirst and urination ✅ Direct symptoms
Exercise importance ❌ General health, not symptoms
Blurred vision, fatigue ✅ Diabetes symptoms
Fruits and vegetables ❌ Diet info, not symptoms

Final Score: 2 / 4 = 0.5

❌ Scenario 3: Poor Relevancy (Score: 0.0)

Retrieval Failure

Query:

"What is quantum computing?"

Retrieved Chunks:

  1. "Classical computers use binary bits."
  2. "The internet was invented in the 1960s."
  3. "Programming languages include Python and Java."

Analysis:

Chunk Verdict
Binary bits ❌ Classical computing, not quantum
Internet history ❌ Completely off-topic
Programming languages ❌ Unrelated to quantum concepts

Final Score: 0 / 3 = 0.0

Retrieval completely failed to find quantum computing content.


Why It Matters

🔍 Retrieval Quality

Identifies when your retrieval system returns irrelevant documents, causing poor answer quality.

🎯 Debug Isolation

Separates retrieval problems from generation problems. Low relevancy = fix retrieval, not the LLM.

⚑ Efficiency

Irrelevant chunks waste context window space and can confuse the generator.


Quick Reference

TL;DR

Contextual Relevancy = Are the retrieved chunks relevant to the query?

  • Use it when: Evaluating or tuning RAG retrieval
  • Score interpretation: Higher = more relevant retrieval
  • Key insight: Measures retrieval, not generation