Contextual Relevancy¶
LLM-Powered · Knowledge · Single Turn · Retrieval
At a Glance¶
- **Score Range:** 0.0 to 1.0 (ratio of relevant chunks)
- **Default Threshold:** 0.5 (pass/fail cutoff)
- **Required Inputs:** `query`, `retrieved_content` (no generated answer required)
What It Measures
Contextual Relevancy evaluates whether the retrieved context chunks are relevant to the user's query. It measures retrieval quality independent of generation, answering: "Did we retrieve the right documents?"
| Score | Interpretation |
|---|---|
| 1.0 | All retrieved chunks are relevant |
| 0.7+ | Most chunks relevant, some noise |
| 0.5 | Mixed relevance; half helpful |
| < 0.5 | Mostly irrelevant retrieval |
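The score table above maps to a simple pass/fail decision at the default threshold. A minimal sketch, assuming a score at or above the threshold counts as a pass (check the API reference for the exact comparison the library uses):

```python
# Pass/fail decision against the default threshold of 0.5 (illustrative,
# not the library's actual implementation).
def passes(relevancy_score: float, threshold: float = 0.5) -> bool:
    """Return True when the score meets the pass/fail cutoff."""
    return relevancy_score >= threshold

print(passes(0.67))  # True: most chunks relevant
print(passes(0.33))  # False: mostly irrelevant retrieval
```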
When to Use

- Evaluating RAG retrieval quality
- Tuning vector search parameters
- Debugging poor answer quality
- Comparing retrieval strategies

When Not to Use

- No retrieval component exists
- Evaluating answer quality (use Faithfulness)
- All chunks are from the same document
- Retrieval is keyword-based only
RAG Evaluation Suite
Contextual Relevancy asks: "Are the retrieved chunks relevant to the query?"
Related retrieval metrics:
- Contextual Precision: Are relevant chunks ranked higher?
- Contextual Recall: Do chunks cover the expected answer?
- Contextual Sufficiency: Is there enough info to answer?
How It Works
The metric evaluates each retrieved chunk's relevance to the query.
Step-by-Step Process¶
```mermaid
flowchart TD
    subgraph INPUT["Inputs"]
        A[Query]
        B[Retrieved Chunks]
    end
    subgraph EVALUATE["Step 1: Relevancy Check"]
        C[RAGAnalyzer Engine]
        D1["Chunk 1: ✅/❌"]
        D2["Chunk 2: ✅/❌"]
        D3["Chunk 3: ✅/❌"]
        DN["..."]
    end
    subgraph SCORE["Step 2: Scoring"]
        E["Count Relevant Chunks"]
        F["Calculate Ratio"]
        G["Final Score"]
    end
    A & B --> C
    C --> D1 & D2 & D3 & DN
    D1 & D2 & D3 & DN --> E
    E --> F
    F --> G
    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EVALUATE stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style G fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff
```
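The scoring step reduces to a simple ratio. A runnable sketch with hard-coded verdicts (in the real metric, the per-chunk verdicts come from the LLM judge):

```python
# Per-chunk relevance verdicts, as the judge would produce them (hard-coded here).
verdicts = [True, False, True]

relevant = sum(verdicts)                               # Step 2a: count relevant chunks
score = relevant / len(verdicts) if verdicts else 0.0  # Step 2b: ratio -> final score
print(f"{relevant}/{len(verdicts)} relevant -> score {score:.2f}")
# 2/3 relevant -> score 0.67
```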
Configuration¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `mode` | `EvaluationMode` | `GRANULAR` | Evaluation detail level |
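Passing the parameter looks like this; the `EvaluationMode` import path is an assumption taken from the table above, so consult the API reference for the exact location:

```python
from axion.metrics import ContextualRelevancy
from axion.metrics import EvaluationMode  # assumed import path -- verify in the API reference

# GRANULAR is the default; shown explicitly here for clarity.
metric = ContextualRelevancy(mode=EvaluationMode.GRANULAR)
```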
Shared Cache
Contextual Relevancy shares an internal cache with other contextual metrics. Running multiple retrieval metrics together is efficient.
Code Examples¶
```python
from axion.metrics import ContextualRelevancy
from axion.dataset import DatasetItem

metric = ContextualRelevancy()

item = DatasetItem(
    query="What is the capital of France?",
    retrieved_content=[
        "Paris is the capital and largest city of France.",
        "France is known for its wine and cuisine.",
        "The Eiffel Tower was built in 1889.",
    ],
)

result = await metric.execute(item)
print(result.pretty())
# Score: 0.67 (2 of 3 chunks relevant)
```
```python
from axion.metrics import ContextualRelevancy
from axion.runners import MetricRunner

metric = ContextualRelevancy()
runner = MetricRunner(metrics=[metric])

# Compare two retrieval strategies
results_v1 = await runner.run(dataset_with_bm25)
results_v2 = await runner.run(dataset_with_embeddings)

avg_v1 = sum(r.score for r in results_v1) / len(results_v1)
avg_v2 = sum(r.score for r in results_v2) / len(results_v2)
print(f"BM25 Relevancy: {avg_v1:.2f}")
print(f"Embedding Relevancy: {avg_v2:.2f}")
```
```python
from axion.metrics import ContextualRelevancy
from axion.runners import MetricRunner

metric = ContextualRelevancy()
runner = MetricRunner(metrics=[metric])

results = await runner.run(dataset)
for item_result in results:
    print(f"Score: {item_result.score}")
    print(f"Relevant: {item_result.signals.relevant_chunks}/{item_result.signals.total_chunks}")
    for i, chunk in enumerate(item_result.signals.chunk_breakdown):
        status = "✅" if chunk.is_relevant else "❌"
        print(f"  {status} Chunk {i+1}: {chunk.chunk_text[:50]}...")
```
Metric Diagnostics¶
Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given, with no black boxes.
```python
result = await metric.execute(item)
print(result.pretty())  # Human-readable summary
result.signals          # Full diagnostic breakdown
```
ContextualRelevancyResult Structure
```python
ContextualRelevancyResult(
    {
        "relevancy_score": 0.67,
        "total_chunks": 3,
        "relevant_chunks": 2,
        "chunk_breakdown": [
            {
                "chunk_text": "Paris is the capital and largest city of France.",
                "is_relevant": true
            },
            {
                "chunk_text": "France is known for its wine and cuisine.",
                "is_relevant": false
            },
            {
                "chunk_text": "The Eiffel Tower was built in 1889.",
                "is_relevant": true
            }
        ]
    }
)
```
Signal Fields¶
| Field | Type | Description |
|---|---|---|
| `relevancy_score` | `float` | Ratio of relevant chunks (0.0-1.0) |
| `total_chunks` | `int` | Total chunks retrieved |
| `relevant_chunks` | `int` | Number of relevant chunks |
| `chunk_breakdown` | `List` | Per-chunk verdict details |
Chunk Breakdown Fields¶
| Field | Type | Description |
|---|---|---|
| `chunk_text` | `str` | The retrieved chunk content |
| `is_relevant` | `bool` | Whether the chunk is relevant to the query |
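These fields make it easy to surface retrieval noise directly from a result. A runnable sketch, using `SimpleNamespace` as a stand-in for the real signal objects so the example works without the library:

```python
from types import SimpleNamespace

# Stand-in for result.signals, mirroring the fields documented above.
signals = SimpleNamespace(
    relevancy_score=0.67,
    total_chunks=3,
    relevant_chunks=2,
    chunk_breakdown=[
        SimpleNamespace(chunk_text="Paris is the capital and largest city of France.", is_relevant=True),
        SimpleNamespace(chunk_text="France is known for its wine and cuisine.", is_relevant=False),
        SimpleNamespace(chunk_text="The Eiffel Tower was built in 1889.", is_relevant=True),
    ],
)

# Collect the chunks the judge marked irrelevant -- this is retrieval noise.
noise = [c.chunk_text for c in signals.chunk_breakdown if not c.is_relevant]
print(f"{signals.relevant_chunks}/{signals.total_chunks} relevant; noise: {noise}")
```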
Example Scenarios¶
✅ Scenario 1: High Relevancy (Score: 1.0)
All Chunks Relevant
Query:
"How does photosynthesis work?"
Retrieved Chunks:
- "Photosynthesis converts light energy into chemical energy."
- "Plants use chlorophyll to absorb sunlight."
- "The process produces glucose and oxygen from CO2 and water."
Analysis:
| Chunk | Verdict |
|---|---|
| Light energy conversion | ✅ Core concept |
| Chlorophyll absorption | ✅ Key mechanism |
| Glucose/oxygen production | ✅ Process outputs |
Final Score: 3 / 3 = 1.0
⚠️ Scenario 2: Mixed Relevancy (Score: 0.5)
Retrieval Noise
Query:
"What are the symptoms of diabetes?"
Retrieved Chunks:
- "Diabetes symptoms include increased thirst and frequent urination."
- "Exercise is important for overall health."
- "Blurred vision and fatigue are common in diabetic patients."
- "Healthy eating includes fruits and vegetables."
Analysis:
| Chunk | Verdict |
|---|---|
| Thirst and urination | ✅ Direct symptoms |
| Exercise importance | ❌ General health, not symptoms |
| Blurred vision, fatigue | ✅ Diabetes symptoms |
| Fruits and vegetables | ❌ Diet info, not symptoms |
Final Score: 2 / 4 = 0.5
❌ Scenario 3: Poor Relevancy (Score: 0.0)
Retrieval Failure
Query:
"What is quantum computing?"
Retrieved Chunks:
- "Classical computers use binary bits."
- "The internet was invented in the 1960s."
- "Programming languages include Python and Java."
Analysis:
| Chunk | Verdict |
|---|---|
| Binary bits | ❌ Classical computing, not quantum |
| Internet history | ❌ Completely off-topic |
| Programming languages | ❌ Unrelated to quantum concepts |
Final Score: 0 / 3 = 0.0
Retrieval completely failed to find quantum computing content.
Why It Matters¶
Identifies when your retrieval system returns irrelevant documents, causing poor answer quality.
Separates retrieval problems from generation problems. Low relevancy = fix retrieval, not the LLM.
Irrelevant chunks waste context window space and can confuse the generator.
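The retrieval-versus-generation separation described above can be expressed as a small triage rule. A sketch with illustrative thresholds; the faithfulness score would come from a separate metric:

```python
def diagnose(relevancy: float, faithfulness: float, threshold: float = 0.5) -> str:
    """Route a failing RAG pipeline to the component that needs fixing."""
    if relevancy < threshold:
        return "fix retrieval"   # the wrong documents came back
    if faithfulness < threshold:
        return "fix generation"  # right documents, unfaithful answer
    return "pipeline healthy"

print(diagnose(relevancy=0.3, faithfulness=0.9))  # fix retrieval
print(diagnose(relevancy=0.9, faithfulness=0.3))  # fix generation
```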
Quick Reference¶
TL;DR
Contextual Relevancy = Are the retrieved chunks relevant to the query?
- Use it when: Evaluating or tuning RAG retrieval
- Score interpretation: Higher = more relevant retrieval
- Key insight: Measures retrieval, not generation
- API Reference
- Related Metrics: Contextual Precision · Contextual Recall · Faithfulness