Contextual Ranking¶
LLM-Powered · Knowledge · Single-Turn · Retrieval
At a Glance¶
Score Range
0.0 ──────── 1.0
Precision-weighted ranking score
Default Threshold
0.5
Pass/fail cutoff
Required Inputs
query, retrieved_content
No expected_output needed
What It Measures
Contextual Ranking evaluates whether relevant chunks are positioned higher in retrieval results. Unlike Contextual Precision (which checks usefulness for generating an answer), Ranking simply checks query relevance—making it usable without ground truth.
| Score | Interpretation |
|---|---|
| ≥ 0.9 | Excellent ranking—relevant chunks at top |
| ≥ 0.7 | Good ranking quality |
| ≥ 0.5 | Mediocre—relevant chunks scattered |
| < 0.5 | Poor ranking—relevant chunks buried |
Use when:

- No expected_output available
- Evaluating retrieval ranking
- Comparing re-ranking algorithms
- Testing search relevance

Avoid when:

- You have expected_output (use Precision)
- Chunk order doesn't matter
- Single-chunk retrieval
- Evaluating answer quality
Ranking vs Precision
Contextual Ranking checks: "Are relevant chunks ranked higher?" (based on the query).
Contextual Precision checks: "Are useful chunks ranked higher?" (based on the expected answer).
Use Ranking when you don't have ground truth; use Precision when you do.
How It Works
The metric evaluates chunk relevance to the query, then calculates a precision-weighted ranking score.
Step-by-Step Process¶
```mermaid
flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[Retrieved Chunks in Order]
    end
    subgraph EVALUATE["⚖️ Step 1: Relevancy Check"]
        C[RAGAnalyzer Engine]
        D1["Chunk 1: ✓/✗"]
        D2["Chunk 2: ✓/✗"]
        D3["Chunk 3: ✓/✗"]
        DN["..."]
    end
    subgraph RANK["📊 Step 2: Calculate Ranking Score"]
        E["For each relevant chunk at position k"]
        F["Precision@k = relevant_seen / k"]
        G["Sum all Precision@k values"]
        H["Divide by total relevant chunks"]
        I["Final Score"]
    end
    A & B --> C
    C --> D1 & D2 & D3 & DN
    D1 & D2 & D3 & DN --> E
    E --> F
    F --> G
    G --> H
    H --> I
    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EVALUATE stroke:#f59e0b,stroke-width:2px
    style RANK stroke:#10b981,stroke-width:2px
    style I fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff
```
The score heavily penalizes relevant chunks ranked low.
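In formula form (my notation, not taken from the library docs): let R be the set of positions that hold relevant chunks; then

$$
\text{Score} = \frac{1}{|R|} \sum_{k \in R} \text{Precision@}k,
\qquad
\text{Precision@}k = \frac{\lvert \{\, i \in R : i \le k \,\} \rvert}{k}
$$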
Example with 5 chunks (R = relevant, X = not relevant):
```text
Position:   1    2    3    4    5
Chunks:    [R]  [X]  [R]  [X]  [R]

Precision@1 = 1/1 = 1.0   (first relevant at position 1)
Precision@3 = 2/3 = 0.67  (second relevant at position 3)
Precision@5 = 3/5 = 0.6   (third relevant at position 5)

Score = (1.0 + 0.67 + 0.6) / 3 = 0.76
```
- R: Chunk is relevant to the query; contributes to the ranking score.
- X: Chunk is off-topic; dilutes precision at each position.
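The calculation above can be sketched in plain Python. This is a hypothetical helper to illustrate the formula, not part of the axion API:

```python
def ranking_score(relevance: list) -> float:
    """Mean of Precision@k over each position k that holds a relevant chunk."""
    precisions = []
    seen = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            seen += 1
            precisions.append(seen / k)
    # No relevant chunks at all scores 0.0.
    return sum(precisions) / len(precisions) if precisions else 0.0

# The five-chunk example above: [R] [X] [R] [X] [R]
print(round(ranking_score([True, False, True, False, True]), 2))  # 0.76
```

Note that a fully irrelevant retrieval scores 0.0 rather than raising an error, matching the "No relevant chunks found" row in the interpretation guide.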
Configuration¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `mode` | `EvaluationMode` | `GRANULAR` | Evaluation detail level |
Interpretation Guide
| Score Range | Quality | Recommendation |
|---|---|---|
| ≥ 0.9 | Excellent | Ranking is optimal |
| ≥ 0.7 | Good | Acceptable for most use cases |
| < 0.7 | Poor | Consider improving re-ranking |
| 0.0 | None | No relevant chunks found |
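As a quick sketch, the guide above amounts to a simple threshold check. The thresholds are copied from the table; the function name is mine:

```python
def interpret_ranking(score: float) -> str:
    """Map a Contextual Ranking score to the quality bands in the guide."""
    if score >= 0.9:
        return "excellent"
    if score >= 0.7:
        return "good"
    if score > 0.0:
        return "poor"
    return "none"  # no relevant chunks found

print(interpret_ranking(0.83))  # good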
Code Examples¶
```python
from axion.metrics import ContextualRanking
from axion.dataset import DatasetItem

metric = ContextualRanking()

item = DatasetItem(
    query="What is machine learning?",
    retrieved_content=[
        "Machine learning is a subset of AI.",  # Relevant
        "Python is a programming language.",    # Not relevant
        "ML models learn from data.",           # Relevant
        "The weather is nice today.",           # Not relevant
    ],
)

result = await metric.execute(item)  # run inside an async context
print(result.pretty())
# Score: (1/1 + 2/3) / 2 = 0.83
```
```python
from axion.metrics import ContextualRanking
from axion.dataset import DatasetItem

metric = ContextualRanking()

# Good ranking: relevant chunks first
good_order = DatasetItem(
    query="Benefits of exercise",
    retrieved_content=[
        "Exercise improves cardiovascular health.",  # Relevant
        "Regular workouts boost energy levels.",     # Relevant
        "Cooking is a useful skill.",                # Not relevant
    ],
)
# Score: (1/1 + 2/2) / 2 = 1.0

# Bad ranking: relevant chunks last
bad_order = DatasetItem(
    query="Benefits of exercise",
    retrieved_content=[
        "Cooking is a useful skill.",                # Not relevant
        "Exercise improves cardiovascular health.",  # Relevant
        "Regular workouts boost energy levels.",     # Relevant
    ],
)
# Score: (1/2 + 2/3) / 2 = 0.58
```
```python
from axion.metrics import ContextualRanking
from axion.runners import MetricRunner

metric = ContextualRanking()
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)  # run inside an async context

for item_result in results:
    print(f"Ranking Score: {item_result.score}")
    print(f"Relevant: {item_result.signals.relevant_chunks}/{item_result.signals.total_chunks}")
    for i, chunk in enumerate(item_result.signals.chunk_breakdown):
        status = "✅" if chunk.is_relevant else "❌"
        print(f"  {i+1}. {status} {chunk.chunk_text[:40]}...")
```
Metric Diagnostics¶
Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given—no black boxes.
```python
result = await metric.execute(item)
print(result.pretty())  # Human-readable summary
result.signals          # Full diagnostic breakdown
```
📊 ContextualRankingResult Structure
```python
ContextualRankingResult(
    {
        "final_score": 0.83,
        "relevant_chunks": 2,
        "total_chunks": 4,
        "chunk_breakdown": [
            {
                "chunk_text": "Machine learning is a subset of AI.",
                "is_relevant": true
            },
            {
                "chunk_text": "Python is a programming language.",
                "is_relevant": false
            },
            {
                "chunk_text": "ML models learn from data.",
                "is_relevant": true
            },
            {
                "chunk_text": "The weather is nice today.",
                "is_relevant": false
            }
        ]
    }
)
```
Signal Fields¶
| Field | Type | Description |
|---|---|---|
| `final_score` | `float` | Ranking score (0.0-1.0) |
| `relevant_chunks` | `int` | Number of relevant chunks |
| `total_chunks` | `int` | Total chunks retrieved |
| `chunk_breakdown` | `List` | Per-chunk verdict details |
Chunk Breakdown Fields¶
| Field | Type | Description |
|---|---|---|
| `chunk_text` | `str` | The retrieved chunk content |
| `is_relevant` | `bool` | Whether the chunk is relevant to the query |
Example Scenarios¶
✅ Scenario 1: Excellent Ranking (Score: 1.0)
All Relevant Chunks First
Query:
"How does photosynthesis work?"
Retrieved Context (in order):
- "Photosynthesis converts light energy into chemical energy." ✅
- "Plants use chlorophyll to absorb sunlight." ✅
- "Photosynthesis produces glucose and oxygen." ✅
- "The ocean covers 71% of Earth." ❌
- "Volcanic eruptions release gases." ❌
Ranking Calculation:

```text
Relevant at positions: 1, 2, 3
P@1 = 1/1 = 1.0
P@2 = 2/2 = 1.0
P@3 = 3/3 = 1.0
Score = (1.0 + 1.0 + 1.0) / 3 = 1.0
```
Final Score: 1.0
⚠️ Scenario 2: Mediocre Ranking (Score: 0.5)
Relevant Chunks Scattered
Query:
"What are the benefits of meditation?"
Retrieved Context (in order):
- "Yoga is an ancient practice." ❌
- "Meditation reduces stress and anxiety." ✅
- "Cooking can be therapeutic." ❌
- "Mindfulness improves focus." ✅
Ranking Calculation:

```text
Relevant at positions: 2, 4
P@2 = 1/2 = 0.5
P@4 = 2/4 = 0.5
Score = (0.5 + 0.5) / 2 = 0.5
```

Final Score: 0.5
Relevant content not prioritized at top positions.
❌ Scenario 3: Poor Ranking (Score: 0.33)
Relevant Chunks at Bottom
Query:
"What is the capital of Japan?"
Retrieved Context (in order):
- "Japan has a population of 125 million." ❌
- "Japanese cuisine includes sushi." ❌
- "Tokyo is the capital of Japan." ✅
Ranking Calculation:

```text
Relevant at position: 3
P@3 = 1/3 = 0.33
Score = 0.33 / 1 = 0.33
```

Final Score: 0.33
The only relevant chunk is last—poor ranking quality.
Why It Matters¶
- Evaluate ranking quality without expected answers—ideal for production monitoring.
- Directly measures whether your re-ranking model improves result ordering.
- When using top-k results, good ranking ensures the best content is included.
Quick Reference¶
TL;DR
Contextual Ranking = Are relevant chunks positioned at the top of results?
- Use it when: Evaluating retrieval ranking without ground truth
- Score interpretation: Higher = relevant chunks appear earlier
- Key difference: Uses query relevance, not answer usefulness
- API Reference
- Related Metrics: Contextual Precision · Contextual Relevancy · Faithfulness