
Contextual Ranking

Evaluate if relevant context chunks are ranked higher
LLM-Powered Knowledge Single Turn Retrieval

At a Glance

🎯
Score Range
0.0 ──────── 1.0
Precision-weighted ranking score
⚡
Default Threshold
0.5
Pass/fail cutoff
📋
Required Inputs
query retrieved_content
No expected_output needed

What It Measures

Contextual Ranking evaluates whether relevant chunks are positioned higher in retrieval results. Unlike Contextual Precision (which checks usefulness for generating an answer), Ranking simply checks query relevance—making it usable without ground truth.

Score Interpretation
≥ 0.9 Excellent ranking—relevant chunks at top
≥ 0.7 Good ranking quality
≥ 0.5 Mediocre—relevant chunks scattered
< 0.5 Poor ranking—relevant chunks buried
✅ Use When
  • No expected_output available
  • Evaluating retrieval ranking
  • Comparing re-ranking algorithms
  • Testing search relevance
❌ Don't Use When
  • You have expected_output (use Precision)
  • Chunk order doesn't matter
  • Single-chunk retrieval
  • Evaluating answer quality

Ranking vs Precision

Contextual Ranking checks: "Are relevant chunks ranked higher?" (based on the query)
Contextual Precision checks: "Are useful chunks ranked higher?" (based on the expected answer)

Use Ranking when you don't have ground truth; use Precision when you do.


How It Works

The metric evaluates chunk relevance to the query, then calculates a precision-weighted ranking score.

Step-by-Step Process

flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[Retrieved Chunks in Order]
    end

    subgraph EVALUATE["⚖️ Step 1: Relevancy Check"]
        C[RAGAnalyzer Engine]
        D1["Chunk 1: R/✗"]
        D2["Chunk 2: R/✗"]
        D3["Chunk 3: R/✗"]
        DN["..."]
    end

    subgraph RANK["📊 Step 2: Calculate Ranking Score"]
        E["For each relevant chunk at position k"]
        F["Precision@k = relevant_seen / k"]
        G["Sum all Precision@k values"]
        H["Divide by total relevant chunks"]
        I["Final Score"]
    end

    A & B --> C
    C --> D1 & D2 & D3 & DN
    D1 & D2 & D3 & DN --> E
    E --> F
    F --> G
    G --> H
    H --> I

    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EVALUATE stroke:#f59e0b,stroke-width:2px
    style RANK stroke:#10b981,stroke-width:2px
    style I fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff

The score heavily penalizes relevant chunks ranked low.

Example with 5 chunks (R = relevant, X = not relevant):

Position:  1    2    3    4    5
Chunks:   [R]  [X]  [R]  [X]  [R]

Precision@1 = 1/1 = 1.0   (first relevant at position 1)
Precision@3 = 2/3 = 0.67  (second relevant at position 3)
Precision@5 = 3/5 = 0.6   (third relevant at position 5)

Score = (1.0 + 0.67 + 0.6) / 3 = 0.76

✅ RELEVANT
+P@k

Chunk is relevant to the query. Contributes to ranking score.

❌ NOT RELEVANT
0

Chunk is off-topic. Dilutes precision at each position.

Score Formula

score = sum(Precision@k for each relevant chunk) / total_relevant_chunks
score = clamp(score, 0.0, 1.0)
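
The formula above can be sketched in plain Python, independent of the axion library, to show exactly how chunk positions translate into a score (`ranking_score` is an illustrative name, not an axion API):

```python
def ranking_score(relevance: list[bool]) -> float:
    """Precision-weighted ranking score over an ordered list of relevance verdicts."""
    precisions = []
    relevant_seen = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            precisions.append(relevant_seen / k)  # Precision@k at this position
    if not precisions:
        return 0.0  # no relevant chunks at all
    # Average the Precision@k values, clamped to [0.0, 1.0]
    return min(max(sum(precisions) / len(precisions), 0.0), 1.0)

# The five-chunk example from above: [R, X, R, X, R]
print(round(ranking_score([True, False, True, False, True]), 2))  # 0.76
```

Note that only the positions of relevant chunks enter the average; trailing irrelevant chunks cost nothing, while irrelevant chunks *before* a relevant one drag down its Precision@k.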

Configuration

Parameter Type Default Description
mode EvaluationMode GRANULAR Evaluation detail level

Interpretation Guide

Score Range Quality Recommendation
≥ 0.9 Excellent Ranking is optimal
≥ 0.7 Good Acceptable for most use cases
≥ 0.5 Mediocre Relevant chunks scattered—consider re-ranking
< 0.5 Poor Improve retrieval or re-ranking
0.0 None No relevant chunks found
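
If you want to report these bands programmatically, a small helper can map scores onto them (the function name and labels are illustrative, not part of axion):

```python
def quality_band(score: float) -> str:
    """Map a ranking score onto the interpretation bands from the guide above."""
    if score == 0.0:
        return "None"       # no relevant chunks found
    if score >= 0.9:
        return "Excellent"  # relevant chunks at the top
    if score >= 0.7:
        return "Good"       # acceptable for most use cases
    if score >= 0.5:
        return "Mediocre"   # relevant chunks scattered
    return "Poor"           # relevant chunks buried

print(quality_band(0.83))  # Good
```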

Code Examples

from axion.metrics import ContextualRanking
from axion.dataset import DatasetItem

metric = ContextualRanking()

item = DatasetItem(
    query="What is machine learning?",
    retrieved_content=[
        "Machine learning is a subset of AI.",       # Relevant
        "Python is a programming language.",         # Not relevant
        "ML models learn from data.",                # Relevant
        "The weather is nice today.",                # Not relevant
    ],
)

result = await metric.execute(item)
print(result.pretty())
# Score: (1/1 + 2/3) / 2 = 0.83

from axion.metrics import ContextualRanking
from axion.dataset import DatasetItem

metric = ContextualRanking()

# Good ranking: relevant first
good_order = DatasetItem(
    query="Benefits of exercise",
    retrieved_content=[
        "Exercise improves cardiovascular health.",   # Relevant
        "Regular workouts boost energy levels.",      # Relevant
        "Cooking is a useful skill.",                 # Not relevant
    ],
)
# Score: (1/1 + 2/2) / 2 = 1.0

# Bad ranking: relevant last
bad_order = DatasetItem(
    query="Benefits of exercise",
    retrieved_content=[
        "Cooking is a useful skill.",                 # Not relevant
        "Exercise improves cardiovascular health.",   # Relevant
        "Regular workouts boost energy levels.",      # Relevant
    ],
)
# Score: (1/2 + 2/3) / 2 = 0.58

from axion.metrics import ContextualRanking
from axion.runners import MetricRunner

metric = ContextualRanking()
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)

for item_result in results:
    print(f"Ranking Score: {item_result.score}")
    print(f"Relevant: {item_result.signals.relevant_chunks}/{item_result.signals.total_chunks}")
    for i, chunk in enumerate(item_result.signals.chunk_breakdown):
        status = "✅" if chunk.is_relevant else "❌"
        print(f"  {i+1}. {status} {chunk.chunk_text[:40]}...")

Metric Diagnostics

Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given—no black boxes.

result = await metric.execute(item)
print(result.pretty())      # Human-readable summary
result.signals              # Full diagnostic breakdown
📊 ContextualRankingResult Structure
ContextualRankingResult(
{
    "final_score": 0.83,
    "relevant_chunks": 2,
    "total_chunks": 4,
    "chunk_breakdown": [
        {
            "chunk_text": "Machine learning is a subset of AI.",
            "is_relevant": true
        },
        {
            "chunk_text": "Python is a programming language.",
            "is_relevant": false
        },
        {
            "chunk_text": "ML models learn from data.",
            "is_relevant": true
        },
        {
            "chunk_text": "The weather is nice today.",
            "is_relevant": false
        }
    ]
}
)

Signal Fields

Field Type Description
final_score float Ranking score (0.0-1.0)
relevant_chunks int Number of relevant chunks
total_chunks int Total chunks retrieved
chunk_breakdown List Per-chunk verdict details

Chunk Breakdown Fields

Field Type Description
chunk_text str The retrieved chunk content
is_relevant bool Whether chunk is relevant to query

Example Scenarios

✅ Scenario 1: Excellent Ranking (Score: 1.0)

All Relevant Chunks First

Query:

"How does photosynthesis work?"

Retrieved Context (in order):

  1. "Photosynthesis converts light energy into chemical energy." ✅
  2. "Plants use chlorophyll to absorb sunlight." ✅
  3. "Photosynthesis produces glucose and oxygen." ✅
  4. "The ocean covers 71% of Earth." ❌
  5. "Volcanic eruptions release gases." ❌

Ranking Calculation:

Relevant at positions: 1, 2, 3
P@1 = 1/1 = 1.0
P@2 = 2/2 = 1.0
P@3 = 3/3 = 1.0
Score = (1.0 + 1.0 + 1.0) / 3 = 1.0

Final Score: 1.0

⚠️ Scenario 2: Mediocre Ranking (Score: 0.5)

Relevant Chunks Scattered

Query:

"What are the benefits of meditation?"

Retrieved Context (in order):

  1. "Yoga is an ancient practice." ❌
  2. "Meditation reduces stress and anxiety." ✅
  3. "Cooking can be therapeutic." ❌
  4. "Mindfulness improves focus." ✅

Ranking Calculation:

Relevant at positions: 2, 4
P@2 = 1/2 = 0.5
P@4 = 2/4 = 0.5
Score = (0.5 + 0.5) / 2 = 0.5

Final Score: 0.5

Relevant content is not prioritized in the top positions.

❌ Scenario 3: Poor Ranking (Score: 0.33)

Relevant Chunks at Bottom

Query:

"What is the capital of Japan?"

Retrieved Context (in order):

  1. "Japan has a population of 125 million." ❌
  2. "Japanese cuisine includes sushi." ❌
  3. "Tokyo is the capital of Japan." ✅

Ranking Calculation:

Relevant at positions: 3
P@3 = 1/3 = 0.33
Score = 0.33 / 1 = 0.33

Final Score: 0.33

The only relevant chunk is last—poor ranking quality.
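
All three scenario calculations can be checked with a short standalone script that reimplements the precision-weighted formula (this does not call axion; the verdicts are hand-encoded from the scenarios above):

```python
def ranking_score(relevance):
    """Average of Precision@k over the positions of relevant chunks."""
    relevant_seen, precisions = 0, []
    for k, flag in enumerate(relevance, start=1):
        if flag:
            relevant_seen += 1
            precisions.append(relevant_seen / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

scenarios = {
    "photosynthesis":   [True, True, True, False, False],  # all relevant first
    "meditation":       [False, True, False, True],        # relevant scattered
    "capital of Japan": [False, False, True],              # relevant last
}
for name, flags in scenarios.items():
    print(f"{name}: {ranking_score(flags):.2f}")
# photosynthesis: 1.00
# meditation: 0.50
# capital of Japan: 0.33
```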


Why It Matters

🎯 No Ground Truth Needed

Evaluate ranking quality without expected answers—ideal for production monitoring.

📊 Re-ranker Evaluation

Directly measures whether your re-ranking model improves result ordering.

⚡ Context Window Efficiency

When using top-k results, good ranking ensures the best content is included.
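
A re-ranker's effect can be quantified by scoring the same relevance verdicts before and after re-ordering (standalone sketch using the precision-weighted formula, not an axion API):

```python
def ranking_score(relevance):
    """Average of Precision@k over the positions of relevant chunks."""
    relevant_seen, precisions = 0, []
    for k, flag in enumerate(relevance, start=1):
        if flag:
            relevant_seen += 1
            precisions.append(relevant_seen / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

before = [False, False, True, True]   # relevant chunks buried at the bottom
after = sorted(before, reverse=True)  # an ideal re-ranking: relevant chunks first
print(f"before: {ranking_score(before):.2f}")  # 0.42
print(f"after:  {ranking_score(after):.2f}")   # 1.00
```

The gap between the two scores is a direct measure of how much ordering quality the re-ranker recovered.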


Quick Reference

TL;DR

Contextual Ranking = Are relevant chunks positioned at the top of results?

  • Use it when: Evaluating retrieval ranking without ground truth
  • Score interpretation: Higher = relevant chunks appear earlier
  • Key difference: Uses query relevance, not answer usefulness