
Contextual Ranking

Evaluate if relevant context chunks are ranked higher
LLM-Powered Knowledge Single Turn Retrieval

At a Glance

🎯
Score Range
0.0 ──────── 1.0
Precision-weighted ranking score
⚡
Default Threshold
0.5
Pass/fail cutoff
📋
Required Inputs
query retrieved_content
No expected_output needed

What It Measures

Contextual Ranking evaluates whether relevant chunks are positioned higher in retrieval results. Unlike Contextual Precision (which checks usefulness for generating an answer), Ranking simply checks query relevance—making it usable without ground truth.

Score Interpretation
≥ 0.9 Excellent ranking—relevant chunks at top
≥ 0.7 Good ranking quality
≥ 0.5 Mediocre—relevant chunks scattered
< 0.5 Poor ranking—relevant chunks buried
✅ Use When
  • No expected_output available
  • Evaluating retrieval ranking
  • Comparing re-ranking algorithms
  • Testing search relevance
❌ Don't Use When
  • You have expected_output (use Precision)
  • Chunk order doesn't matter
  • Single-chunk retrieval
  • Evaluating answer quality

Ranking vs Precision

Contextual Ranking checks: "Are relevant chunks ranked higher?" (based on the query)
Contextual Precision checks: "Are useful chunks ranked higher?" (based on the expected answer)

Use Ranking when you don't have ground truth; use Precision when you do.


How It Works

The metric evaluates chunk relevance to the query, then calculates a precision-weighted ranking score.

Step-by-Step Process

flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[Retrieved Chunks in Order]
    end

    subgraph EVALUATE["⚖️ Step 1: Relevancy Check"]
        C[RAGAnalyzer Engine]
        D1["Chunk 1: R/✗"]
        D2["Chunk 2: R/✗"]
        D3["Chunk 3: R/✗"]
        DN["..."]
    end

    subgraph RANK["📊 Step 2: Calculate Ranking Score"]
        E["For each relevant chunk at position k"]
        F["Precision@k = relevant_seen / k"]
        G["Sum all Precision@k values"]
        H["Divide by total relevant chunks"]
        I["Final Score"]
    end

    A & B --> C
    C --> D1 & D2 & D3 & DN
    D1 & D2 & D3 & DN --> E
    E --> F
    F --> G
    G --> H
    H --> I

    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EVALUATE stroke:#f59e0b,stroke-width:2px
    style RANK stroke:#10b981,stroke-width:2px
    style I fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff

The score heavily penalizes relevant chunks ranked low.

Example with 5 chunks (R = relevant, X = not relevant):

Position:  1    2    3    4    5
Chunks:   [R]  [X]  [R]  [X]  [R]

Precision@1 = 1/1 = 1.0   (first relevant at position 1)
Precision@3 = 2/3 = 0.67  (second relevant at position 3)
Precision@5 = 3/5 = 0.6   (third relevant at position 5)

Score = (1.0 + 0.67 + 0.6) / 3 = 0.76

✅ RELEVANT
+P@k

Chunk is relevant to the query. Contributes to ranking score.

❌ NOT RELEVANT
0

Chunk is off-topic. Dilutes precision at each position.

Score Formula

score = sum(Precision@k for each relevant chunk) / total_relevant_chunks
score = clamp(score, 0.0, 1.0)
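
The formula above can be sketched in plain Python, independent of the axion library, to show exactly how chunk positions translate into a score (`ranking_score` is an illustrative name, not an axion API):

```python
def ranking_score(relevance: list[bool]) -> float:
    """Precision-weighted ranking score over an ordered list of relevance verdicts."""
    precisions = []
    relevant_seen = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            precisions.append(relevant_seen / k)  # Precision@k at this position
    if not precisions:
        return 0.0  # no relevant chunks at all
    # Average the Precision@k values, clamped to [0.0, 1.0]
    return min(max(sum(precisions) / len(precisions), 0.0), 1.0)

# The five-chunk example from above: [R, X, R, X, R]
print(round(ranking_score([True, False, True, False, True]), 2))  # 0.76
```

Note that only the positions of relevant chunks enter the average; trailing irrelevant chunks cost nothing, while irrelevant chunks *before* a relevant one drag down its Precision@k.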

Configuration

Parameter Type Default Description
mode EvaluationMode GRANULAR Evaluation detail level

Interpretation Guide

Score Range Quality Recommendation
≥ 0.9 Excellent Ranking is optimal
≥ 0.7 Good Acceptable for most use cases
≥ 0.5 Mediocre Relevant chunks scattered—consider re-ranking
< 0.5 Poor Improve retrieval or re-ranking
0.0 None No relevant chunks found
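
If you want to report these bands programmatically, a small helper can map scores onto them (the function name and labels are illustrative, not part of axion):

```python
def quality_band(score: float) -> str:
    """Map a ranking score onto the interpretation bands from the guide above."""
    if score == 0.0:
        return "None"       # no relevant chunks found
    if score >= 0.9:
        return "Excellent"  # relevant chunks at the top
    if score >= 0.7:
        return "Good"       # acceptable for most use cases
    if score >= 0.5:
        return "Mediocre"   # relevant chunks scattered
    return "Poor"           # relevant chunks buried

print(quality_band(0.83))  # Good
```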

Code Examples

from axion.metrics import ContextualRanking
from axion.dataset import DatasetItem

metric = ContextualRanking()

item = DatasetItem(
    query="What is machine learning?",
    retrieved_content=[
        "Machine learning is a subset of AI.",       # Relevant
        "Python is a programming language.",         # Not relevant
        "ML models learn from data.",                # Relevant
        "The weather is nice today.",                # Not relevant
    ],
)

result = await metric.execute(item)
print(result.pretty())
# Score: (1/1 + 2/3) / 2 = 0.83

from axion.metrics import ContextualRanking
from axion.dataset import DatasetItem

metric = ContextualRanking()

# Good ranking: relevant first
good_order = DatasetItem(
    query="Benefits of exercise",
    retrieved_content=[
        "Exercise improves cardiovascular health.",   # Relevant
        "Regular workouts boost energy levels.",      # Relevant
        "Cooking is a useful skill.",                 # Not relevant
    ],
)
# Score: (1/1 + 2/2) / 2 = 1.0

# Bad ranking: relevant last
bad_order = DatasetItem(
    query="Benefits of exercise",
    retrieved_content=[
        "Cooking is a useful skill.",                 # Not relevant
        "Exercise improves cardiovascular health.",   # Relevant
        "Regular workouts boost energy levels.",      # Relevant
    ],
)
# Score: (1/2 + 2/3) / 2 = 0.58

from axion.metrics import ContextualRanking
from axion.runners import MetricRunner

metric = ContextualRanking()
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)

for item_result in results:
    print(f"Ranking Score: {item_result.score}")
    print(f"Relevant: {item_result.signals.relevant_chunks}/{item_result.signals.total_chunks}")
    for i, chunk in enumerate(item_result.signals.chunk_breakdown):
        status = "✅" if chunk.is_relevant else "❌"
        print(f"  {i+1}. {status} {chunk.chunk_text[:40]}...")

Metric Diagnostics

Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given—no black boxes.

result = await metric.execute(item)
print(result.pretty())      # Human-readable summary
result.signals              # Full diagnostic breakdown
📊 ContextualRankingResult Structure
ContextualRankingResult(
{
    "final_score": 0.83,
    "relevant_chunks": 2,
    "total_chunks": 4,
    "chunk_breakdown": [
        {
            "chunk_text": "Machine learning is a subset of AI.",
            "is_relevant": true
        },
        {
            "chunk_text": "Python is a programming language.",
            "is_relevant": false
        },
        {
            "chunk_text": "ML models learn from data.",
            "is_relevant": true
        },
        {
            "chunk_text": "The weather is nice today.",
            "is_relevant": false
        }
    ]
}
)

Signal Fields

Field Type Description
final_score float Ranking score (0.0-1.0)
relevant_chunks int Number of relevant chunks
total_chunks int Total chunks retrieved
chunk_breakdown List Per-chunk verdict details

Chunk Breakdown Fields

Field Type Description
chunk_text str The retrieved chunk content
is_relevant bool Whether chunk is relevant to query

Example Scenarios

✅ Scenario 1: Excellent Ranking (Score: 1.0)

All Relevant Chunks First

Query:

"How does photosynthesis work?"

Retrieved Context (in order):

  1. "Photosynthesis converts light energy into chemical energy." ✅
  2. "Plants use chlorophyll to absorb sunlight." ✅
  3. "Photosynthesis produces glucose and oxygen." ✅
  4. "The ocean covers 71% of Earth." ❌
  5. "Volcanic eruptions release gases." ❌

Ranking Calculation:

Relevant at positions: 1, 2, 3
P@1 = 1/1 = 1.0
P@2 = 2/2 = 1.0
P@3 = 3/3 = 1.0
Score = (1.0 + 1.0 + 1.0) / 3 = 1.0

Final Score: 1.0

⚠️ Scenario 2: Mediocre Ranking (Score: 0.5)

Relevant Chunks Scattered

Query:

"What are the benefits of meditation?"

Retrieved Context (in order):

  1. "Yoga is an ancient practice." ❌
  2. "Meditation reduces stress and anxiety." ✅
  3. "Cooking can be therapeutic." ❌
  4. "Mindfulness improves focus." ✅

Ranking Calculation:

Relevant at positions: 2, 4
P@2 = 1/2 = 0.5
P@4 = 2/4 = 0.5
Score = (0.5 + 0.5) / 2 = 0.5

Final Score: 0.5

Relevant content is not prioritized in the top positions.

❌ Scenario 3: Poor Ranking (Score: 0.33)

Relevant Chunks at Bottom

Query:

"What is the capital of Japan?"

Retrieved Context (in order):

  1. "Japan has a population of 125 million." ❌
  2. "Japanese cuisine includes sushi." ❌
  3. "Tokyo is the capital of Japan." ✅

Ranking Calculation:

Relevant at positions: 3
P@3 = 1/3 = 0.33
Score = 0.33 / 1 = 0.33

Final Score: 0.33

The only relevant chunk is last—poor ranking quality.
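
All three scenario calculations can be checked with a short standalone script that reimplements the precision-weighted formula (this does not call axion; the verdicts are hand-encoded from the scenarios above):

```python
def ranking_score(relevance):
    """Average of Precision@k over the positions of relevant chunks."""
    relevant_seen, precisions = 0, []
    for k, flag in enumerate(relevance, start=1):
        if flag:
            relevant_seen += 1
            precisions.append(relevant_seen / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

scenarios = {
    "photosynthesis":   [True, True, True, False, False],  # all relevant first
    "meditation":       [False, True, False, True],        # relevant scattered
    "capital of Japan": [False, False, True],              # relevant last
}
for name, flags in scenarios.items():
    print(f"{name}: {ranking_score(flags):.2f}")
# photosynthesis: 1.00
# meditation: 0.50
# capital of Japan: 0.33
```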


Why It Matters

🎯 No Ground Truth Needed

Evaluate ranking quality without expected answers—ideal for production monitoring.

📊 Re-ranker Evaluation

Directly measures whether your re-ranking model improves result ordering.

⚡ Context Window Efficiency

When using top-k results, good ranking ensures the best content is included.
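
A re-ranker's effect can be quantified by scoring the same relevance verdicts before and after re-ordering (standalone sketch using the precision-weighted formula, not an axion API):

```python
def ranking_score(relevance):
    """Average of Precision@k over the positions of relevant chunks."""
    relevant_seen, precisions = 0, []
    for k, flag in enumerate(relevance, start=1):
        if flag:
            relevant_seen += 1
            precisions.append(relevant_seen / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

before = [False, False, True, True]   # relevant chunks buried at the bottom
after = sorted(before, reverse=True)  # an ideal re-ranking: relevant chunks first
print(f"before: {ranking_score(before):.2f}")  # 0.42
print(f"after:  {ranking_score(after):.2f}")   # 1.00
```

The gap between the two scores is a direct measure of how much ordering quality the re-ranker recovered.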


Quick Reference

TL;DR

Contextual Ranking = Are relevant chunks positioned at the top of results?

  • Use it when: Evaluating retrieval ranking without ground truth
  • Score interpretation: Higher = relevant chunks appear earlier
  • Key difference: Uses query relevance, not answer usefulness