Contextual Precision¶
LLM-Powered · Knowledge · Single-Turn · Retrieval
At a Glance¶
- **Score Range:** 0.0 – 1.0 (Mean Average Precision)
- **Default Threshold:** 0.5 (pass/fail cutoff)
- **Required Inputs:** `query`, `expected_output`, `retrieved_content` (ground truth + context)
What It Measures¶
Contextual Precision evaluates whether useful chunks are ranked higher in the retrieval results. Using Mean Average Precision (MAP), it rewards retrieval systems that place the most helpful documents at the top. A useful chunk is one that contributes to generating the expected answer.
| Score | Interpretation |
|---|---|
| 1.0 | All useful chunks at the top |
| 0.7+ | Good ranking, useful chunks near top |
| 0.5 | Mixed ranking quality |
| < 0.5 | Useful chunks buried low in results |
**Use when:**

- Evaluating retrieval ranking quality
- Tuning re-ranking algorithms
- Testing with limited context windows
- Optimizing for top-k retrieval

**Avoid when:**

- No `expected_output` is available
- Chunk order doesn't matter
- All retrieved chunks are used equally
- Retrieval returns only a single chunk
RAG Evaluation Suite
Contextual Precision asks: "Are the most useful chunks ranked first?"
Related retrieval metrics:
- Contextual Relevancy: Are chunks relevant to the query?
- Contextual Recall: Do chunks cover the expected answer?
- Contextual Ranking: Are relevant chunks ranked higher?
How It Works¶
The metric evaluates chunk usefulness for generating the expected answer, then calculates MAP.
Step-by-Step Process¶
```mermaid
flowchart TD
    subgraph INPUT["Inputs"]
        A[Query]
        B[Expected Output]
        C[Retrieved Chunks in Order]
    end
    subgraph EVALUATE["Step 1: Usefulness Check"]
        D[RAGAnalyzer Engine]
        E1["Chunk 1: U/✗"]
        E2["Chunk 2: U/✗"]
        E3["Chunk 3: U/✗"]
        EN["..."]
    end
    subgraph MAP["Step 2: Calculate MAP"]
        F["For each useful chunk at position k"]
        G["Precision@k = useful_seen / k"]
        H["Average all Precision@k values"]
        I["Final MAP Score"]
    end
    A & B & C --> D
    D --> E1 & E2 & E3 & EN
    E1 & E2 & E3 & EN --> F
    F --> G
    G --> H
    H --> I
    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EVALUATE stroke:#f59e0b,stroke-width:2px
    style MAP stroke:#10b981,stroke-width:2px
    style I fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff
```
Mean Average Precision rewards useful chunks appearing early in the ranking.
Example with 5 chunks (U = useful, X = not useful):

```
Position:     1    2    3    4    5
Chunks:      [U]  [X]  [U]  [X]  [U]

Precision@1 = 1/1 = 1.0   (first useful chunk at position 1)
Precision@3 = 2/3 = 0.67  (second useful chunk at position 3)
Precision@5 = 3/5 = 0.6   (third useful chunk at position 5)

MAP = (1.0 + 0.67 + 0.6) / 3 = 0.76
```
- **U (useful):** the chunk helps generate the expected answer and contributes to the MAP score.
- **X (not useful):** the chunk doesn't contribute to the answer and dilutes precision.
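The calculation above can be sketched in a few lines of plain Python. This is an illustrative implementation of the MAP formula, not the library's internal code:

```python
def mean_average_precision(verdicts: list[bool]) -> float:
    """MAP over an ordered list of usefulness verdicts.

    verdicts[k-1] is True if the chunk at rank k is useful.
    """
    useful_seen = 0
    precisions = []
    for k, is_useful in enumerate(verdicts, start=1):
        if is_useful:
            useful_seen += 1
            precisions.append(useful_seen / k)  # Precision@k at each useful rank
    # No useful chunks at all -> score 0.0
    return sum(precisions) / len(precisions) if precisions else 0.0


# The [U] [X] [U] [X] [U] example from above:
print(round(mean_average_precision([True, False, True, False, True]), 2))  # 0.76
```

Note that only ranks holding useful chunks contribute a Precision@k term; not-useful chunks lower the score indirectly by inflating `k` for every useful chunk that follows them.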
Configuration¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `mode` | `EvaluationMode` | `GRANULAR` | Evaluation detail level |
Usefulness vs Relevancy
- Relevancy (Contextual Relevancy): "Is this chunk about the topic?"
- Usefulness (Contextual Precision): "Does this chunk help generate the correct answer?"
A chunk can be relevant but not useful (e.g., background info that isn't needed).
Code Examples¶
```python
from axion.metrics import ContextualPrecision
from axion.dataset import DatasetItem

metric = ContextualPrecision()

item = DatasetItem(
    query="Who invented the telephone?",
    expected_output="Alexander Graham Bell invented the telephone in 1876.",
    retrieved_content=[
        "Alexander Graham Bell invented the telephone.",  # Useful
        "The telephone revolutionized communication.",    # Not useful
        "Bell patented it in 1876.",                      # Useful
    ],
)

result = await metric.execute(item)
print(result.pretty())
# MAP: (1/1 + 2/3) / 2 = 0.83
```
```python
from axion.metrics import ContextualPrecision
from axion.dataset import DatasetItem

metric = ContextualPrecision()

# Perfect ranking: useful chunks first
perfect_order = DatasetItem(
    query="What is Python?",
    expected_output="Python is a programming language created by Guido van Rossum.",
    retrieved_content=[
        "Python is a high-level programming language.",  # Useful (pos 1)
        "Guido van Rossum created Python.",              # Useful (pos 2)
        "Programming is fun.",                           # Not useful
    ],
)
# MAP = (1/1 + 2/2) / 2 = 1.0

# Poor ranking: useful chunks last
poor_order = DatasetItem(
    query="What is Python?",
    expected_output="Python is a programming language created by Guido van Rossum.",
    retrieved_content=[
        "Programming is fun.",                           # Not useful
        "Python is a high-level programming language.",  # Useful (pos 2)
        "Guido van Rossum created Python.",              # Useful (pos 3)
    ],
)
# MAP = (1/2 + 2/3) / 2 = 0.58
```
```python
from axion.metrics import ContextualPrecision
from axion.runners import MetricRunner

metric = ContextualPrecision()
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)

for item_result in results:
    print(f"MAP Score: {item_result.score}")
    print(f"Useful chunks: {item_result.signals.useful_chunks}/{item_result.signals.total_chunks}")
    print(f"First useful at position: {item_result.signals.first_useful_position}")
    for i, chunk in enumerate(item_result.signals.chunk_breakdown):
        status = "✓" if chunk.is_useful else "✗"
        print(f"  {i+1}. {status} {chunk.chunk_text[:40]}...")
```
Metric Diagnostics¶
Every evaluation is fully interpretable. Access detailed diagnostic results via `result.signals` to understand exactly why a score was given, with no black boxes.
```python
result = await metric.execute(item)
print(result.pretty())  # Human-readable summary
result.signals          # Full diagnostic breakdown
```
ContextualPrecisionResult Structure
```
ContextualPrecisionResult(
    {
        "map_score": 0.83,
        "total_chunks": 3,
        "useful_chunks": 2,
        "first_useful_position": 1,
        "chunk_breakdown": [
            {
                "chunk_text": "Alexander Graham Bell invented the telephone.",
                "is_useful": true,
                "position": 1
            },
            {
                "chunk_text": "The telephone revolutionized communication.",
                "is_useful": false,
                "position": 2
            },
            {
                "chunk_text": "Bell patented it in 1876.",
                "is_useful": true,
                "position": 3
            }
        ]
    }
)
```
Signal Fields¶
| Field | Type | Description |
|---|---|---|
| `map_score` | `float` | Mean Average Precision (0.0–1.0) |
| `total_chunks` | `int` | Total chunks retrieved |
| `useful_chunks` | `int` | Chunks useful for generating the answer |
| `first_useful_position` | `int` | Rank of the first useful chunk |
| `chunk_breakdown` | `List` | Per-chunk verdict details |
Chunk Breakdown Fields¶
| Field | Type | Description |
|---|---|---|
| `chunk_text` | `str` | The retrieved chunk content |
| `is_useful` | `bool` | Whether the chunk helps generate the expected answer |
| `position` | `int` | Rank position in retrieval results |
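Because `chunk_breakdown` carries per-chunk verdicts and positions, you can recompute the MAP score from it as a sanity check. The sketch below uses a plain dict shaped like the documented structure rather than a live result object:

```python
# A plain dict mirroring the documented ContextualPrecisionResult fields.
signals = {
    "map_score": 0.83,
    "chunk_breakdown": [
        {"chunk_text": "Alexander Graham Bell invented the telephone.",
         "is_useful": True, "position": 1},
        {"chunk_text": "The telephone revolutionized communication.",
         "is_useful": False, "position": 2},
        {"chunk_text": "Bell patented it in 1876.",
         "is_useful": True, "position": 3},
    ],
}

# Walk chunks in rank order, accumulating Precision@k at each useful rank.
useful_seen, precisions = 0, []
for chunk in sorted(signals["chunk_breakdown"], key=lambda c: c["position"]):
    if chunk["is_useful"]:
        useful_seen += 1
        precisions.append(useful_seen / chunk["position"])

recomputed = sum(precisions) / len(precisions)
print(round(recomputed, 2))  # 0.83, matching the reported map_score
```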
Example Scenarios¶
✅ Scenario 1: Perfect Precision (Score: 1.0)
Useful Chunks Ranked First
Query:
"What are the three states of matter?"
Expected Output:
"The three states of matter are solid, liquid, and gas."
Retrieved Context (in order):
- "Matter exists in three states: solid, liquid, and gas." ✅
- "Solids have fixed shape, liquids take container shape." ✅
- "Gases expand to fill available space." ✅
- "Matter is anything that has mass." ❌
- "Chemistry is the study of matter." ❌
MAP Calculation:

```
Useful at positions: 1, 2, 3
P@1 = 1/1 = 1.0
P@2 = 2/2 = 1.0
P@3 = 3/3 = 1.0
MAP = (1.0 + 1.0 + 1.0) / 3 = 1.0
```

Final Score: 1.0
⚠️ Scenario 2: Mixed Precision (Score: 0.42)
Useful Chunks Buried
Query:
"Who wrote Romeo and Juliet?"
Expected Output:
"William Shakespeare wrote Romeo and Juliet."
Retrieved Context (in order):
- "Shakespeare was born in Stratford-upon-Avon." ❌
- "Romeo and Juliet is a famous tragedy." ❌
- "William Shakespeare wrote Romeo and Juliet." ✅
- "The play was written in the 1590s." ✅
MAP Calculation:

```
Useful at positions: 3, 4
P@3 = 1/3 = 0.33
P@4 = 2/4 = 0.5
MAP = (0.33 + 0.5) / 2 = 0.42
```

Final Score: 0.42
Key information buried at positions 3-4 instead of 1-2.
❌ Scenario 3: Poor Precision (Score: 0.2)
Useful Chunk at Bottom
Query:
"What is the speed of light?"
Expected Output:
"The speed of light is approximately 299,792 km/s."
Retrieved Context (in order):
- "Light is a form of electromagnetic radiation." ❌
- "Light travels in waves." ❌
- "Light can be refracted through prisms." ❌
- "Light behaves as both particles and waves." ❌
- "The speed of light is about 300,000 km/s in vacuum." ✅
MAP Calculation:

```
Useful at positions: 5
P@5 = 1/5 = 0.2
MAP = 0.2 / 1 = 0.2
```

Final Score: 0.2
The only useful chunk is at the very bottom.
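All three scenario scores follow from the same formula, and can be verified with a few lines of plain Python. This is an illustrative sketch independent of the axion library:

```python
def mean_average_precision(verdicts):
    """MAP over ordered usefulness verdicts (True = useful)."""
    useful_seen, precisions = 0, []
    for k, is_useful in enumerate(verdicts, start=1):
        if is_useful:
            useful_seen += 1
            precisions.append(useful_seen / k)  # Precision@k
    return sum(precisions) / len(precisions) if precisions else 0.0


# Usefulness verdicts for each scenario's retrieved chunks, in rank order.
scenarios = {
    "perfect": [True, True, True, False, False],    # Scenario 1
    "mixed":   [False, False, True, True],          # Scenario 2
    "poor":    [False, False, False, False, True],  # Scenario 3
}

for name, verdicts in scenarios.items():
    print(name, round(mean_average_precision(verdicts), 2))
# perfect 1.0
# mixed 0.42
# poor 0.2
```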
Why It Matters¶
- **Ranking quality:** Measures not just what you retrieve, but how well you rank it. Critical for top-k systems.
- **Limited context windows:** When the context window is limited, putting useful chunks first means better answers faster.
- **Re-ranker tuning:** Directly measures re-ranking model performance; a low MAP means your re-ranker needs improvement.
Quick Reference¶
TL;DR
Contextual Precision = Are the most useful chunks ranked at the top?
- Use it when: Evaluating retrieval ranking, especially with limited context windows
- Score interpretation: Higher MAP = useful chunks appear earlier
- Key formula: Mean Average Precision over useful chunk positions
- API Reference
- Related Metrics: Contextual Ranking · Contextual Recall · Contextual Relevancy