
Contextual Precision

Evaluate whether useful context chunks are ranked higher
LLM-Powered · Knowledge · Single Turn · Retrieval

At a Glance

🎯
Score Range
0.0 ──────── 1.0
Mean Average Precision
⚑
Default Threshold
0.5
Pass/fail cutoff
📋
Required Inputs
query, expected_output, retrieved_content
Ground truth + context

What It Measures

Contextual Precision evaluates whether useful chunks are ranked higher in the retrieval results. Using Mean Average Precision (MAP), it rewards retrieval systems that place the most helpful documents at the top. A useful chunk is one that contributes to generating the expected answer.

Score Interpretation
1.0 All useful chunks at the top
0.7+ Good ranking, useful chunks near top
0.5 Mixed ranking quality
< 0.5 Useful chunks buried low in results
✅ Use When
  • Evaluating retrieval ranking quality
  • Tuning re-ranking algorithms
  • Testing with limited context windows
  • Optimizing for top-k retrieval
❌ Don't Use When
  • No expected_output available
  • Chunk order doesn't matter
  • Using all retrieved chunks equally
  • Single-chunk retrieval

RAG Evaluation Suite

Contextual Precision asks: "Are the most useful chunks ranked first?"

How It Works

The metric evaluates chunk usefulness for generating the expected answer, then calculates MAP.

Step-by-Step Process

flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[Expected Output]
        C[Retrieved Chunks in Order]
    end

    subgraph EVALUATE["⚖️ Step 1: Usefulness Check"]
        D[RAGAnalyzer Engine]
        E1["Chunk 1: U/✗"]
        E2["Chunk 2: U/✗"]
        E3["Chunk 3: U/✗"]
        EN["..."]
    end

    subgraph MAP["📊 Step 2: Calculate MAP"]
        F["For each useful chunk at position k"]
        G["Precision@k = useful_seen / k"]
        H["Average all Precision@k values"]
        I["Final MAP Score"]
    end

    A & B & C --> D
    D --> E1 & E2 & E3 & EN
    E1 & E2 & E3 & EN --> F
    F --> G
    G --> H
    H --> I

    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EVALUATE stroke:#f59e0b,stroke-width:2px
    style MAP stroke:#10b981,stroke-width:2px
    style I fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff

Mean Average Precision rewards useful chunks appearing early in the ranking.

Example with 5 chunks (U = useful, X = not useful):

Position:  1    2    3    4    5
Chunks:   [U]  [X]  [U]  [X]  [U]

Precision@1 = 1/1 = 1.0   (first useful at position 1)
Precision@3 = 2/3 = 0.67  (second useful at position 3)
Precision@5 = 3/5 = 0.6   (third useful at position 5)

MAP = (1.0 + 0.67 + 0.6) / 3 = 0.76
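The worked example above can be reproduced in a few lines of plain Python. This is a minimal sketch of the MAP arithmetic only; in the actual metric, an LLM first judges each chunk's usefulness:

```python
def map_score(usefulness: list[bool]) -> float:
    """Mean Average Precision over ordered per-chunk usefulness verdicts."""
    useful_seen = 0
    precisions = []
    for k, is_useful in enumerate(usefulness, start=1):
        if is_useful:
            useful_seen += 1
            precisions.append(useful_seen / k)  # Precision@k at this position
    # With no useful chunks there is nothing to average
    return sum(precisions) / len(precisions) if precisions else 0.0

# [U] [X] [U] [X] [U] from the example above
print(round(map_score([True, False, True, False, True]), 2))  # 0.76
```

Reordering the same verdicts changes the score, which is exactly what the metric rewards: `map_score([True, True, True, False, False])` is 1.0.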

✅ USEFUL
+P@k

Chunk helps generate the expected answer. Contributes to MAP score.

❌ NOT USEFUL
0

Chunk doesn't contribute to the answer. Dilutes precision.

Score Formula

MAP = sum(Precision@k for each useful chunk) / total_useful_chunks

Configuration

Parameter Type Default Description
mode EvaluationMode GRANULAR Evaluation detail level
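A hedged configuration sketch follows. The `mode` parameter and `GRANULAR` default come from the table above; the import path for `EvaluationMode` is an assumption, so check the library's actual API:

```python
from axion.metrics import ContextualPrecision
# Assumption: EvaluationMode is exposed next to the metric class
from axion.metrics import EvaluationMode

# GRANULAR is the documented default, shown here explicitly
metric = ContextualPrecision(mode=EvaluationMode.GRANULAR)
```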

Usefulness vs Relevancy

  • Relevancy (Contextual Relevancy): "Is this chunk about the topic?"
  • Usefulness (Contextual Precision): "Does this chunk help generate the correct answer?"

A chunk can be relevant but not useful (e.g., background info that isn't needed).


Code Examples

from axion.metrics import ContextualPrecision
from axion.dataset import DatasetItem

metric = ContextualPrecision()

item = DatasetItem(
    query="Who invented the telephone?",
    expected_output="Alexander Graham Bell invented the telephone in 1876.",
    retrieved_content=[
        "Alexander Graham Bell invented the telephone.",  # Useful
        "The telephone revolutionized communication.",    # Not useful
        "Bell patented it in 1876.",                      # Useful
    ],
)

result = await metric.execute(item)
print(result.pretty())
# MAP: (1/1 + 2/3) / 2 = 0.83
from axion.metrics import ContextualPrecision
from axion.dataset import DatasetItem

metric = ContextualPrecision()

# Perfect ranking: useful chunks first
perfect_order = DatasetItem(
    query="What is Python?",
    expected_output="Python is a programming language created by Guido van Rossum.",
    retrieved_content=[
        "Python is a high-level programming language.",   # Useful (pos 1)
        "Guido van Rossum created Python.",               # Useful (pos 2)
        "Programming is fun.",                            # Not useful
    ],
)
# MAP = (1/1 + 2/2) / 2 = 1.0

# Poor ranking: useful chunks last
poor_order = DatasetItem(
    query="What is Python?",
    expected_output="Python is a programming language created by Guido van Rossum.",
    retrieved_content=[
        "Programming is fun.",                            # Not useful
        "Python is a high-level programming language.",   # Useful (pos 2)
        "Guido van Rossum created Python.",               # Useful (pos 3)
    ],
)
# MAP = (1/2 + 2/3) / 2 = 0.58
from axion.metrics import ContextualPrecision
from axion.runners import MetricRunner

metric = ContextualPrecision()
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)

for item_result in results:
    print(f"MAP Score: {item_result.score}")
    print(f"Useful chunks: {item_result.signals.useful_chunks}/{item_result.signals.total_chunks}")
    print(f"First useful at position: {item_result.signals.first_useful_position}")
    for i, chunk in enumerate(item_result.signals.chunk_breakdown):
        status = "✅" if chunk.is_useful else "❌"
        print(f"  {i+1}. {status} {chunk.chunk_text[:40]}...")

Metric Diagnostics

Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given, with no black boxes.

result = await metric.execute(item)
print(result.pretty())      # Human-readable summary
result.signals              # Full diagnostic breakdown
📊 ContextualPrecisionResult Structure
ContextualPrecisionResult(
{
    "map_score": 0.83,
    "total_chunks": 3,
    "useful_chunks": 2,
    "first_useful_position": 1,
    "chunk_breakdown": [
        {
            "chunk_text": "Alexander Graham Bell invented the telephone.",
            "is_useful": true,
            "position": 1
        },
        {
            "chunk_text": "The telephone revolutionized communication.",
            "is_useful": false,
            "position": 2
        },
        {
            "chunk_text": "Bell patented it in 1876.",
            "is_useful": true,
            "position": 3
        }
    ]
}
)

Signal Fields

Field Type Description
map_score float Mean Average Precision (0.0-1.0)
total_chunks int Total chunks retrieved
useful_chunks int Chunks useful for generating answer
first_useful_position int Rank of first useful chunk
chunk_breakdown List Per-chunk verdict details

Chunk Breakdown Fields

Field Type Description
chunk_text str The retrieved chunk content
is_useful bool Whether chunk helps generate expected answer
position int Rank position in retrieval results
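For intuition, the diagnostic fields above can be derived from the ordered verdicts alone. The following is a hypothetical plain-Python sketch (the real `signals` object is built by the metric itself; the names here simply mirror the tables):

```python
def build_signals(chunks: list[str], verdicts: list[bool], map_score: float) -> dict:
    """Assemble a dict shaped like the signals structure documented above."""
    breakdown = [
        {"chunk_text": text, "is_useful": useful, "position": pos}
        for pos, (text, useful) in enumerate(zip(chunks, verdicts), start=1)
    ]
    first_useful = next((e["position"] for e in breakdown if e["is_useful"]), None)
    return {
        "map_score": map_score,
        "total_chunks": len(chunks),
        "useful_chunks": sum(verdicts),
        "first_useful_position": first_useful,
        "chunk_breakdown": breakdown,
    }

signals = build_signals(
    [
        "Alexander Graham Bell invented the telephone.",
        "The telephone revolutionized communication.",
        "Bell patented it in 1876.",
    ],
    [True, False, True],
    0.83,
)
```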

Example Scenarios

✅ Scenario 1: Perfect Precision (Score: 1.0)

Useful Chunks Ranked First

Query:

"What are the three states of matter?"

Expected Output:

"The three states of matter are solid, liquid, and gas."

Retrieved Context (in order):

  1. "Matter exists in three states: solid, liquid, and gas." ✅
  2. "Solids have fixed shape, liquids take container shape." ✅
  3. "Gases expand to fill available space." ✅
  4. "Matter is anything that has mass." ❌
  5. "Chemistry is the study of matter." ❌

MAP Calculation:

Useful at positions: 1, 2, 3
P@1 = 1/1 = 1.0
P@2 = 2/2 = 1.0
P@3 = 3/3 = 1.0
MAP = (1.0 + 1.0 + 1.0) / 3 = 1.0

Final Score: 1.0

⚠️ Scenario 2: Mixed Precision (Score: 0.42)

Useful Chunks Buried

Query:

"Who wrote Romeo and Juliet?"

Expected Output:

"William Shakespeare wrote Romeo and Juliet."

Retrieved Context (in order):

  1. "Shakespeare was born in Stratford-upon-Avon." ❌
  2. "Romeo and Juliet is a famous tragedy." ❌
  3. "William Shakespeare wrote Romeo and Juliet." ✅
  4. "The play was written in the 1590s." ✅

MAP Calculation:

Useful at positions: 3, 4
P@3 = 1/3 = 0.33
P@4 = 2/4 = 0.50
MAP = (0.33 + 0.50) / 2 = 0.42

Final Score: 0.42

Key information buried at positions 3-4 instead of 1-2.

❌ Scenario 3: Poor Precision (Score: 0.2)

Useful Chunk at Bottom

Query:

"What is the speed of light?"

Expected Output:

"The speed of light is approximately 299,792 km/s."

Retrieved Context (in order):

  1. "Light is a form of electromagnetic radiation." ❌
  2. "Light travels in waves." ❌
  3. "Light can be refracted through prisms." ❌
  4. "Light behaves as both particles and waves." ❌
  5. "The speed of light is about 300,000 km/s in vacuum." ✅

MAP Calculation:

Useful at position: 5
P@5 = 1/5 = 0.2
MAP = 0.2 / 1 = 0.2

Final Score: 0.2

The only useful chunk is at the very bottom.


Why It Matters

📊 Ranking Quality

Measures not just what you retrieve, but how well you rank it. Critical for top-k systems.

⚑ Efficiency

When context windows are limited, having useful chunks first means better answers faster.

🔧 Re-ranking Tuning

Directly measures re-ranking model performance. Low MAP = improve your re-ranker.


Quick Reference

TL;DR

Contextual Precision = Are the most useful chunks ranked at the top?

  • Use it when: Evaluating retrieval ranking, especially with limited context windows
  • Score interpretation: Higher MAP = useful chunks appear earlier
  • Key formula: Mean Average Precision over useful chunk positions