Contextual Precision¶
LLM-Powered · Knowledge · Single-Turn · Retrieval
At a Glance¶
- **Score Range:** 0.0 – 1.0 (Mean Average Precision)
- **Default Threshold:** 0.5 (pass/fail cutoff)
- **Required Inputs:** `query`, `expected_output`, `retrieved_content` (ground truth + context)
What It Measures¶
Contextual Precision evaluates whether useful chunks are ranked higher in the retrieval results. Using Mean Average Precision (MAP), it rewards retrieval systems that place the most helpful documents at the top. A useful chunk is one that contributes to generating the expected answer.
| Score | Interpretation |
|---|---|
| 1.0 | All useful chunks at the top |
| 0.7+ | Good ranking, useful chunks near top |
| 0.5 | Mixed ranking quality |
| < 0.5 | Useful chunks buried low in results |
**Use when:**

- Evaluating retrieval ranking quality
- Tuning re-ranking algorithms
- Testing with limited context windows
- Optimizing for top-k retrieval

**Avoid when:**

- No `expected_output` is available
- Chunk order doesn't matter
- All retrieved chunks are used equally
- Retrieval returns only a single chunk
RAG Evaluation Suite
Contextual Precision asks: "Are the most useful chunks ranked first?"
Related retrieval metrics:
- Contextual Relevancy: Are chunks relevant to the query?
- Contextual Recall: Do chunks cover the expected answer?
- Contextual Ranking: Are relevant chunks ranked higher?
How It Works¶
The metric evaluates chunk usefulness for generating the expected answer, then calculates MAP.
Step-by-Step Process¶
```mermaid
flowchart TD
    subgraph INPUT["Inputs"]
        A[Query]
        B[Expected Output]
        C[Retrieved Chunks in Order]
    end
    subgraph EVALUATE["Step 1: Usefulness Check"]
        D[RAGAnalyzer Engine]
        E1["Chunk 1: U/✗"]
        E2["Chunk 2: U/✗"]
        E3["Chunk 3: U/✗"]
        EN["..."]
    end
    subgraph MAP["Step 2: Calculate MAP"]
        F["For each useful chunk at position k"]
        G["Precision@k = useful_seen / k"]
        H["Average all Precision@k values"]
        I["Final MAP Score"]
    end
    A & B & C --> D
    D --> E1 & E2 & E3 & EN
    E1 & E2 & E3 & EN --> F
    F --> G
    G --> H
    H --> I
    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EVALUATE stroke:#f59e0b,stroke-width:2px
    style MAP stroke:#10b981,stroke-width:2px
    style I fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff
```
Mean Average Precision rewards useful chunks appearing early in the ranking.
Example with 5 chunks (U = useful, X = not useful):

```
Position:     1    2    3    4    5
Chunks:      [U]  [X]  [U]  [X]  [U]

Precision@1 = 1/1 = 1.0   (first useful chunk at position 1)
Precision@3 = 2/3 = 0.67  (second useful chunk at position 3)
Precision@5 = 3/5 = 0.6   (third useful chunk at position 5)

MAP = (1.0 + 0.67 + 0.6) / 3 = 0.76
```
- **U (useful):** the chunk helps generate the expected answer and contributes to the MAP score.
- **X (not useful):** the chunk doesn't contribute to the answer and dilutes precision.
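The calculation above can be sketched in a few lines of plain Python. This is an illustrative implementation of the MAP formula, not the library's internal code:

```python
def mean_average_precision(verdicts: list[bool]) -> float:
    """MAP over an ordered list of usefulness verdicts.

    verdicts[k-1] is True if the chunk at rank k is useful.
    """
    useful_seen = 0
    precisions = []
    for k, is_useful in enumerate(verdicts, start=1):
        if is_useful:
            useful_seen += 1
            precisions.append(useful_seen / k)  # Precision@k at each useful rank
    # No useful chunks at all -> score 0.0
    return sum(precisions) / len(precisions) if precisions else 0.0


# The [U] [X] [U] [X] [U] example from above:
print(round(mean_average_precision([True, False, True, False, True]), 2))  # 0.76
```

Note that only ranks holding useful chunks contribute a Precision@k term; not-useful chunks lower the score indirectly by inflating `k` for every useful chunk that follows them.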
Configuration¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `mode` | `EvaluationMode` | `GRANULAR` | Evaluation detail level |
Usefulness vs Relevancy
- Relevancy (Contextual Relevancy): "Is this chunk about the topic?"
- Usefulness (Contextual Precision): "Does this chunk help generate the correct answer?"
A chunk can be relevant but not useful (e.g., background info that isn't needed).
Code Examples¶
```python
from axion.metrics import ContextualPrecision
from axion.dataset import DatasetItem

metric = ContextualPrecision()

item = DatasetItem(
    query="Who invented the telephone?",
    expected_output="Alexander Graham Bell invented the telephone in 1876.",
    retrieved_content=[
        "Alexander Graham Bell invented the telephone.",  # Useful
        "The telephone revolutionized communication.",    # Not useful
        "Bell patented it in 1876.",                      # Useful
    ],
)

result = await metric.execute(item)
print(result.pretty())
# MAP: (1/1 + 2/3) / 2 = 0.83
```
```python
from axion.metrics import ContextualPrecision
from axion.dataset import DatasetItem

metric = ContextualPrecision()

# Perfect ranking: useful chunks first
perfect_order = DatasetItem(
    query="What is Python?",
    expected_output="Python is a programming language created by Guido van Rossum.",
    retrieved_content=[
        "Python is a high-level programming language.",  # Useful (pos 1)
        "Guido van Rossum created Python.",              # Useful (pos 2)
        "Programming is fun.",                           # Not useful
    ],
)
# MAP = (1/1 + 2/2) / 2 = 1.0

# Poor ranking: useful chunks last
poor_order = DatasetItem(
    query="What is Python?",
    expected_output="Python is a programming language created by Guido van Rossum.",
    retrieved_content=[
        "Programming is fun.",                           # Not useful
        "Python is a high-level programming language.",  # Useful (pos 2)
        "Guido van Rossum created Python.",              # Useful (pos 3)
    ],
)
# MAP = (1/2 + 2/3) / 2 = 0.58
```
```python
from axion.metrics import ContextualPrecision
from axion.runners import MetricRunner

metric = ContextualPrecision()
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)

for item_result in results:
    print(f"MAP Score: {item_result.score}")
    print(f"Useful chunks: {item_result.signals.useful_chunks}/{item_result.signals.total_chunks}")
    print(f"First useful at position: {item_result.signals.first_useful_position}")
    for i, chunk in enumerate(item_result.signals.chunk_breakdown):
        status = "✓" if chunk.is_useful else "✗"
        print(f"  {i+1}. {status} {chunk.chunk_text[:40]}...")
```
Metric Diagnostics¶
Every evaluation is fully interpretable. Access detailed diagnostic results via `result.signals` to understand exactly why a score was given, with no black boxes.
```python
result = await metric.execute(item)
print(result.pretty())  # Human-readable summary
result.signals          # Full diagnostic breakdown
```
ContextualPrecisionResult Structure
```
ContextualPrecisionResult(
    {
        "map_score": 0.83,
        "total_chunks": 3,
        "useful_chunks": 2,
        "first_useful_position": 1,
        "chunk_breakdown": [
            {
                "chunk_text": "Alexander Graham Bell invented the telephone.",
                "is_useful": true,
                "position": 1
            },
            {
                "chunk_text": "The telephone revolutionized communication.",
                "is_useful": false,
                "position": 2
            },
            {
                "chunk_text": "Bell patented it in 1876.",
                "is_useful": true,
                "position": 3
            }
        ]
    }
)
```
Signal Fields¶
| Field | Type | Description |
|---|---|---|
| `map_score` | `float` | Mean Average Precision (0.0–1.0) |
| `total_chunks` | `int` | Total chunks retrieved |
| `useful_chunks` | `int` | Chunks useful for generating the answer |
| `first_useful_position` | `int` | Rank of the first useful chunk |
| `chunk_breakdown` | `List` | Per-chunk verdict details |
Chunk Breakdown Fields¶
| Field | Type | Description |
|---|---|---|
| `chunk_text` | `str` | The retrieved chunk content |
| `is_useful` | `bool` | Whether the chunk helps generate the expected answer |
| `position` | `int` | Rank position in retrieval results |
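Because `chunk_breakdown` carries per-chunk verdicts and positions, you can recompute the MAP score from it as a sanity check. The sketch below uses a plain dict shaped like the documented structure rather than a live result object:

```python
# A plain dict mirroring the documented ContextualPrecisionResult fields.
signals = {
    "map_score": 0.83,
    "chunk_breakdown": [
        {"chunk_text": "Alexander Graham Bell invented the telephone.",
         "is_useful": True, "position": 1},
        {"chunk_text": "The telephone revolutionized communication.",
         "is_useful": False, "position": 2},
        {"chunk_text": "Bell patented it in 1876.",
         "is_useful": True, "position": 3},
    ],
}

# Walk chunks in rank order, accumulating Precision@k at each useful rank.
useful_seen, precisions = 0, []
for chunk in sorted(signals["chunk_breakdown"], key=lambda c: c["position"]):
    if chunk["is_useful"]:
        useful_seen += 1
        precisions.append(useful_seen / chunk["position"])

recomputed = sum(precisions) / len(precisions)
print(round(recomputed, 2))  # 0.83, matching the reported map_score
```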
Example Scenarios¶
✅ Scenario 1: Perfect Precision (Score: 1.0)
Useful Chunks Ranked First
Query:
"What are the three states of matter?"
Expected Output:
"The three states of matter are solid, liquid, and gas."
Retrieved Context (in order):
- "Matter exists in three states: solid, liquid, and gas." ✅
- "Solids have fixed shape, liquids take container shape." ✅
- "Gases expand to fill available space." ✅
- "Matter is anything that has mass." ❌
- "Chemistry is the study of matter." ❌
MAP Calculation:

```
Useful at positions: 1, 2, 3
P@1 = 1/1 = 1.0
P@2 = 2/2 = 1.0
P@3 = 3/3 = 1.0
MAP = (1.0 + 1.0 + 1.0) / 3 = 1.0
```

Final Score: 1.0
⚠️ Scenario 2: Mixed Precision (Score: 0.42)
Useful Chunks Buried
Query:
"Who wrote Romeo and Juliet?"
Expected Output:
"William Shakespeare wrote Romeo and Juliet."
Retrieved Context (in order):
- "Shakespeare was born in Stratford-upon-Avon." ❌
- "Romeo and Juliet is a famous tragedy." ❌
- "William Shakespeare wrote Romeo and Juliet." ✅
- "The play was written in the 1590s." ✅
MAP Calculation:

```
Useful at positions: 3, 4
P@3 = 1/3 = 0.33
P@4 = 2/4 = 0.5
MAP = (0.33 + 0.5) / 2 = 0.42
```

Final Score: 0.42
Key information buried at positions 3-4 instead of 1-2.
❌ Scenario 3: Poor Precision (Score: 0.2)
Useful Chunk at Bottom
Query:
"What is the speed of light?"
Expected Output:
"The speed of light is approximately 299,792 km/s."
Retrieved Context (in order):
- "Light is a form of electromagnetic radiation." ❌
- "Light travels in waves." ❌
- "Light can be refracted through prisms." ❌
- "Light behaves as both particles and waves." ❌
- "The speed of light is about 300,000 km/s in vacuum." ✅
MAP Calculation:

```
Useful at positions: 5
P@5 = 1/5 = 0.2
MAP = 0.2 / 1 = 0.2
```

Final Score: 0.2
The only useful chunk is at the very bottom.
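All three scenario scores follow from the same formula, and can be verified with a few lines of plain Python. This is an illustrative sketch independent of the axion library:

```python
def mean_average_precision(verdicts):
    """MAP over ordered usefulness verdicts (True = useful)."""
    useful_seen, precisions = 0, []
    for k, is_useful in enumerate(verdicts, start=1):
        if is_useful:
            useful_seen += 1
            precisions.append(useful_seen / k)  # Precision@k
    return sum(precisions) / len(precisions) if precisions else 0.0


# Usefulness verdicts for each scenario's retrieved chunks, in rank order.
scenarios = {
    "perfect": [True, True, True, False, False],    # Scenario 1
    "mixed":   [False, False, True, True],          # Scenario 2
    "poor":    [False, False, False, False, True],  # Scenario 3
}

for name, verdicts in scenarios.items():
    print(name, round(mean_average_precision(verdicts), 2))
# perfect 1.0
# mixed 0.42
# poor 0.2
```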
Why It Matters¶
- **Ranking quality:** Measures not just what you retrieve, but how well you rank it. Critical for top-k systems.
- **Limited context windows:** When the context window is limited, putting useful chunks first means better answers faster.
- **Re-ranker tuning:** Directly measures re-ranking model performance; a low MAP means your re-ranker needs improvement.
Quick Reference¶
TL;DR
Contextual Precision = Are the most useful chunks ranked at the top?
- Use it when: Evaluating retrieval ranking, especially with limited context windows
- Score interpretation: Higher MAP = useful chunks appear earlier
- Key formula: Mean Average Precision over useful chunk positions
- API Reference
- Related Metrics: Contextual Ranking · Contextual Recall · Contextual Relevancy