Contextual Recall¶
LLM-Powered · Knowledge · Single Turn · Retrieval
At a Glance¶
Score Range
0.0 – 1.0 · Ratio of supported statements
Default Threshold
0.5 · Pass/fail cutoff
Required Inputs
`expected_output`, `retrieved_content` · Ground truth required
What It Measures
Contextual Recall evaluates whether the retrieved context contains sufficient information to support the expected answer. It extracts statements from the ground truth and checks if each is supported by the retrieved chunks. High recall means the retrieval didn't miss important information.
| Score | Interpretation |
|---|---|
| 1.0 | All expected facts are in retrieved context |
| 0.7+ | Most expected facts supported, minor gaps |
| 0.5 | Half the expected facts missing from context |
| < 0.5 | Significant information not retrieved |
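The score reduces to a simple ratio of supported statements to total statements. A minimal sketch of that arithmetic (`recall_score` is an illustrative helper, not part of the axion API):

```python
# Illustrative sketch of how the final score is computed: the score is
# simply supported statements / total statements. This helper is made up
# for illustration and is not part of the library.
def recall_score(verdicts: list[bool]) -> float:
    """verdicts holds one supported/unsupported flag per ground-truth statement."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

print(recall_score([True, False]))        # 0.5 -> half the facts were retrieved
print(recall_score([True, True, True]))   # 1.0 -> perfect recall
```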
Use it when:

- You have ground truth answers
- Evaluating retrieval completeness
- Testing whether critical info is retrieved
- Debugging "information not found" errors

Avoid it when:

- No expected_output is available
- Multiple valid answers exist
- Testing retrieval ranking (use Precision)
- Evaluating generation quality
RAG Evaluation Suite
Contextual Recall asks: "Does the context contain everything needed to answer correctly?"
Related retrieval metrics:
- Contextual Relevancy: Are chunks relevant to the query?
- Contextual Precision: Are useful chunks ranked higher?
- Contextual Sufficiency: Is there enough info overall?
How It Works¶
The metric extracts factual statements from the expected answer and checks context support.
Step-by-Step Process¶
flowchart TD
    subgraph INPUT["Inputs"]
        A[Expected Output]
        B[Retrieved Context]
    end
    subgraph EXTRACT["Step 1: Statement Extraction"]
        C[Extract Factual Statements]
        D["Ground Truth Statements"]
    end
    subgraph CHECK["Step 2: Support Check"]
        E[Check Against Context]
        F1["Stmt 1: ✓/✗"]
        F2["Stmt 2: ✓/✗"]
        F3["Stmt 3: ✓/✗"]
        FN["..."]
    end
    subgraph SCORE["Step 3: Scoring"]
        G["Count Supported"]
        H["Calculate Ratio"]
        I["Final Score"]
    end
    A --> C
    C --> D
    D --> E
    B --> E
    E --> F1 & F2 & F3 & FN
    F1 & F2 & F3 & FN --> G
    G --> H
    H --> I
    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EXTRACT stroke:#3b82f6,stroke-width:2px
    style CHECK stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style I fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff
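The three steps above can be sketched end to end. Note this is a toy approximation: the real metric delegates Step 1 (statement extraction) and Step 2 (the support check) to an LLM judge, while this sketch substitutes sentence splitting and word-overlap matching purely for illustration.

```python
# Toy end-to-end sketch of the pipeline above. The real metric uses an
# LLM for extraction and support checks; here they are approximated with
# sentence splitting and word overlap, for illustration only.
def _words(text: str) -> set[str]:
    return {w.strip(".,").lower() for w in text.split() if w.strip(".,")}

def extract_statements(expected_output: str) -> list[str]:
    # Step 1: treat each sentence of the ground truth as one statement.
    return [s.strip() for s in expected_output.split(".") if s.strip()]

def is_supported(statement: str, chunks: list[str], threshold: float = 0.7) -> bool:
    # Step 2: a statement counts as supported if most of its words
    # appear in a single retrieved chunk.
    words = _words(statement)
    return any(len(words & _words(c)) / len(words) >= threshold for c in chunks)

def contextual_recall(expected_output: str, chunks: list[str]) -> float:
    # Step 3: supported / total.
    statements = extract_statements(expected_output)
    verdicts = [is_supported(s, chunks) for s in statements]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

score = contextual_recall(
    "Paris is the capital of France. It has a population of about 2 million.",
    ["Paris is the capital and largest city of France.",
     "The Eiffel Tower is located in Paris."],
)
print(score)  # 0.5: the population fact is missing from the context
```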
Configuration¶

| Parameter | Type | Default | Description |
|---|---|---|---|
| `mode` | `EvaluationMode` | `GRANULAR` | Evaluation detail level |
Ground Truth Focus
Unlike Contextual Relevancy (which asks "is this chunk relevant?"), Recall asks "is this expected fact present?" It measures retrieval from the answer's perspective.
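That directional difference can be made concrete: recall iterates over expected-answer statements, while relevancy iterates over retrieved chunks. Both helpers below are illustrative stand-ins, with the actual LLM judgments abstracted into callbacks.

```python
# Illustrative contrast (not library code): recall is answer-centric,
# relevancy is chunk-centric. `supports` and `is_relevant` stand in for
# the LLM judgments.
def recall(expected_statements, chunks, supports):
    # Answer's perspective: what fraction of expected facts appear in context?
    return sum(supports(s, chunks) for s in expected_statements) / len(expected_statements)

def relevancy(query, chunks, is_relevant):
    # Context's perspective: what fraction of retrieved chunks matter to the query?
    return sum(is_relevant(query, c) for c in chunks) / len(chunks)

# Toy judgments: substring containment instead of an LLM.
supports = lambda s, chunks: any(s in c for c in chunks)
is_relevant = lambda q, c: q in c

chunks = ["Paris is the capital of France", "The Eiffel Tower is tall"]
print(recall(["capital of France", "population of 2 million"], chunks, supports))  # 0.5
print(relevancy("Paris", chunks, is_relevant))  # 0.5
```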
Code Examples¶
from axion.metrics import ContextualRecall
from axion.dataset import DatasetItem
metric = ContextualRecall()
item = DatasetItem(
    expected_output="Paris is the capital of France. It has a population of about 2 million.",
    retrieved_content=[
        "Paris is the capital and largest city of France.",
        "The Eiffel Tower is located in Paris.",
    ],
)
result = await metric.execute(item)
print(result.pretty())
# Score: 0.5 (capital fact supported, population fact missing)
from axion.metrics import ContextualRecall
from axion.dataset import DatasetItem
metric = ContextualRecall()
item = DatasetItem(
    expected_output="Python was created by Guido van Rossum in 1991.",
    retrieved_content=[
        "Python is a programming language created by Guido van Rossum.",
        "Python was first released in 1991.",
        "Python emphasizes code readability.",
    ],
)
result = await metric.execute(item)
# Score: 1.0 (both creator and year facts are in context)
from axion.metrics import ContextualRecall
from axion.runners import MetricRunner
metric = ContextualRecall()
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)
for item_result in results:
    print(f"Score: {item_result.score}")
    print(f"Supported: {item_result.signals.supported_gt_statements}/{item_result.signals.total_gt_statements}")
    for stmt in item_result.signals.statement_breakdown:
        status = "✓" if stmt.is_supported else "✗"
        print(f"  {status} {stmt.statement_text}")
Metric Diagnostics¶
Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given, with no black boxes.
result = await metric.execute(item)
print(result.pretty()) # Human-readable summary
result.signals # Full diagnostic breakdown
ContextualRecallResult Structure
ContextualRecallResult(
    {
        "recall_score": 0.5,
        "total_gt_statements": 2,
        "supported_gt_statements": 1,
        "statement_breakdown": [
            {
                "statement_text": "Paris is the capital of France",
                "is_supported": true
            },
            {
                "statement_text": "Paris has a population of about 2 million",
                "is_supported": false
            }
        ]
    }
)
Signal Fields¶
| Field | Type | Description |
|---|---|---|
| `recall_score` | `float` | Ratio of supported statements (0.0–1.0) |
| `total_gt_statements` | `int` | Number of factual statements extracted from the expected output |
| `supported_gt_statements` | `int` | Statements found in context |
| `statement_breakdown` | `List` | Per-statement verdict details |
Statement Breakdown Fields¶
| Field | Type | Description |
|---|---|---|
| `statement_text` | `str` | The ground truth statement |
| `is_supported` | `bool` | Whether the statement is supported by the context |
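The breakdown is what makes the metric actionable: the unsupported statements are exactly the facts your retriever failed to surface. A sketch of that post-processing, using a plain dict in place of the `ContextualRecallResult` shown above:

```python
# Post-processing sketch: list the facts the retriever missed. A plain
# dict stands in for the signals object shown above.
signals = {
    "recall_score": 0.5,
    "statement_breakdown": [
        {"statement_text": "Paris is the capital of France", "is_supported": True},
        {"statement_text": "Paris has a population of about 2 million", "is_supported": False},
    ],
}

missing = [s["statement_text"]
           for s in signals["statement_breakdown"]
           if not s["is_supported"]]
print(missing)  # ['Paris has a population of about 2 million']
```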
Example Scenarios¶
✅ Scenario 1: Perfect Recall (Score: 1.0)
All Facts Retrieved
Expected Output:
"The Great Wall of China is over 13,000 miles long. It was built over many centuries, starting in the 7th century BC."
Retrieved Context:
- "The Great Wall of China stretches over 13,000 miles."
- "Construction began in the 7th century BC."
- "Multiple dynasties contributed to building the wall over centuries."
Analysis:
| Statement | Verdict |
|---|---|
| Over 13,000 miles long | ✓ Supported |
| Built over many centuries | ✓ Supported |
| Starting in 7th century BC | ✓ Supported |
Final Score: 3 / 3 = 1.0
⚠️ Scenario 2: Partial Recall (Score: 0.33)
Missing Information
Expected Output:
"Water boils at 100°C at sea level. At higher altitudes, it boils at lower temperatures due to reduced pressure."
Retrieved Context:
- "Water boils at 100 degrees Celsius under standard conditions."
- "Water is essential for life on Earth."
Analysis:
| Statement | Verdict |
|---|---|
| Boils at 100°C at sea level | ✓ Supported |
| Higher altitudes = lower boiling point | ✗ Not found |
| Due to reduced pressure | ✗ Not found |
Final Score: 1 / 3 = 0.33
The altitude/pressure relationship wasn't retrieved.
❌ Scenario 3: Poor Recall (Score: 0.0)
Critical Information Missing
Expected Output:
"Einstein developed the theory of relativity and won the Nobel Prize for the photoelectric effect."
Retrieved Context:
- "Albert Einstein was a famous physicist."
- "Einstein was born in Germany in 1879."
Analysis:
| Statement | Verdict |
|---|---|
| Developed theory of relativity | ✗ Not found |
| Won Nobel Prize | ✗ Not found |
| For photoelectric effect | ✗ Not found |
Final Score: 0 / 3 = 0.0
None of the key facts from the expected answer were retrieved.
Why It Matters¶
Ensures retrieval captures all necessary information, not just some of it.
Evaluates retrieval from the answer's perspective: did we get what's needed to answer correctly?
Identifies exactly which expected facts weren't retrieved, guiding retrieval improvements.
Quick Reference¶
TL;DR
Contextual Recall = Does the retrieved context contain all facts from the expected answer?
- Use it when: You have ground truth and want to measure retrieval completeness
- Score interpretation: Higher = more expected facts found in context
- Key insight: Low recall means the retriever missed important information
- API Reference
- Related Metrics: Contextual Precision · Contextual Relevancy · Answer Completeness