
Contextual Recall

Measure if retrieved context supports the expected answer
LLM-Powered · Knowledge · Single Turn · Retrieval

At a Glance

🎯
Score Range
0.0 ──────── 1.0
Ratio of supported statements
⚑
Default Threshold
0.5
Pass/fail cutoff
📋
Required Inputs
expected_output retrieved_content
Ground truth required

What It Measures

Contextual Recall evaluates whether the retrieved context contains sufficient information to support the expected answer. It extracts statements from the ground truth and checks if each is supported by the retrieved chunks. High recall means the retrieval didn't miss important information.

Score Interpretation
1.0 All expected facts are in retrieved context
0.7+ Most expected facts supported, minor gaps
0.5 Half the expected facts missing from context
< 0.5 Significant information not retrieved
✅ Use When
  • You have ground truth answers
  • Evaluating retrieval completeness
  • Testing if critical info is retrieved
  • Debugging "information not found" errors
❌ Don't Use When
  • No expected_output available
  • Multiple valid answers exist
  • Testing retrieval ranking (use Precision)
  • Evaluating generation quality

RAG Evaluation Suite

Contextual Recall asks: "Does the context contain everything needed to answer correctly?"


How It Works

The metric extracts factual statements from the expected answer and checks context support.

Step-by-Step Process

flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Expected Output]
        B[Retrieved Context]
    end

    subgraph EXTRACT["🔍 Step 1: Statement Extraction"]
        C[Extract Factual Statements]
        D["Ground Truth Statements"]
    end

    subgraph CHECK["⚖️ Step 2: Support Check"]
        E[Check Against Context]
        F1["Stmt 1: ✓/✗"]
        F2["Stmt 2: ✓/✗"]
        F3["Stmt 3: ✓/✗"]
        FN["..."]
    end

    subgraph SCORE["📊 Step 3: Scoring"]
        G["Count Supported"]
        H["Calculate Ratio"]
        I["Final Score"]
    end

    A --> C
    C --> D
    D --> E
    B --> E
    E --> F1 & F2 & F3 & FN
    F1 & F2 & F3 & FN --> G
    G --> H
    H --> I

    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EXTRACT stroke:#3b82f6,stroke-width:2px
    style CHECK stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style I fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff

Each ground truth statement receives a support verdict.

✅ SUPPORTED
1

Statement from expected answer is found in retrieved context.

❌ NOT SUPPORTED
0

Statement from expected answer is missing from retrieved context.

Score Formula

score = supported_statements / total_statements
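The formula reduces to a simple ratio. A minimal sketch in plain Python, where the booleans stand in for the LLM judge's per-statement support verdicts:

```python
# Minimal sketch of the scoring step. Each boolean stands in for the
# LLM judge's support verdict on one ground-truth statement
# (True = supported by the context, False = not supported).
def recall_score(verdicts: list[bool]) -> float:
    if not verdicts:
        return 0.0  # no ground-truth statements extracted
    return sum(verdicts) / len(verdicts)

# Two ground-truth statements, one supported by the context:
print(recall_score([True, False]))  # 0.5
```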

Configuration

Parameter | Type           | Default  | Description
mode      | EvaluationMode | GRANULAR | Evaluation detail level

Ground Truth Focus

Unlike Contextual Relevancy (which asks "is this chunk relevant?"), Recall asks "is this expected fact present?" It measures retrieval from the answer's perspective.
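This directional difference can be sketched with a toy support check. Here a naive substring match stands in for the LLM judge; `supports` and `contextual_recall` are illustrative helpers, not part of the library:

```python
# Toy support check -- a naive substring match standing in for the LLM
# judge, purely to show which direction the metric iterates in.
def supports(statement: str, chunks: list[str]) -> bool:
    return any(statement.lower() in chunk.lower() for chunk in chunks)

def contextual_recall(gt_statements: list[str], chunks: list[str]) -> float:
    # Recall iterates over the ANSWER's facts: is each one in the context?
    return sum(supports(s, chunks) for s in gt_statements) / len(gt_statements)

chunks = ["Paris is the capital and largest city of France."]
facts = ["Paris is the capital", "a population of about 2 million"]
print(contextual_recall(facts, chunks))  # 0.5 -- the population fact is missing
```

Contextual Relevancy would instead loop over the chunks and ask whether each one is relevant, so the two metrics can disagree: context can be all-relevant yet incomplete.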


Code Examples

from axion.metrics import ContextualRecall
from axion.dataset import DatasetItem

metric = ContextualRecall()

item = DatasetItem(
    expected_output="Paris is the capital of France. It has a population of about 2 million.",
    retrieved_content=[
        "Paris is the capital and largest city of France.",
        "The Eiffel Tower is located in Paris.",
    ],
)

result = await metric.execute(item)
print(result.pretty())
# Score: 0.5 (capital fact supported, population fact missing)

from axion.metrics import ContextualRecall
from axion.dataset import DatasetItem
metric = ContextualRecall()

item = DatasetItem(
    expected_output="Python was created by Guido van Rossum in 1991.",
    retrieved_content=[
        "Python is a programming language created by Guido van Rossum.",
        "Python was first released in 1991.",
        "Python emphasizes code readability.",
    ],
)

result = await metric.execute(item)
# Score: 1.0 (both creator and year facts are in context)

from axion.metrics import ContextualRecall
from axion.runners import MetricRunner

metric = ContextualRecall()
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)

for item_result in results:
    print(f"Score: {item_result.score}")
    print(f"Supported: {item_result.signals.supported_gt_statements}/{item_result.signals.total_gt_statements}")
    for stmt in item_result.signals.statement_breakdown:
        status = "✅" if stmt.is_supported else "❌"
        print(f"  {status} {stmt.statement_text}")

Metric Diagnostics

Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given, with no black boxes.

result = await metric.execute(item)
print(result.pretty())      # Human-readable summary
result.signals              # Full diagnostic breakdown
📊 ContextualRecallResult Structure
ContextualRecallResult(
{
    "recall_score": 0.5,
    "total_gt_statements": 2,
    "supported_gt_statements": 1,
    "statement_breakdown": [
        {
            "statement_text": "Paris is the capital of France",
            "is_supported": true
        },
        {
            "statement_text": "Paris has a population of about 2 million",
            "is_supported": false
        }
    ]
}
)

Signal Fields

Field                   | Type  | Description
recall_score            | float | Ratio of supported statements (0.0-1.0)
total_gt_statements     | int   | Factual statements extracted from the expected output
supported_gt_statements | int   | Statements found in the retrieved context
statement_breakdown     | List  | Per-statement verdict details

Statement Breakdown Fields

Field          | Type | Description
statement_text | str  | The ground truth statement
is_supported   | bool | Whether the statement is supported by the context

Example Scenarios

✅ Scenario 1: Perfect Recall (Score: 1.0)

All Facts Retrieved

Expected Output:

"The Great Wall of China is over 13,000 miles long. It was built over many centuries, starting in the 7th century BC."

Retrieved Context:

  1. "The Great Wall of China stretches over 13,000 miles."
  2. "Construction began in the 7th century BC."
  3. "Multiple dynasties contributed to building the wall over centuries."

Analysis:

Statement                  | Verdict
Over 13,000 miles long     | ✅ Supported
Built over many centuries  | ✅ Supported
Starting in 7th century BC | ✅ Supported

Final Score: 3 / 3 = 1.0

⚠️ Scenario 2: Partial Recall (Score: 0.33)

Missing Information

Expected Output:

"Water boils at 100°C at sea level. At higher altitudes, it boils at lower temperatures due to reduced pressure."

Retrieved Context:

  1. "Water boils at 100 degrees Celsius under standard conditions."
  2. "Water is essential for life on Earth."

Analysis:

Statement                               | Verdict
Boils at 100°C at sea level             | ✅ Supported
Higher altitudes = lower boiling point  | ❌ Not found
Due to reduced pressure                 | ❌ Not found

Final Score: 1 / 3 = 0.33

The altitude/pressure relationship wasn't retrieved.

❌ Scenario 3: Poor Recall (Score: 0.0)

Critical Information Missing

Expected Output:

"Einstein developed the theory of relativity and won the Nobel Prize for the photoelectric effect."

Retrieved Context:

  1. "Albert Einstein was a famous physicist."
  2. "Einstein was born in Germany in 1879."

Analysis:

Statement                      | Verdict
Developed theory of relativity | ❌ Not found
Won Nobel Prize                | ❌ Not found
For photoelectric effect       | ❌ Not found

Final Score: 0 / 3 = 0.0

None of the key facts from the expected answer were retrieved.


Why It Matters

πŸ” Completeness Check

Ensures retrieval captures all necessary information, not just some of it.

🎯 Answer-Focused

Evaluates retrieval from the answer's perspective: did we get what's needed to answer correctly?

πŸ› Debug Missing Info

Identifies exactly which expected facts weren't retrieved, guiding retrieval improvements.
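A sketch of such a debugging pass, using mocked signals shaped like the documented ContextualRecallResult fields (in real use, this object comes from `result.signals` rather than being built by hand):

```python
from types import SimpleNamespace

# Mocked signals shaped like the documented ContextualRecallResult
# fields; in real use this object comes from `result.signals`.
signals = SimpleNamespace(
    recall_score=0.5,
    statement_breakdown=[
        SimpleNamespace(statement_text="Paris is the capital of France",
                        is_supported=True),
        SimpleNamespace(statement_text="Paris has a population of about 2 million",
                        is_supported=False),
    ],
)

# Surface exactly which expected facts the retriever failed to return --
# these are the gaps to target when tuning chunking or retrieval.
missing = [s.statement_text
           for s in signals.statement_breakdown
           if not s.is_supported]
print(missing)  # ['Paris has a population of about 2 million']
```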


Quick Reference

TL;DR

Contextual Recall = Does the retrieved context contain all facts from the expected answer?

  • Use it when: You have ground truth and want to measure retrieval completeness
  • Score interpretation: Higher = more expected facts found in context
  • Key insight: Low recall means the retriever missed important information