
Contextual Recall

Measure if retrieved context supports the expected answer
LLM-Powered · Knowledge · Single Turn · Retrieval

At a Glance

🎯
Score Range
0.0 ──────── 1.0
Ratio of supported statements
⚑
Default Threshold
0.5
Pass/fail cutoff
📋
Required Inputs
expected_output retrieved_content
Ground truth required

What It Measures

Contextual Recall evaluates whether the retrieved context contains sufficient information to support the expected answer. It extracts statements from the ground truth and checks if each is supported by the retrieved chunks. High recall means the retrieval didn't miss important information.

Score Interpretation
1.0 All expected facts are in retrieved context
0.7+ Most expected facts supported, minor gaps
0.5 Half the expected facts missing from context
< 0.5 Significant information not retrieved
✅ Use When
  • You have ground truth answers
  • Evaluating retrieval completeness
  • Testing if critical info is retrieved
  • Debugging "information not found" errors
❌ Don't Use When
  • No expected_output available
  • Multiple valid answers exist
  • Testing retrieval ranking (use Precision)
  • Evaluating generation quality

RAG Evaluation Suite

Contextual Recall asks: "Does the context contain everything needed to answer correctly?"


How It Works

The metric extracts factual statements from the expected answer and checks context support.

Step-by-Step Process

flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Expected Output]
        B[Retrieved Context]
    end

    subgraph EXTRACT["🔍 Step 1: Statement Extraction"]
        C[Extract Factual Statements]
        D["Ground Truth Statements"]
    end

    subgraph CHECK["⚖️ Step 2: Support Check"]
        E[Check Against Context]
        F1["Stmt 1: ✓/✗"]
        F2["Stmt 2: ✓/✗"]
        F3["Stmt 3: ✓/✗"]
        FN["..."]
    end

    subgraph SCORE["📊 Step 3: Scoring"]
        G["Count Supported"]
        H["Calculate Ratio"]
        I["Final Score"]
    end

    A --> C
    C --> D
    D --> E
    B --> E
    E --> F1 & F2 & F3 & FN
    F1 & F2 & F3 & FN --> G
    G --> H
    H --> I

    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EXTRACT stroke:#3b82f6,stroke-width:2px
    style CHECK stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style I fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff

Each ground truth statement receives a support verdict.

✅ SUPPORTED
1

Statement from expected answer is found in retrieved context.

❌ NOT SUPPORTED
0

Statement from expected answer is missing from retrieved context.

Score Formula

score = supported_statements / total_statements
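The formula reduces to a simple ratio. A minimal sketch in plain Python, where the booleans stand in for the LLM judge's per-statement support verdicts:

```python
# Minimal sketch of the scoring step. Each boolean stands in for the
# LLM judge's support verdict on one ground-truth statement
# (True = supported by the context, False = not supported).
def recall_score(verdicts: list[bool]) -> float:
    if not verdicts:
        return 0.0  # no ground-truth statements extracted
    return sum(verdicts) / len(verdicts)

# Two ground-truth statements, one supported by the context:
print(recall_score([True, False]))  # 0.5
```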

Configuration

Parameter | Type           | Default  | Description
mode      | EvaluationMode | GRANULAR | Evaluation detail level

Ground Truth Focus

Unlike Contextual Relevancy (which asks "is this chunk relevant?"), Recall asks "is this expected fact present?" It measures retrieval from the answer's perspective.
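This directional difference can be sketched with a toy support check. Here a naive substring match stands in for the LLM judge; `supports` and `contextual_recall` are illustrative helpers, not part of the library:

```python
# Toy support check -- a naive substring match standing in for the LLM
# judge, purely to show which direction the metric iterates in.
def supports(statement: str, chunks: list[str]) -> bool:
    return any(statement.lower() in chunk.lower() for chunk in chunks)

def contextual_recall(gt_statements: list[str], chunks: list[str]) -> float:
    # Recall iterates over the ANSWER's facts: is each one in the context?
    return sum(supports(s, chunks) for s in gt_statements) / len(gt_statements)

chunks = ["Paris is the capital and largest city of France."]
facts = ["Paris is the capital", "a population of about 2 million"]
print(contextual_recall(facts, chunks))  # 0.5 -- the population fact is missing
```

Contextual Relevancy would instead loop over the chunks and ask whether each one is relevant, so the two metrics can disagree: context can be all-relevant yet incomplete.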


Code Examples

from axion.metrics import ContextualRecall
from axion.dataset import DatasetItem

metric = ContextualRecall()

item = DatasetItem(
    expected_output="Paris is the capital of France. It has a population of about 2 million.",
    retrieved_content=[
        "Paris is the capital and largest city of France.",
        "The Eiffel Tower is located in Paris.",
    ],
)

result = await metric.execute(item)
print(result.pretty())
# Score: 0.5 (capital fact supported, population fact missing)

from axion.metrics import ContextualRecall
from axion.dataset import DatasetItem
metric = ContextualRecall()

item = DatasetItem(
    expected_output="Python was created by Guido van Rossum in 1991.",
    retrieved_content=[
        "Python is a programming language created by Guido van Rossum.",
        "Python was first released in 1991.",
        "Python emphasizes code readability.",
    ],
)

result = await metric.execute(item)
# Score: 1.0 (both creator and year facts are in context)

from axion.metrics import ContextualRecall
from axion.runners import MetricRunner

metric = ContextualRecall()
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)

for item_result in results:
    print(f"Score: {item_result.score}")
    print(f"Supported: {item_result.signals.supported_gt_statements}/{item_result.signals.total_gt_statements}")
    for stmt in item_result.signals.statement_breakdown:
        status = "✅" if stmt.is_supported else "❌"
        print(f"  {status} {stmt.statement_text}")

Metric Diagnostics

Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given, with no black boxes.

result = await metric.execute(item)
print(result.pretty())      # Human-readable summary
result.signals              # Full diagnostic breakdown
📊 ContextualRecallResult Structure
ContextualRecallResult(
{
    "recall_score": 0.5,
    "total_gt_statements": 2,
    "supported_gt_statements": 1,
    "statement_breakdown": [
        {
            "statement_text": "Paris is the capital of France",
            "is_supported": true
        },
        {
            "statement_text": "Paris has a population of about 2 million",
            "is_supported": false
        }
    ]
}
)

Signal Fields

Field                   | Type  | Description
recall_score            | float | Ratio of supported statements (0.0-1.0)
total_gt_statements     | int   | Factual statements extracted from the expected output
supported_gt_statements | int   | Statements found in the retrieved context
statement_breakdown     | List  | Per-statement verdict details

Statement Breakdown Fields

Field          | Type | Description
statement_text | str  | The ground truth statement
is_supported   | bool | Whether the statement is supported by the context

Example Scenarios

✅ Scenario 1: Perfect Recall (Score: 1.0)

All Facts Retrieved

Expected Output:

"The Great Wall of China is over 13,000 miles long. It was built over many centuries, starting in the 7th century BC."

Retrieved Context:

  1. "The Great Wall of China stretches over 13,000 miles."
  2. "Construction began in the 7th century BC."
  3. "Multiple dynasties contributed to building the wall over centuries."

Analysis:

Statement                  | Verdict
Over 13,000 miles long     | ✅ Supported
Built over many centuries  | ✅ Supported
Starting in 7th century BC | ✅ Supported

Final Score: 3 / 3 = 1.0

⚠️ Scenario 2: Partial Recall (Score: 0.33)

Missing Information

Expected Output:

"Water boils at 100°C at sea level. At higher altitudes, it boils at lower temperatures due to reduced pressure."

Retrieved Context:

  1. "Water boils at 100 degrees Celsius under standard conditions."
  2. "Water is essential for life on Earth."

Analysis:

Statement                               | Verdict
Boils at 100°C at sea level             | ✅ Supported
Higher altitudes = lower boiling point  | ❌ Not found
Due to reduced pressure                 | ❌ Not found

Final Score: 1 / 3 = 0.33

The altitude/pressure relationship wasn't retrieved.

❌ Scenario 3: Poor Recall (Score: 0.0)

Critical Information Missing

Expected Output:

"Einstein developed the theory of relativity and won the Nobel Prize for the photoelectric effect."

Retrieved Context:

  1. "Albert Einstein was a famous physicist."
  2. "Einstein was born in Germany in 1879."

Analysis:

Statement                      | Verdict
Developed theory of relativity | ❌ Not found
Won Nobel Prize                | ❌ Not found
For photoelectric effect       | ❌ Not found

Final Score: 0 / 3 = 0.0

None of the key facts from the expected answer were retrieved.


Why It Matters

πŸ” Completeness Check

Ensures retrieval captures all necessary information, not just some of it.

🎯 Answer-Focused

Evaluates retrieval from the answer's perspective: did we get what's needed to answer correctly?

πŸ› Debug Missing Info

Identifies exactly which expected facts weren't retrieved, guiding retrieval improvements.
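A sketch of such a debugging pass, using mocked signals shaped like the documented ContextualRecallResult fields (in real use, this object comes from `result.signals` rather than being built by hand):

```python
from types import SimpleNamespace

# Mocked signals shaped like the documented ContextualRecallResult
# fields; in real use this object comes from `result.signals`.
signals = SimpleNamespace(
    recall_score=0.5,
    statement_breakdown=[
        SimpleNamespace(statement_text="Paris is the capital of France",
                        is_supported=True),
        SimpleNamespace(statement_text="Paris has a population of about 2 million",
                        is_supported=False),
    ],
)

# Surface exactly which expected facts the retriever failed to return --
# these are the gaps to target when tuning chunking or retrieval.
missing = [s.statement_text
           for s in signals.statement_breakdown
           if not s.is_supported]
print(missing)  # ['Paris has a population of about 2 million']
```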


Quick Reference

TL;DR

Contextual Recall = Does the retrieved context contain all facts from the expected answer?

  • Use it when: You have ground truth and want to measure retrieval completeness
  • Score interpretation: Higher = more expected facts found in context
  • Key insight: Low recall means the retriever missed important information