Contextual Sufficiency¶
LLM-Powered · Knowledge · Single Turn · Retrieval
At a Glance¶
- **Score Range:** 0.0 or 1.0 (binary sufficiency verdict)
- **Default Threshold:** 0.5 (pass/fail cutoff)
- **Required Inputs:** `query`, `retrieved_content` (no answer required)
What It Measures¶
Contextual Sufficiency evaluates whether the retrieved context contains enough information to fully answer the user's query. Unlike other metrics that measure partial coverage, this is a binary judgment: either the context is sufficient or it isn't.
| Score | Interpretation |
|---|---|
| 1.0 | Context is sufficient to answer the query |
| 0.0 | Context is insufficient—information missing |
**Use it for:**

- Diagnosing retrieval quality
- Testing retrieval before generation
- Identifying information gaps
- Deciding when to expand search

**Avoid it when you:**

- Need granular coverage scores
- Are evaluating answer quality
- Are comparing retrieval strategies
- Need partial credit
RAG Evaluation Suite
Contextual Sufficiency asks: "Is there enough context to answer this question?"
Related retrieval metrics:
- Contextual Relevancy: Are chunks relevant?
- Contextual Recall: Are expected facts present?
- Contextual Utilization: Was the context actually used?
How It Works¶
The metric uses an LLM to make a binary judgment about context sufficiency.
Step-by-Step Process¶
```mermaid
flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[Retrieved Context]
    end
    subgraph JUDGE["⚖️ Sufficiency Judgment"]
        C[RAGAnalyzer Engine]
        D["Can this context answer the query?"]
        E["Binary Verdict"]
    end
    subgraph OUTPUT["📊 Result"]
        F["1.0 = Sufficient"]
        G["0.0 = Insufficient"]
        H["Reasoning Provided"]
    end
    A & B --> C
    C --> D
    D --> E
    E --> F & G
    F & G --> H
    style INPUT stroke:#1E3A5F,stroke-width:2px
    style JUDGE stroke:#f59e0b,stroke-width:2px
    style OUTPUT stroke:#10b981,stroke-width:2px
    style E fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff
```
A single binary verdict covers the entire context:

- **1.0 (Sufficient):** Context contains all necessary information to answer the query completely.
- **0.0 (Insufficient):** Context is missing critical information needed to answer the query.
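The judgment flow above can be sketched with a stand-in judge. This is illustrative only: the real metric delegates the verdict to an LLM via the RAGAnalyzer engine, and the keyword check below (`judge_sufficiency`, `required_facts`) is a hypothetical stand-in, not a library API.

```python
def judge_sufficiency(required_facts: list[str], contexts: list[str]) -> tuple[float, str]:
    """Toy stand-in for the LLM judge: binary verdict plus reasoning."""
    text = " ".join(contexts).lower()
    missing = [fact for fact in required_facts if fact.lower() not in text]
    if missing:
        return 0.0, "Context is missing: " + ", ".join(missing)
    return 1.0, "Context covers all required facts."

# Sufficient: both facts appear in the retrieved text
score, reason = judge_sufficiency(
    ["100 degrees", "212 degrees"],
    ["Water boils at 100 degrees Celsius.", "That is 212 degrees Fahrenheit."],
)
```

Note that the output is always one verdict for the whole context, never a per-chunk score.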
Diagnostic Purpose
This metric helps diagnose retrieval issues independent of generation. If sufficiency is low but faithfulness is high, your retriever needs improvement.
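That diagnostic split can be expressed as a small triage rule. A minimal sketch; the 0.5 faithfulness cutoff and the `diagnose` helper are assumptions for illustration, not library defaults:

```python
def diagnose(sufficiency: float, faithfulness: float) -> str:
    """Triage a failure: retrieval gap vs. generation problem (illustrative)."""
    if sufficiency == 0.0:
        # Missing context: no generator can answer well, fix the retriever first
        return "retrieval gap"
    if faithfulness < 0.5:
        # Context was sufficient but the answer strayed from it
        return "generation problem"
    return "healthy"
```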
Configuration¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `mode` | `EvaluationMode` | `GRANULAR` | Evaluation detail level |
Binary by Design
Unlike other metrics that provide granular scores, Sufficiency is intentionally binary. For partial coverage scores, use Contextual Recall or Contextual Relevancy.
Code Examples¶
```python
from axion.metrics import ContextualSufficiency
from axion.dataset import DatasetItem

metric = ContextualSufficiency()

item = DatasetItem(
    query="What is the boiling point of water?",
    retrieved_content=[
        "Water boils at 100 degrees Celsius at sea level.",
        "This is equivalent to 212 degrees Fahrenheit.",
    ],
)

result = await metric.execute(item)
print(result.pretty())
# Score: 1.0 (context is sufficient)
```
```python
from axion.metrics import ContextualSufficiency
from axion.dataset import DatasetItem

metric = ContextualSufficiency()

item = DatasetItem(
    query="What is the boiling point of water at high altitude?",
    retrieved_content=[
        "Water boils at 100 degrees Celsius at sea level.",
    ],
)

result = await metric.execute(item)
# Score: 0.0 (missing altitude information)
print(result.signals.reasoning)
# "Context only mentions sea level; no information about altitude effects."
```
```python
from axion.metrics import ContextualSufficiency
from axion.runners import MetricRunner

metric = ContextualSufficiency()
runner = MetricRunner(metrics=[metric])

# `dataset` is a collection of DatasetItem objects prepared elsewhere
results = await runner.run(dataset)

sufficient_count = sum(1 for r in results if r.score == 1.0)
print(f"Sufficient: {sufficient_count}/{len(results)}")

for item_result in results:
    if item_result.score == 0.0:
        print(f"⚠️ Insufficient for: {item_result.signals.query[:50]}...")
        print(f"   Reason: {item_result.signals.reasoning}")
```
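Because verdicts are binary, batch results reduce to a single sufficiency rate. Sketched below over plain floats; `sufficiency_rate` is a hypothetical helper, not part of the library:

```python
def sufficiency_rate(scores: list[float]) -> float:
    """Fraction of evaluated items judged sufficient (score == 1.0)."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s == 1.0) / len(scores)

print(sufficiency_rate([1.0, 0.0, 1.0, 1.0]))  # 0.75
```

Tracking this rate over retriever changes gives a quick regression signal for retrieval quality.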
Metric Diagnostics¶
Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given—no black boxes.
```python
result = await metric.execute(item)

print(result.pretty())  # Human-readable summary
result.signals          # Full diagnostic breakdown
```
📊 ContextualSufficiencyResult Structure
```
ContextualSufficiencyResult(
    {
        "sufficiency_score": 1.0,
        "is_sufficient": true,
        "reasoning": "The context fully addresses the query by providing the boiling point of water (100°C) and its Fahrenheit equivalent (212°F).",
        "query": "What is the boiling point of water?",
        "context": "Water boils at 100 degrees Celsius at sea level. This is equivalent to 212 degrees Fahrenheit."
    }
)
```
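The payload is a flat mapping, so downstream tooling can treat it as plain data. One useful invariant of the binary design, sketched with a dict standing in for the real signals object:

```python
signals = {
    "sufficiency_score": 1.0,
    "is_sufficient": True,
    "reasoning": "The context fully addresses the query.",
}

# Binary invariant: the boolean flag and the score always agree
assert signals["is_sufficient"] == (signals["sufficiency_score"] == 1.0)
```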
Signal Fields¶
| Field | Type | Description |
|---|---|---|
| `sufficiency_score` | `float` | Binary score (1.0 or 0.0) |
| `is_sufficient` | `bool` | Whether context is sufficient |
| `reasoning` | `str` | Explanation for the verdict |
| `query` | `str` | The user query (preview) |
| `context` | `str` | The retrieved context (preview) |
Example Scenarios¶
✅ Scenario 1: Sufficient Context (Score: 1.0)
Complete Information
Query:
"Who invented the telephone and when?"
Retrieved Context:
"Alexander Graham Bell invented the telephone in 1876. He was granted the patent on March 7th of that year."
Analysis:
- ✅ Inventor identified: Alexander Graham Bell
- ✅ Year provided: 1876
- ✅ Additional detail: Patent date
Verdict: Sufficient
Reasoning: "The context directly answers both parts of the query—who (Alexander Graham Bell) and when (1876)."
Final Score: 1.0
❌ Scenario 2: Insufficient - Missing Key Info (Score: 0.0)
Critical Information Missing
Query:
"What are the side effects of aspirin?"
Retrieved Context:
"Aspirin is a common pain reliever. It belongs to a class of drugs called NSAIDs. It can be purchased over the counter."
Analysis:
- ✅ Drug identification: Correct
- ✅ Drug class: NSAIDs
- ❌ Side effects: Not mentioned
Verdict: Insufficient
Reasoning: "The context describes what aspirin is but does not mention any side effects, which is the core of the query."
Final Score: 0.0
❌ Scenario 3: Insufficient - Partial Answer (Score: 0.0)
Incomplete Coverage
Query:
"Compare the populations of Tokyo and New York City."
Retrieved Context:
"Tokyo is the capital of Japan with a metropolitan population of over 37 million people, making it the world's most populous metropolitan area."
Analysis:
- ✅ Tokyo population: Provided
- ❌ NYC population: Missing
- ❌ Comparison: Cannot be made
Verdict: Insufficient
Reasoning: "Context only provides Tokyo's population. NYC population is missing, making a comparison impossible."
Final Score: 0.0
Why It Matters¶
- Quickly identify whether poor answers stem from insufficient retrieval rather than generation quality.
- Use the verdict as a signal to expand search or trigger alternative retrieval strategies.
- Evaluate context before generating; don't waste tokens on insufficient information.
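The "expand search" and "evaluate before generating" points above combine into a simple retry loop. A minimal sketch, assuming hypothetical `retrieve` and `judge` callables rather than library APIs:

```python
def retrieve_until_sufficient(query, retrieve, judge, max_rounds=3):
    """Widen retrieval until the judge returns a sufficient verdict."""
    contexts = []
    for round_no in range(1, max_rounds + 1):
        contexts = retrieve(query, top_k=5 * round_no)  # widen each round
        score, _ = judge(query, contexts)
        if score == 1.0:
            return contexts  # sufficient: safe to spend generation tokens
    return contexts  # best effort after max_rounds
```

Capping the rounds keeps the fallback bounded when no retrieval configuration can satisfy the query.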
Quick Reference¶
TL;DR
Contextual Sufficiency = Is there enough context to fully answer the query?
- Use it when: Diagnosing retrieval gaps or deciding to expand search
- Score interpretation: 1.0 = sufficient, 0.0 = insufficient (binary)
- Key insight: Identifies "missing information" problems in retrieval
- API Reference
- Related Metrics: Contextual Recall · Contextual Relevancy · Contextual Utilization