Contextual Sufficiency¶
LLM-Powered · Knowledge · Single Turn · Retrieval
At a Glance¶
- **Score Range:** 0.0 or 1.0 (binary sufficiency verdict)
- **Default Threshold:** 0.5 (pass/fail cutoff)
- **Required Inputs:** `query`, `retrieved_content` (no answer required)
What It Measures¶
Contextual Sufficiency evaluates whether the retrieved context contains enough information to fully answer the user's query. Unlike other metrics that measure partial coverage, this is a binary judgment: either the context is sufficient or it isn't.
| Score | Interpretation |
|---|---|
| 1.0 | Context is sufficient to answer the query |
| 0.0 | Context is insufficient—information missing |
**Use it for:**

- Diagnosing retrieval quality
- Testing retrieval before generation
- Identifying information gaps
- Deciding when to expand search

**Avoid it when you:**

- Need granular coverage scores
- Are evaluating answer quality
- Are comparing retrieval strategies
- Need partial credit
RAG Evaluation Suite
Contextual Sufficiency asks: "Is there enough context to answer this question?"
Related retrieval metrics:
- Contextual Relevancy: Are chunks relevant?
- Contextual Recall: Are expected facts present?
- Contextual Utilization: Was the context actually used?
How It Works¶
The metric uses an LLM to make a binary judgment about context sufficiency.
Step-by-Step Process¶
```mermaid
flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[Retrieved Context]
    end
    subgraph JUDGE["⚖️ Sufficiency Judgment"]
        C[RAGAnalyzer Engine]
        D["Can this context answer the query?"]
        E["Binary Verdict"]
    end
    subgraph OUTPUT["📊 Result"]
        F["1.0 = Sufficient"]
        G["0.0 = Insufficient"]
        H["Reasoning Provided"]
    end
    A & B --> C
    C --> D
    D --> E
    E --> F & G
    F & G --> H
    style INPUT stroke:#1E3A5F,stroke-width:2px
    style JUDGE stroke:#f59e0b,stroke-width:2px
    style OUTPUT stroke:#10b981,stroke-width:2px
    style E fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff
```
A single binary verdict covers the entire context:

- **1.0 (Sufficient):** Context contains all necessary information to answer the query completely.
- **0.0 (Insufficient):** Context is missing critical information needed to answer the query.
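The judgment flow above can be sketched with a stand-in judge. This is illustrative only: the real metric delegates the verdict to an LLM via the RAGAnalyzer engine, and the keyword check below (`judge_sufficiency`, `required_facts`) is a hypothetical stand-in, not a library API.

```python
def judge_sufficiency(required_facts: list[str], contexts: list[str]) -> tuple[float, str]:
    """Toy stand-in for the LLM judge: binary verdict plus reasoning."""
    text = " ".join(contexts).lower()
    missing = [fact for fact in required_facts if fact.lower() not in text]
    if missing:
        return 0.0, "Context is missing: " + ", ".join(missing)
    return 1.0, "Context covers all required facts."

# Sufficient: both facts appear in the retrieved text
score, reason = judge_sufficiency(
    ["100 degrees", "212 degrees"],
    ["Water boils at 100 degrees Celsius.", "That is 212 degrees Fahrenheit."],
)
```

Note that the output is always one verdict for the whole context, never a per-chunk score.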
Diagnostic Purpose
This metric helps diagnose retrieval issues independent of generation. If sufficiency is low but faithfulness is high, your retriever needs improvement.
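That diagnostic split can be expressed as a small triage rule. A minimal sketch; the 0.5 faithfulness cutoff and the `diagnose` helper are assumptions for illustration, not library defaults:

```python
def diagnose(sufficiency: float, faithfulness: float) -> str:
    """Triage a failure: retrieval gap vs. generation problem (illustrative)."""
    if sufficiency == 0.0:
        # Missing context: no generator can answer well, fix the retriever first
        return "retrieval gap"
    if faithfulness < 0.5:
        # Context was sufficient but the answer strayed from it
        return "generation problem"
    return "healthy"
```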
Configuration¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `mode` | `EvaluationMode` | `GRANULAR` | Evaluation detail level |
Binary by Design
Unlike other metrics that provide granular scores, Sufficiency is intentionally binary. For partial coverage scores, use Contextual Recall or Contextual Relevancy.
Code Examples¶
```python
from axion.metrics import ContextualSufficiency
from axion.dataset import DatasetItem

metric = ContextualSufficiency()

item = DatasetItem(
    query="What is the boiling point of water?",
    retrieved_content=[
        "Water boils at 100 degrees Celsius at sea level.",
        "This is equivalent to 212 degrees Fahrenheit.",
    ],
)

result = await metric.execute(item)
print(result.pretty())
# Score: 1.0 (context is sufficient)
```
```python
from axion.metrics import ContextualSufficiency
from axion.dataset import DatasetItem

metric = ContextualSufficiency()

item = DatasetItem(
    query="What is the boiling point of water at high altitude?",
    retrieved_content=[
        "Water boils at 100 degrees Celsius at sea level.",
    ],
)

result = await metric.execute(item)
# Score: 0.0 (missing altitude information)
print(result.signals.reasoning)
# "Context only mentions sea level; no information about altitude effects."
```
```python
from axion.metrics import ContextualSufficiency
from axion.runners import MetricRunner

metric = ContextualSufficiency()
runner = MetricRunner(metrics=[metric])

# `dataset` is a collection of DatasetItem objects prepared elsewhere
results = await runner.run(dataset)

sufficient_count = sum(1 for r in results if r.score == 1.0)
print(f"Sufficient: {sufficient_count}/{len(results)}")

for item_result in results:
    if item_result.score == 0.0:
        print(f"⚠️ Insufficient for: {item_result.signals.query[:50]}...")
        print(f"   Reason: {item_result.signals.reasoning}")
```
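Because verdicts are binary, batch results reduce to a single sufficiency rate. Sketched below over plain floats; `sufficiency_rate` is a hypothetical helper, not part of the library:

```python
def sufficiency_rate(scores: list[float]) -> float:
    """Fraction of evaluated items judged sufficient (score == 1.0)."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s == 1.0) / len(scores)

print(sufficiency_rate([1.0, 0.0, 1.0, 1.0]))  # 0.75
```

Tracking this rate over retriever changes gives a quick regression signal for retrieval quality.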
Metric Diagnostics¶
Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given—no black boxes.
```python
result = await metric.execute(item)

print(result.pretty())  # Human-readable summary
result.signals          # Full diagnostic breakdown
```
📊 ContextualSufficiencyResult Structure
```
ContextualSufficiencyResult(
    {
        "sufficiency_score": 1.0,
        "is_sufficient": true,
        "reasoning": "The context fully addresses the query by providing the boiling point of water (100°C) and its Fahrenheit equivalent (212°F).",
        "query": "What is the boiling point of water?",
        "context": "Water boils at 100 degrees Celsius at sea level. This is equivalent to 212 degrees Fahrenheit."
    }
)
```
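The payload is a flat mapping, so downstream tooling can treat it as plain data. One useful invariant of the binary design, sketched with a dict standing in for the real signals object:

```python
signals = {
    "sufficiency_score": 1.0,
    "is_sufficient": True,
    "reasoning": "The context fully addresses the query.",
}

# Binary invariant: the boolean flag and the score always agree
assert signals["is_sufficient"] == (signals["sufficiency_score"] == 1.0)
```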
Signal Fields¶
| Field | Type | Description |
|---|---|---|
| `sufficiency_score` | `float` | Binary score (1.0 or 0.0) |
| `is_sufficient` | `bool` | Whether context is sufficient |
| `reasoning` | `str` | Explanation for the verdict |
| `query` | `str` | The user query (preview) |
| `context` | `str` | The retrieved context (preview) |
Example Scenarios¶
✅ Scenario 1: Sufficient Context (Score: 1.0)
Complete Information
Query:
"Who invented the telephone and when?"
Retrieved Context:
"Alexander Graham Bell invented the telephone in 1876. He was granted the patent on March 7th of that year."
Analysis:
- ✅ Inventor identified: Alexander Graham Bell
- ✅ Year provided: 1876
- ✅ Additional detail: Patent date
Verdict: Sufficient
Reasoning: "The context directly answers both parts of the query—who (Alexander Graham Bell) and when (1876)."
Final Score: 1.0
❌ Scenario 2: Insufficient - Missing Key Info (Score: 0.0)
Critical Information Missing
Query:
"What are the side effects of aspirin?"
Retrieved Context:
"Aspirin is a common pain reliever. It belongs to a class of drugs called NSAIDs. It can be purchased over the counter."
Analysis:
- ✅ Drug identification: Correct
- ✅ Drug class: NSAIDs
- ❌ Side effects: Not mentioned
Verdict: Insufficient
Reasoning: "The context describes what aspirin is but does not mention any side effects, which is the core of the query."
Final Score: 0.0
❌ Scenario 3: Insufficient - Partial Answer (Score: 0.0)
Incomplete Coverage
Query:
"Compare the populations of Tokyo and New York City."
Retrieved Context:
"Tokyo is the capital of Japan with a metropolitan population of over 37 million people, making it the world's most populous metropolitan area."
Analysis:
- ✅ Tokyo population: Provided
- ❌ NYC population: Missing
- ❌ Comparison: Cannot be made
Verdict: Insufficient
Reasoning: "Context only provides Tokyo's population. NYC population is missing, making a comparison impossible."
Final Score: 0.0
Why It Matters¶
- Quickly identify whether poor answers stem from insufficient retrieval rather than generation quality.
- Use the verdict as a signal to expand search or trigger alternative retrieval strategies.
- Evaluate context before generating; don't waste tokens on insufficient information.
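The "expand search" and "evaluate before generating" points above combine into a simple retry loop. A minimal sketch, assuming hypothetical `retrieve` and `judge` callables rather than library APIs:

```python
def retrieve_until_sufficient(query, retrieve, judge, max_rounds=3):
    """Widen retrieval until the judge returns a sufficient verdict."""
    contexts = []
    for round_no in range(1, max_rounds + 1):
        contexts = retrieve(query, top_k=5 * round_no)  # widen each round
        score, _ = judge(query, contexts)
        if score == 1.0:
            return contexts  # sufficient: safe to spend generation tokens
    return contexts  # best effort after max_rounds
```

Capping the rounds keeps the fallback bounded when no retrieval configuration can satisfy the query.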
Quick Reference¶
TL;DR
Contextual Sufficiency = Is there enough context to fully answer the query?
- Use it when: Diagnosing retrieval gaps or deciding to expand search
- Score interpretation: 1.0 = sufficient, 0.0 = insufficient (binary)
- Key insight: Identifies "missing information" problems in retrieval
- API Reference
- Related Metrics: Contextual Recall · Contextual Relevancy · Contextual Utilization