Contextual Utilization¶

LLM-Powered · Knowledge · Single-Turn · Retrieval

At a Glance¶

- **Score Range:** 0.0 – 1.0 (ratio of utilized chunks)
- **Default Threshold:** 0.5 (pass/fail cutoff)
- **Required Inputs:** `query`, `actual_output`, `retrieved_content` (answer + context required)
What It Measures¶
Contextual Utilization measures the efficiency of context usage—what proportion of relevant retrieved chunks were actually used in the generated answer. Low utilization means relevant information was retrieved but ignored.
| Score | Interpretation |
|---|---|
| 1.0 | All relevant chunks were utilized |
| 0.7+ | Good utilization, minor waste |
| 0.5 | Half of relevant chunks unused |
| < 0.5 | Significant waste—relevant info ignored |
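Scores are compared against the default threshold of 0.5 to produce a pass/fail verdict. A minimal sketch of that cutoff logic (the `passes` helper is illustrative, not part of the axion API; the `>=` comparison at exactly 0.5 is an assumption):

```python
DEFAULT_THRESHOLD = 0.5  # default pass/fail cutoff from the metric config


def passes(score: float, threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Return True when the utilization score meets the threshold."""
    return score >= threshold


print(passes(0.67))  # True  — 2 of 3 relevant chunks used
print(passes(0.25))  # False — only 1 of 4 relevant chunks used
```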
When to Use¶

- Optimizing context window usage
- Debugging incomplete answers
- Identifying generation issues
- Measuring retrieval efficiency

When Not to Use¶

- No `actual_output` is available
- Evaluating retrieval only
- Testing factual correctness
- All chunks are equally important
Utilization vs Faithfulness
Contextual Utilization asks: "Was the relevant context actually used?" Faithfulness asks: "Is the answer grounded in context?"
High Faithfulness + Low Utilization = Answer is correct but incomplete (missed relevant info).
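Reading the two scores together is what makes the distinction actionable. A hypothetical diagnosis helper (the function name, strings, and shared 0.5 threshold are illustrative, not part of axion) might look like:

```python
def diagnose(faithfulness: float, utilization: float, threshold: float = 0.5) -> str:
    """Map a (faithfulness, utilization) score pair to a rough diagnosis."""
    grounded = faithfulness >= threshold
    used = utilization >= threshold
    if grounded and used:
        return "answer is grounded and uses the relevant context"
    if grounded and not used:
        return "answer is correct but incomplete (missed relevant info)"
    if not grounded and used:
        return "context was used but the answer drifts from it"
    return "answer neither uses nor stays grounded in the context"


print(diagnose(0.9, 0.3))  # high faithfulness + low utilization
```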
How It Works¶
The metric evaluates which relevant chunks were actually utilized in the answer.
Step-by-Step Process¶
```mermaid
flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[Retrieved Chunks]
        C[Generated Answer]
    end
    subgraph FILTER["🔍 Step 1: Identify Relevant"]
        D[Check Chunk Relevancy]
        E["Relevant Chunks Only"]
    end
    subgraph CHECK["⚖️ Step 2: Check Utilization"]
        F[Compare to Answer]
        G1["Chunk 1: ✓/✗"]
        G2["Chunk 2: ✓/✗"]
        G3["Chunk 3: ✓/✗"]
        GN["..."]
    end
    subgraph SCORE["📊 Step 3: Scoring"]
        H["Count Utilized"]
        I["Calculate Ratio"]
        J["Final Score"]
    end
    A & B --> D
    D --> E
    E --> F
    C --> F
    F --> G1 & G2 & G3 & GN
    G1 & G2 & G3 & GN --> H
    H --> I
    I --> J
    style INPUT stroke:#1E3A5F,stroke-width:2px
    style FILTER stroke:#3b82f6,stroke-width:2px
    style CHECK stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style J fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff
```
Each relevant chunk is checked for utilization:

- **Utilized (✓):** information from the chunk appears in the generated answer.
- **Not utilized (✗):** the relevant chunk was ignored; its information does not appear in the answer.
Configuration¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `mode` | `EvaluationMode` | `GRANULAR` | Evaluation detail level |
Relevance Filtering
The metric first filters to only relevant chunks (using the same logic as Contextual Relevancy), then checks which of those were utilized. This means irrelevant chunks don't penalize or inflate the score.
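Under those assumptions the score itself is a simple ratio over the filtered set. A minimal sketch (in the real metric the per-chunk relevance and utilization verdicts come from the LLM judge; here they are passed in as booleans, and the function name is illustrative):

```python
def utilization_score(judgments):
    """judgments: one (is_relevant, is_utilized) pair per retrieved chunk.

    Irrelevant chunks are filtered out first, so they neither penalize
    nor inflate the score. Returns None when no chunk is relevant.
    """
    relevant = [utilized for is_relevant, utilized in judgments if is_relevant]
    if not relevant:
        return None
    return sum(relevant) / len(relevant)


# Green-tea example: 2 of 3 relevant chunks utilized; the fourth chunk
# (tea history) is irrelevant and does not affect the score.
score = utilization_score([(True, True), (True, True), (True, False), (False, False)])
print(round(score, 2))  # 0.67
```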
Code Examples¶
```python
from axion.metrics import ContextualUtilization
from axion.dataset import DatasetItem

metric = ContextualUtilization()

item = DatasetItem(
    query="What are the health benefits of green tea?",
    actual_output="Green tea contains antioxidants that reduce inflammation.",
    retrieved_content=[
        "Green tea is rich in antioxidants.",  # Relevant, utilized
        "Antioxidants help reduce inflammation.",  # Relevant, utilized
        "Green tea can boost metabolism.",  # Relevant, NOT utilized
        "Tea originated in China thousands of years ago.",  # Not relevant
    ],
)

result = await metric.execute(item)
print(result.pretty())
# Score: 2/3 = 0.67 (2 of 3 relevant chunks utilized)
```
```python
from axion.metrics import ContextualUtilization
from axion.dataset import DatasetItem

metric = ContextualUtilization()

item = DatasetItem(
    query="What is Python?",
    actual_output=(
        "Python is a high-level programming language created by "
        "Guido van Rossum, known for its readability."
    ),
    retrieved_content=[
        "Python is a high-level programming language.",  # Utilized
        "Guido van Rossum created Python.",  # Utilized
        "Python emphasizes code readability.",  # Utilized
    ],
)

result = await metric.execute(item)
# Score: 1.0 (all relevant chunks utilized)
```
```python
from axion.metrics import ContextualUtilization
from axion.runners import MetricRunner

metric = ContextualUtilization()
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)

for item_result in results:
    print(f"Utilization: {item_result.score:.0%}")
    print(
        f"Used: {item_result.signals.utilized_chunks}/"
        f"{item_result.signals.total_relevant_chunks}"
    )
    for chunk in item_result.signals.chunk_breakdown:
        status = "✅" if chunk.is_utilized else "❌"
        print(f"  {status} {chunk.chunk_text[:40]}...")
```
Metric Diagnostics¶
Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given—no black boxes.
```python
result = await metric.execute(item)
print(result.pretty())  # Human-readable summary
result.signals          # Full diagnostic breakdown
```
📊 ContextualUtilizationResult Structure
```
ContextualUtilizationResult(
    {
        "utilization_score": 0.67,
        "total_relevant_chunks": 3,
        "utilized_chunks": 2,
        "utilization_rate": "66.7%",
        "chunk_breakdown": [
            {
                "chunk_text": "Green tea is rich in antioxidants.",
                "is_utilized": true
            },
            {
                "chunk_text": "Antioxidants help reduce inflammation.",
                "is_utilized": true
            },
            {
                "chunk_text": "Green tea can boost metabolism.",
                "is_utilized": false
            }
        ]
    }
)
```
Signal Fields¶
| Field | Type | Description |
|---|---|---|
| `utilization_score` | `float` | Ratio of utilized chunks (0.0–1.0) |
| `total_relevant_chunks` | `int` | Number of relevant chunks in context |
| `utilized_chunks` | `int` | Chunks actually used in the answer |
| `utilization_rate` | `str` | Human-readable percentage |
| `chunk_breakdown` | `List` | Per-chunk details (relevant chunks only) |
Chunk Breakdown Fields¶

| Field | Type | Description |
|---|---|---|
| `chunk_text` | `str` | The relevant chunk content |
| `is_utilized` | `bool` | Whether the chunk was used in the answer |
Example Scenarios¶
✅ Scenario 1: Full Utilization (Score: 1.0)
All Relevant Info Used
Query:
"What is the boiling point of water?"
Retrieved Context:
- "Water boils at 100°C at sea level." ✅ Relevant
- "This is equivalent to 212°F." ✅ Relevant
- "Ice cream is a popular dessert." ❌ Not relevant
Generated Answer:
"Water boils at 100°C (212°F) at sea level."
Analysis:
| Relevant Chunk | Utilized |
|---|---|
| Boils at 100°C | ✅ Used |
| Equivalent to 212°F | ✅ Used |
Final Score: 2 / 2 = 1.0
⚠️ Scenario 2: Partial Utilization (Score: 0.67)
Relevant Info Ignored
Query:
"What are the benefits of exercise?"
Retrieved Context:
- "Exercise improves cardiovascular health." ✅ Relevant
- "Regular exercise boosts mood and energy." ✅ Relevant
- "Exercise helps with weight management." ✅ Relevant
- "Gyms offer various equipment." ❌ Not relevant
Generated Answer:
"Exercise improves cardiovascular health and boosts mood."
Analysis:
| Relevant Chunk | Utilized |
|---|---|
| Cardiovascular health | ✅ Used |
| Mood and energy | ✅ Used (partial) |
| Weight management | ❌ Not used |
Final Score: 2 / 3 = 0.67
Weight management benefit was retrieved but not mentioned.
❌ Scenario 3: Poor Utilization (Score: 0.25)
Most Relevant Info Wasted
Query:
"Explain the causes of World War I."
Retrieved Context:
- "Assassination of Archduke Franz Ferdinand triggered WWI." ✅ Relevant
- "Alliance systems escalated regional conflicts." ✅ Relevant
- "Nationalism and imperialism created tensions." ✅ Relevant
- "The war lasted from 1914 to 1918." ✅ Relevant
Generated Answer:
"World War I began after the assassination of Archduke Franz Ferdinand."
Analysis:
| Relevant Chunk | Utilized |
|---|---|
| Assassination | ✅ Used |
| Alliance systems | ❌ Not used |
| Nationalism/imperialism | ❌ Not used |
| Duration | ❌ Not used |
Final Score: 1 / 4 = 0.25
Only one cause mentioned despite retrieving multiple.
Why It Matters¶

- **Incomplete answers:** Low utilization often indicates answers that miss relevant information.
- **Wasted tokens:** Retrieved content costs tokens; low utilization means wasted context window space.
- **Debugging signal:** If retrieval is good but utilization is low, the problem is in generation, not retrieval.
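The token-cost point can be made concrete. A rough back-of-envelope sketch (the helper name and token counts are illustrative, not part of axion):

```python
def wasted_context_tokens(chunk_tokens, utilized):
    """Tokens spent on retrieved chunks whose content never reached the answer."""
    return sum(t for t, used in zip(chunk_tokens, utilized) if not used)


# Four chunks of ~120 tokens each; only the first was used in the answer,
# so three chunks' worth of context window was spent for nothing.
print(wasted_context_tokens([120, 120, 120, 120], [True, False, False, False]))  # 360
```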
Quick Reference¶
TL;DR
Contextual Utilization = Was the relevant retrieved context actually used in the answer?
- Use it when: Debugging incomplete answers or optimizing context usage
- Score interpretation: Higher = more efficient use of retrieved information
- Key insight: Measures generation efficiency, not retrieval quality
- API Reference
- Related Metrics: Faithfulness · Contextual Relevancy · Answer Completeness