
Contextual Utilization

Measure the efficiency of context usage in generation
LLM-Powered · Knowledge · Single Turn · Retrieval

At a Glance

🎯 Score Range: 0.0 to 1.0 (ratio of utilized chunks)
Default Threshold: 0.5 (pass/fail cutoff)
📋 Required Inputs: query, actual_output, retrieved_content (answer + context required)

What It Measures

Contextual Utilization measures the efficiency of context usage—what proportion of relevant retrieved chunks were actually used in the generated answer. Low utilization means relevant information was retrieved but ignored.

Score Interpretation

  • 1.0: All relevant chunks were utilized
  • 0.7+: Good utilization, minor waste
  • 0.5: Half of relevant chunks unused
  • < 0.5: Significant waste; relevant info ignored
✅ Use When
  • Optimizing context window usage
  • Debugging incomplete answers
  • Identifying generation issues
  • Measuring retrieval efficiency
❌ Don't Use When
  • No actual_output available
  • Evaluating retrieval only
  • Testing factual correctness
  • All chunks are equally important

Utilization vs Faithfulness

Contextual Utilization asks: "Was the relevant context actually used?" Faithfulness asks: "Is the answer grounded in context?"

High Faithfulness + Low Utilization = Answer is correct but incomplete (missed relevant info).
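
A minimal sketch of that case, following the same DatasetItem pattern used in the examples below (the query, answer, and chunks here are illustrative; exact scores depend on the judge model):

from axion.metrics import ContextualUtilization
from axion.dataset import DatasetItem

metric = ContextualUtilization()

# The answer is fully grounded in chunk 1 (faithful), but it ignores the
# other two relevant chunks, so utilization should come out low (about 1/3).
item = DatasetItem(
    query="What are the benefits of remote work?",
    actual_output="Remote work eliminates commuting time.",
    retrieved_content=[
        "Remote work eliminates commuting time.",            # Relevant, utilized
        "Remote workers report better work-life balance.",   # Relevant, NOT utilized
        "Companies save on office space costs.",             # Relevant, NOT utilized
    ],
)

result = await metric.execute(item)
# Faithful answer, but roughly 1/3 utilization: correct yet incomplete.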


How It Works

The metric evaluates which relevant chunks were actually utilized in the answer.

Step-by-Step Process

flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[Retrieved Chunks]
        C[Generated Answer]
    end

    subgraph FILTER["🔍 Step 1: Identify Relevant"]
        D[Check Chunk Relevancy]
        E["Relevant Chunks Only"]
    end

    subgraph CHECK["⚖️ Step 2: Check Utilization"]
        F[Compare to Answer]
        G1["Chunk 1: ✓/✗"]
        G2["Chunk 2: ✓/✗"]
        G3["Chunk 3: ✓/✗"]
        GN["..."]
    end

    subgraph SCORE["📊 Step 3: Scoring"]
        H["Count Utilized"]
        I["Calculate Ratio"]
        J["Final Score"]
    end

    A & B --> D
    D --> E
    E --> F
    C --> F
    F --> G1 & G2 & G3 & GN
    G1 & G2 & G3 & GN --> H
    H --> I
    I --> J

    style INPUT stroke:#1E3A5F,stroke-width:2px
    style FILTER stroke:#3b82f6,stroke-width:2px
    style CHECK stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style J fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff

Each relevant chunk is checked for utilization.

  • ✅ UTILIZED (1): Information from this chunk appears in the generated answer.
  • ❌ NOT UTILIZED (0): Relevant chunk was ignored; its information does not appear in the answer.

Score Formula

score = utilized_chunks / total_relevant_chunks
Only relevant chunks are counted—irrelevant chunks don't affect the score.
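
As a rough illustration only (not the library's internal implementation), the three steps reduce to filtering for relevance, checking each relevant chunk against the answer, and taking the ratio; the two callables below stand in for the LLM-powered judgments:

# Illustrative sketch; in the real metric both checks are LLM-powered.
def utilization_score(query, answer, chunks, is_relevant, is_utilized):
    # Step 1: keep only chunks relevant to the query
    relevant = [c for c in chunks if is_relevant(query, c)]
    if not relevant:
        return None  # no relevant chunks; the library may handle this edge case differently
    # Step 2: check which relevant chunks appear in the answer
    utilized = [c for c in relevant if is_utilized(answer, c)]
    # Step 3: ratio of utilized to relevant chunks
    return len(utilized) / len(relevant)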


Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| mode | EvaluationMode | GRANULAR | Evaluation detail level |

Relevance Filtering

The metric first filters to only relevant chunks (using the same logic as Contextual Relevancy), then checks which of those were utilized. This means irrelevant chunks don't penalize or inflate the score.
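
A construction sketch for changing the detail level; note that the import location of EvaluationMode is an assumption and may differ in your installation:

from axion.metrics import ContextualUtilization
# Import path for EvaluationMode is assumed; adjust to your installation.
from axion.metrics import EvaluationMode

# GRANULAR is already the default; it is passed explicitly here only for clarity.
metric = ContextualUtilization(mode=EvaluationMode.GRANULAR)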


Code Examples

from axion.metrics import ContextualUtilization
from axion.dataset import DatasetItem

metric = ContextualUtilization()

item = DatasetItem(
    query="What are the health benefits of green tea?",
    actual_output="Green tea contains antioxidants that reduce inflammation.",
    retrieved_content=[
        "Green tea is rich in antioxidants.",                    # Relevant, utilized
        "Antioxidants help reduce inflammation.",                # Relevant, utilized
        "Green tea can boost metabolism.",                       # Relevant, NOT utilized
        "Tea originated in China thousands of years ago.",       # Not relevant
    ],
)

result = await metric.execute(item)
print(result.pretty())
# Score: 2/3 = 0.67 (2 of 3 relevant chunks utilized)

from axion.metrics import ContextualUtilization
from axion.dataset import DatasetItem

metric = ContextualUtilization()

item = DatasetItem(
    query="What is Python?",
    actual_output="Python is a high-level programming language created by Guido van Rossum, known for its readability.",
    retrieved_content=[
        "Python is a high-level programming language.",          # Utilized
        "Guido van Rossum created Python.",                      # Utilized
        "Python emphasizes code readability.",                   # Utilized
    ],
)

result = await metric.execute(item)
# Score: 1.0 (all relevant chunks utilized)

from axion.metrics import ContextualUtilization
from axion.runners import MetricRunner

metric = ContextualUtilization()
runner = MetricRunner(metrics=[metric])

# `dataset` is an evaluation dataset of DatasetItem objects built elsewhere
results = await runner.run(dataset)

for item_result in results:
    print(f"Utilization: {item_result.score:.0%}")
    print(f"Used: {item_result.signals.utilized_chunks}/{item_result.signals.total_relevant_chunks}")
    for chunk in item_result.signals.chunk_breakdown:
        status = "✅" if chunk.is_utilized else "❌"
        print(f"  {status} {chunk.chunk_text[:40]}...")

Metric Diagnostics

Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given—no black boxes.

result = await metric.execute(item)
print(result.pretty())      # Human-readable summary
result.signals              # Full diagnostic breakdown
📊 ContextualUtilizationResult Structure
ContextualUtilizationResult(
{
    "utilization_score": 0.67,
    "total_relevant_chunks": 3,
    "utilized_chunks": 2,
    "utilization_rate": "66.7%",
    "chunk_breakdown": [
        {
            "chunk_text": "Green tea is rich in antioxidants.",
            "is_utilized": true
        },
        {
            "chunk_text": "Antioxidants help reduce inflammation.",
            "is_utilized": true
        },
        {
            "chunk_text": "Green tea can boost metabolism.",
            "is_utilized": false
        }
    ]
}
)

Signal Fields

| Field | Type | Description |
|---|---|---|
| utilization_score | float | Ratio of utilized chunks (0.0-1.0) |
| total_relevant_chunks | int | Relevant chunks in context |
| utilized_chunks | int | Chunks actually used in answer |
| utilization_rate | str | Human-readable percentage |
| chunk_breakdown | List | Per-chunk details (relevant chunks only) |

Chunk Breakdown Fields

| Field | Type | Description |
|---|---|---|
| chunk_text | str | The relevant chunk content |
| is_utilized | bool | Whether the chunk was used in the answer |
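
For instance, the breakdown makes it easy to collect the relevant chunks the answer skipped; this sketch relies only on the fields documented above:

result = await metric.execute(item)

# Relevant chunks the answer did not use: candidates for a follow-up answer,
# or a sign that the generation prompt needs adjusting.
missed = [
    chunk.chunk_text
    for chunk in result.signals.chunk_breakdown
    if not chunk.is_utilized
]
for text in missed:
    print(f"Unused relevant chunk: {text}")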

Example Scenarios

✅ Scenario 1: Full Utilization (Score: 1.0)

All Relevant Info Used

Query:

"What is the boiling point of water?"

Retrieved Context:

  1. "Water boils at 100°C at sea level." ✅ Relevant
  2. "This is equivalent to 212°F." ✅ Relevant
  3. "Ice cream is a popular dessert." ❌ Not relevant

Generated Answer:

"Water boils at 100°C (212°F) at sea level."

Analysis:

| Relevant Chunk | Utilized |
|---|---|
| Boils at 100°C | ✅ Used |
| Equivalent to 212°F | ✅ Used |

Final Score: 2 / 2 = 1.0

⚠️ Scenario 2: Partial Utilization (Score: 0.67)

Relevant Info Ignored

Query:

"What are the benefits of exercise?"

Retrieved Context:

  1. "Exercise improves cardiovascular health." ✅ Relevant
  2. "Regular exercise boosts mood and energy." ✅ Relevant
  3. "Exercise helps with weight management." ✅ Relevant
  4. "Gyms offer various equipment." ❌ Not relevant

Generated Answer:

"Exercise improves cardiovascular health and boosts mood."

Analysis:

| Relevant Chunk | Utilized |
|---|---|
| Cardiovascular health | ✅ Used |
| Mood and energy | ✅ Used (partial) |
| Weight management | ❌ Not used |

Final Score: 2 / 3 = 0.67

Weight management benefit was retrieved but not mentioned.

❌ Scenario 3: Poor Utilization (Score: 0.25)

Most Relevant Info Wasted

Query:

"Explain the causes of World War I."

Retrieved Context:

  1. "Assassination of Archduke Franz Ferdinand triggered WWI." ✅ Relevant
  2. "Alliance systems escalated regional conflicts." ✅ Relevant
  3. "Nationalism and imperialism created tensions." ✅ Relevant
  4. "The war lasted from 1914 to 1918." ✅ Relevant

Generated Answer:

"World War I began after the assassination of Archduke Franz Ferdinand."

Analysis:

| Relevant Chunk | Utilized |
|---|---|
| Assassination | ✅ Used |
| Alliance systems | ❌ Not used |
| Nationalism/imperialism | ❌ Not used |
| Duration | ❌ Not used |

Final Score: 1 / 4 = 0.25

Only one cause mentioned despite retrieving multiple.
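
Reproduced as code using the same pattern as the earlier examples; the 0.5 cutoff is the documented default threshold, checked manually here against the documented utilization_score signal:

from axion.metrics import ContextualUtilization
from axion.dataset import DatasetItem

metric = ContextualUtilization()

item = DatasetItem(
    query="Explain the causes of World War I.",
    actual_output="World War I began after the assassination of Archduke Franz Ferdinand.",
    retrieved_content=[
        "Assassination of Archduke Franz Ferdinand triggered WWI.",  # Utilized
        "Alliance systems escalated regional conflicts.",            # NOT utilized
        "Nationalism and imperialism created tensions.",             # NOT utilized
        "The war lasted from 1914 to 1918.",                         # NOT utilized
    ],
)

result = await metric.execute(item)
# Expected utilization around 1/4 = 0.25, well below the 0.5 default threshold.
if result.signals.utilization_score < 0.5:
    print("Most relevant context was retrieved but never used in the answer.")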


Why It Matters

🎯 Answer Completeness

Low utilization often indicates incomplete answers that miss relevant information.

💰 Efficiency

Retrieved content costs tokens. Low utilization = wasted context window space.

🔧 Debug Generation

If retrieval is good but utilization is low, the problem is in generation, not retrieval.
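
One way to act on this is to run a retrieval-focused metric next to Contextual Utilization on the same item and compare the two. The ContextualRelevancy class name below is an assumption based on the Contextual Relevancy metric referenced earlier, and it is assumed to expose the same pretty() helper; check your installation for the exact names:

from axion.metrics import ContextualUtilization
# Class name assumed from the "Contextual Relevancy" metric referenced above.
from axion.metrics import ContextualRelevancy

relevancy = ContextualRelevancy()
utilization = ContextualUtilization()

rel_result = await relevancy.execute(item)
util_result = await utilization.execute(item)

print(rel_result.pretty())   # How relevant was the retrieved context?
print(util_result.pretty())  # How much of the relevant context was used?
# High relevancy with low utilization points at the generation step
# (prompting, answer length limits), not at the retriever.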


Quick Reference

TL;DR

Contextual Utilization = Was the relevant retrieved context actually used in the answer?

  • Use it when: Debugging incomplete answers or optimizing context usage
  • Score interpretation: Higher = more efficient use of retrieved information
  • Key insight: Measures generation efficiency, not retrieval quality