
Answer Completeness

Measure how completely the response covers expected content
LLM-Powered Knowledge Single Turn

At a Glance

🎯
Score Range
0.0 ──────── 1.0
Coverage ratio
⚡
Default Threshold
0.5
Pass/fail cutoff
📋
Required Inputs
query actual_output expected_output
Reference answer required

What It Measures

Answer Completeness evaluates whether the response covers all the key aspects from the expected output. It answers: "Did the AI mention everything important from the reference answer?"

Score Interpretation
1.0 All aspects from expected output covered
0.7+ Most aspects covered, minor omissions
0.5 Half the expected content covered
< 0.5 Significant content missing
✅ Use When
  • You have reference answers
  • Completeness matters more than brevity
  • Testing comprehensive responses
  • Evaluating educational content
❌ Don't Use When
  • Brevity is preferred
  • Multiple valid answer formats
  • No expected_output available
  • Creative/generative tasks

See Also: Answer Criteria

Answer Completeness checks coverage of expected output aspects. Answer Criteria checks coverage of custom acceptance criteria.

Use Completeness when you have a reference answer; use Criteria for custom requirements.


How It Works

The metric extracts key aspects from the expected output and checks if each is covered in the actual response.

Step-by-Step Process

flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[AI Response]
        C[Expected Output]
    end

    subgraph EXTRACT["🔍 Step 1: Aspect Extraction"]
        D[Extract Aspects from Expected]
        E["Key Aspects List"]
    end

    subgraph CHECK["⚖️ Step 2: Coverage Check"]
        F[Check Each Aspect in Response]
        G["Covered / Not Covered"]
    end

    subgraph SCORE["📊 Step 3: Scoring"]
        H["Count Covered Aspects"]
        I["Calculate Ratio"]
        J["Final Score"]
    end

    A & B & C --> D
    D --> E
    E --> F
    B --> F
    F --> G
    G --> H
    H --> I
    I --> J

    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EXTRACT stroke:#3b82f6,stroke-width:2px
    style CHECK stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style J fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff

✅ COVERED
1

Aspect from expected output is present in the response.

❌ NOT COVERED
0

Aspect from expected output is missing from the response.

Score Formula

score = covered_aspects / total_aspects
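The three steps can be sketched in a few lines of plain Python. This is an illustration only: the naive substring check stands in for the LLM-based aspect extraction and coverage judgment the real metric performs.

```python
def completeness_score(aspects: list[str], response: str) -> float:
    """Fraction of expected aspects found in the response.

    Substring matching is a crude stand-in for the LLM coverage check.
    """
    if not aspects:
        return 0.0
    covered = sum(1 for aspect in aspects if aspect.lower() in response.lower())
    return covered / len(aspects)

# Aspects as extracted from the exercise example's expected_output
aspects = ["cardiovascular health", "muscle", "mood", "weight"]
response = "Exercise improves cardiovascular health and boosts mood."
print(completeness_score(aspects, response))  # 0.5 (2 of 4 covered)
```

The real metric matches aspects semantically, so paraphrases count as covered even when the exact words differ.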

Configuration

Parameter Type Default Description
use_expected_output bool True Use expected_output for aspect extraction
mode EvaluationMode GRANULAR Evaluation detail level

Alternative Mode

When use_expected_output=False, the metric uses sub-question decomposition instead of aspect extraction.
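Putting the parameters from the table together might look like this (a sketch based only on the parameter names documented above; the `EvaluationMode` import path is not shown on this page, so `mode` is left at its default):

```python
from axion.metrics import AnswerCompleteness

# Default: GRANULAR mode, aspects extracted from expected_output
metric = AnswerCompleteness()

# Alternative: sub-question decomposition instead of aspect extraction
metric_decomposed = AnswerCompleteness(use_expected_output=False)
```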


Code Examples

from axion.metrics import AnswerCompleteness
from axion.dataset import DatasetItem

metric = AnswerCompleteness()

item = DatasetItem(
    query="What are the benefits of exercise?",
    actual_output="Exercise improves cardiovascular health and boosts mood.",
    expected_output="Exercise improves cardiovascular health, strengthens muscles, boosts mood, and helps with weight management.",
)

result = await metric.execute(item)
print(result.pretty())
# Score: 0.5 (2 of 4 aspects covered)

Batch evaluation with MetricRunner:

from axion.metrics import AnswerCompleteness
from axion.runners import MetricRunner

metric = AnswerCompleteness()
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)

for item_result in results:
    print(f"Score: {item_result.score}")
    print(f"Covered: {item_result.signals.covered_aspects_count}/{item_result.signals.total_aspects_count}")

Metric Diagnostics

Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given, with no black boxes.

result = await metric.execute(item)
print(result.pretty())      # Human-readable summary
result.signals              # Full diagnostic breakdown
📊 AnswerCompletenessResult Structure
AnswerCompletenessResult(
{
    "score": 0.5,
    "covered_aspects_count": 2,
    "total_aspects_count": 4,
    "concept_coverage_score": 0.5,
    "aspect_breakdown": [
        {
            "aspect": "cardiovascular health improvement",
            "covered": true,
            "concepts_covered": ["cardiovascular health"],
            "reason": "Mentioned in response"
        },
        {
            "aspect": "muscle strengthening",
            "covered": false,
            "concepts_missing": ["muscles", "strength"],
            "reason": "Not mentioned in response"
        },
        {
            "aspect": "mood improvement",
            "covered": true,
            "concepts_covered": ["mood", "boosts"],
            "reason": "Mentioned in response"
        },
        {
            "aspect": "weight management",
            "covered": false,
            "concepts_missing": ["weight"],
            "reason": "Not mentioned in response"
        }
    ]
}
)

Signal Fields

Field Type Description
score float Overall completeness score
covered_aspects_count int Aspects found in response
total_aspects_count int Total aspects from expected output
aspect_breakdown List Per-aspect coverage details
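Since aspect_breakdown is an ordinary list of records, missing aspects can be pulled out with a comprehension. The sketch below uses plain dicts shaped like the result structure above as stand-ins for the real signals object:

```python
# Stand-in for result.signals.aspect_breakdown (see structure above)
aspect_breakdown = [
    {"aspect": "cardiovascular health improvement", "covered": True},
    {"aspect": "muscle strengthening", "covered": False},
    {"aspect": "mood improvement", "covered": True},
    {"aspect": "weight management", "covered": False},
]

# Surface what the response failed to mention
missing = [entry["aspect"] for entry in aspect_breakdown if not entry["covered"]]
print(missing)  # ['muscle strengthening', 'weight management']

# The score is just the covered ratio over the breakdown
score = sum(entry["covered"] for entry in aspect_breakdown) / len(aspect_breakdown)
print(score)  # 0.5
```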

Example Scenarios

✅ Scenario 1: Complete Coverage (Score: 1.0)

All Aspects Covered

Expected Output:

"Python is a high-level programming language known for readability, extensive libraries, and cross-platform support."

AI Response:

"Python is a high-level language with clean, readable syntax. It has a vast ecosystem of libraries and runs on Windows, Mac, and Linux."

Analysis:

Aspect Covered
High-level language ✅
Readability ✅
Extensive libraries ✅
Cross-platform ✅

Final Score: 4 / 4 = 1.0

⚠️ Scenario 2: Partial Coverage (Score: 0.6)

Some Aspects Missing

Expected Output:

"Our product offers: free shipping, 30-day returns, 24/7 support, price matching, and warranty."

AI Response:

"We provide free shipping on all orders and a 30-day return policy. Our support team is available around the clock."

Analysis:

Aspect Covered
Free shipping ✅
30-day returns ✅
24/7 support ✅
Price matching ❌
Warranty ❌

Final Score: 3 / 5 = 0.6

❌ Scenario 3: Poor Coverage (Score: 0.29)

Most Aspects Missing

Expected Output:

"The recipe requires flour, sugar, eggs, and butter. Preheat oven to 350°F. Mix ingredients, pour into pan, bake 25 minutes."

AI Response:

"You'll need flour and sugar."

Analysis:

Aspect Covered
Flour βœ…
Sugar βœ…
Eggs ❌
Butter ❌
Oven temperature ❌
Mixing instructions ❌
Baking time ❌

Final Score: 2 / 7 ≈ 0.29
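Applying the default 0.5 threshold from At a Glance turns each scenario score into a pass/fail verdict:

```python
THRESHOLD = 0.5  # default pass/fail cutoff

scenario_scores = {
    "complete coverage": 1.0,   # Scenario 1: 4/4
    "partial coverage": 3 / 5,  # Scenario 2: 3/5
    "poor coverage": 2 / 7,     # Scenario 3: 2/7
}

for name, score in scenario_scores.items():
    verdict = "PASS" if score >= THRESHOLD else "FAIL"
    print(f"{name}: {score:.2f} -> {verdict}")
# complete coverage: 1.00 -> PASS
# partial coverage: 0.60 -> PASS
# poor coverage: 0.29 -> FAIL
```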


Why It Matters

📝 Content Coverage

Ensures AI responses include all important information, not just some of it.

🎓 Educational Quality

Critical for tutoring systems where incomplete answers leave knowledge gaps.

📋 Requirements Coverage

Verify that responses address all parts of complex queries.


Quick Reference

TL;DR

Answer Completeness = Does the response cover all aspects from the expected answer?

  • Use it when: You have reference answers and need comprehensive coverage
  • Score interpretation: Higher = more aspects from expected output covered
  • Key difference: Measures coverage, not accuracy