Answer Completeness¶
LLM-Powered · Knowledge · Single Turn
At a Glance¶
Score Range
0.0 ──────── 1.0 · Coverage ratio
Default Threshold
0.5 · Pass/fail cutoff
Required Inputs
query · actual_output · expected_output · Reference answer required
What It Measures
Answer Completeness evaluates whether the response covers all the key aspects from the expected output. It answers: "Did the AI mention everything important from the reference answer?"
| Score | Interpretation |
|---|---|
| 1.0 | All aspects from expected output covered |
| 0.7+ | Most aspects covered, minor omissions |
| 0.5 | Half the expected content covered |
| < 0.5 | Significant content missing |
Use this metric when:

- You have reference answers
- Completeness matters more than brevity
- Testing comprehensive responses
- Evaluating educational content

Avoid it when:

- Brevity is preferred
- Multiple valid answer formats exist
- No expected_output is available
- Evaluating creative/generative tasks
See Also: Answer Criteria
Answer Completeness checks coverage of expected output aspects. Answer Criteria checks coverage of custom acceptance criteria.
Use Completeness when you have a reference answer; use Criteria for custom requirements.
How It Works
The metric extracts key aspects from the expected output and checks if each is covered in the actual response.
Step-by-Step Process¶
```mermaid
flowchart TD
    subgraph INPUT["Inputs"]
        A[Query]
        B[AI Response]
        C[Expected Output]
    end
    subgraph EXTRACT["Step 1: Aspect Extraction"]
        D[Extract Aspects from Expected]
        E["Key Aspects List"]
    end
    subgraph CHECK["Step 2: Coverage Check"]
        F[Check Each Aspect in Response]
        G["Covered / Not Covered"]
    end
    subgraph SCORE["Step 3: Scoring"]
        H["Count Covered Aspects"]
        I["Calculate Ratio"]
        J["Final Score"]
    end
    A & B & C --> D
    D --> E
    E --> F
    B --> F
    F --> G
    G --> H
    H --> I
    I --> J
    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EXTRACT stroke:#3b82f6,stroke-width:2px
    style CHECK stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style J fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff
```
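The three steps above reduce to a coverage ratio. A minimal, self-contained sketch of the scoring logic (a toy keyword check, not the library's actual implementation, which uses an LLM to extract aspects and judge coverage):

```python
def completeness_score(aspects: list[str], response: str) -> float:
    """Toy coverage check: an aspect counts as covered if its text
    appears in the response. The real metric uses an LLM judge."""
    if not aspects:
        return 0.0
    covered = sum(1 for aspect in aspects if aspect.lower() in response.lower())
    return covered / len(aspects)

aspects = ["cardiovascular health", "muscles", "mood", "weight"]
response = "Exercise improves cardiovascular health and boosts mood."
print(completeness_score(aspects, response))  # 0.5 (2 of 4 aspects covered)
```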
Configuration¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `use_expected_output` | `bool` | `True` | Use expected_output for aspect extraction |
| `mode` | `EvaluationMode` | `GRANULAR` | Evaluation detail level |
Alternative Mode
When use_expected_output=False, the metric uses sub-question decomposition instead of aspect extraction.
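Toggling between the two modes is a constructor argument (a sketch based on the parameter table above):

```python
from axion.metrics import AnswerCompleteness

# Default: aspects are extracted from expected_output
metric = AnswerCompleteness()

# Alternative: sub-question decomposition, no reference answer needed
metric_no_ref = AnswerCompleteness(use_expected_output=False)
```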
Code Examples¶
```python
from axion.metrics import AnswerCompleteness
from axion.dataset import DatasetItem

metric = AnswerCompleteness()

item = DatasetItem(
    query="What are the benefits of exercise?",
    actual_output="Exercise improves cardiovascular health and boosts mood.",
    expected_output="Exercise improves cardiovascular health, strengthens muscles, boosts mood, and helps with weight management.",
)

result = await metric.execute(item)
print(result.pretty())
# Score: 0.5 (2 of 4 aspects covered)
```
```python
from axion.metrics import AnswerCompleteness
from axion.runners import MetricRunner

# Batch evaluation over a dataset
metric = AnswerCompleteness()
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)

for item_result in results:
    print(f"Score: {item_result.score}")
    print(f"Covered: {item_result.signals.covered_aspects_count}/{item_result.signals.total_aspects_count}")
```
Metric Diagnostics¶
Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given: no black boxes.
```python
result = await metric.execute(item)
print(result.pretty())  # Human-readable summary
result.signals          # Full diagnostic breakdown
```
AnswerCompletenessResult Structure
```
AnswerCompletenessResult(
    {
        "score": 0.5,
        "covered_aspects_count": 2,
        "total_aspects_count": 4,
        "concept_coverage_score": 0.5,
        "aspect_breakdown": [
            {
                "aspect": "cardiovascular health improvement",
                "covered": true,
                "concepts_covered": ["cardiovascular health"],
                "reason": "Mentioned in response"
            },
            {
                "aspect": "muscle strengthening",
                "covered": false,
                "concepts_missing": ["muscles", "strength"],
                "reason": "Not mentioned in response"
            },
            {
                "aspect": "mood improvement",
                "covered": true,
                "concepts_covered": ["mood", "boosts"],
                "reason": "Mentioned in response"
            },
            {
                "aspect": "weight management",
                "covered": false,
                "concepts_missing": ["weight"],
                "reason": "Not mentioned in response"
            }
        ]
    }
)
```
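The breakdown is easy to post-process, for example to list exactly which aspects were missed. A self-contained sketch using a plain dict that mirrors the documented fields:

```python
# Toy dict mirroring the documented signal structure
signals = {
    "score": 0.5,
    "aspect_breakdown": [
        {"aspect": "cardiovascular health improvement", "covered": True},
        {"aspect": "muscle strengthening", "covered": False},
        {"aspect": "mood improvement", "covered": True},
        {"aspect": "weight management", "covered": False},
    ],
}

# Collect every aspect the response failed to cover
missing = [a["aspect"] for a in signals["aspect_breakdown"] if not a["covered"]]
print(missing)  # ['muscle strengthening', 'weight management']
```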
Signal Fields¶
| Field | Type | Description |
|---|---|---|
| `score` | `float` | Overall completeness score |
| `covered_aspects_count` | `int` | Aspects found in response |
| `total_aspects_count` | `int` | Total aspects from expected output |
| `aspect_breakdown` | `List` | Per-aspect coverage details |
Example Scenarios¶
✅ Scenario 1: Complete Coverage (Score: 1.0)
All Aspects Covered
Expected Output:
"Python is a high-level programming language known for readability, extensive libraries, and cross-platform support."
AI Response:
"Python is a high-level language with clean, readable syntax. It has a vast ecosystem of libraries and runs on Windows, Mac, and Linux."
Analysis:
| Aspect | Covered |
|---|---|
| High-level language | ✅ |
| Readability | ✅ |
| Extensive libraries | ✅ |
| Cross-platform | ✅ |
Final Score: 4 / 4 = 1.0
⚠️ Scenario 2: Partial Coverage (Score: 0.6)
Some Aspects Missing
Expected Output:
"Our product offers: free shipping, 30-day returns, 24/7 support, price matching, and warranty."
AI Response:
"We provide free shipping on all orders and a 30-day return policy. Our support team is available around the clock."
Analysis:
| Aspect | Covered |
|---|---|
| Free shipping | ✅ |
| 30-day returns | ✅ |
| 24/7 support | ✅ |
| Price matching | ❌ |
| Warranty | ❌ |
Final Score: 3 / 5 = 0.6
❌ Scenario 3: Poor Coverage (Score: 0.29)
Most Aspects Missing
Expected Output:
"The recipe requires flour, sugar, eggs, and butter. Preheat oven to 350°F. Mix ingredients, pour into pan, bake 25 minutes."
AI Response:
"You'll need flour and sugar."
Analysis:
| Aspect | Covered |
|---|---|
| Flour | ✅ |
| Sugar | ✅ |
| Eggs | ❌ |
| Butter | ❌ |
| Oven temperature | ❌ |
| Mixing instructions | ❌ |
| Baking time | ❌ |
Final Score: 2 / 7 = 0.29
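Each scenario score above is just the covered-to-total ratio, rounded for display:

```python
def ratio(covered: int, total: int) -> float:
    """Coverage ratio, rounded to two decimals for display."""
    return round(covered / total, 2)

print(ratio(4, 4))  # 1.0   (Scenario 1)
print(ratio(3, 5))  # 0.6   (Scenario 2)
print(ratio(2, 7))  # 0.29  (Scenario 3)
```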
Why It Matters¶
- Ensures AI responses include all important information, not just some of it.
- Critical for tutoring systems, where incomplete answers leave knowledge gaps.
- Verifies that responses address all parts of complex queries.
Quick Reference¶
TL;DR
Answer Completeness = Does the response cover all aspects from the expected answer?
- Use it when: You have reference answers and need comprehensive coverage
- Score interpretation: Higher = more aspects from expected output covered
- Key difference: Measures coverage, not accuracy
- API Reference
- Related Metrics: Answer Criteria · Factual Accuracy · Answer Relevancy