Answer Criteria¶
`LLM-Powered` · `Knowledge` · `Single Turn` · `Multi-Turn`
At a Glance¶
- **Score Range:** 0.0 – 1.0 (criteria coverage ratio)
- **Default Threshold:** 0.5 (pass/fail cutoff)
- **Required Inputs:** `query`, `actual_output` · Optional: `acceptance_criteria`
What It Measures
Answer Criteria evaluates whether a response meets user-defined acceptance criteria. It decomposes criteria into aspects and concepts, then checks coverage. This is ideal for custom evaluation requirements that don't fit standard metrics.
| Score | Interpretation |
|---|---|
| 1.0 | All criteria aspects fully covered |
| 0.7+ | Most criteria met, minor gaps |
| 0.5 | Half the criteria covered |
| < 0.5 | Significant criteria not met |
**Use when:**

- Custom acceptance criteria exist
- Domain-specific requirements apply
- Multi-aspect evaluation is needed
- Testing conversational agents

**Avoid when:**

- Standard metrics suffice
- No clear acceptance criteria exist
- The evaluation is purely factual
- A simple pass/fail check is enough
See Also: Answer Completeness
Answer Criteria evaluates against custom acceptance criteria. Answer Completeness evaluates against expected output aspects.
Use Criteria for custom requirements; use Completeness when you have a reference answer.
How It Works¶
The metric decomposes acceptance criteria into aspects, identifies key concepts per aspect, then checks if the response covers them.
Step-by-Step Process¶
```mermaid
flowchart TD
    subgraph INPUT["Inputs"]
        A[Query]
        B[AI Response]
        C[Acceptance Criteria]
    end

    subgraph DECOMPOSE["Step 1: Criteria Decomposition"]
        D[Extract Aspects]
        E["Aspects with Key Concepts"]
    end

    subgraph EVALUATE["Step 2: Coverage Check"]
        F[Check Each Aspect]
        G["Covered / Missing Concepts"]
    end

    subgraph SCORE["Step 3: Scoring"]
        H["Apply Scoring Strategy"]
        I["Final Score"]
    end

    A & B & C --> D
    D --> E
    E --> F
    B --> F
    F --> G
    G --> H
    H --> I

    style INPUT stroke:#1E3A5F,stroke-width:2px
    style DECOMPOSE stroke:#3b82f6,stroke-width:2px
    style EVALUATE stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style I fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff
```
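The decompose-then-check flow above can be sketched in plain Python. This is an illustrative toy, not the library's implementation: in the real metric, an LLM extracts aspects and concepts and judges coverage, while here the `aspects` mapping is hand-written and coverage is approximated by substring matching.

```python
# Toy sketch of the decompose-then-check flow. In the actual metric, an LLM
# performs the extraction and coverage judgment; both are simplified here.
def concept_coverage(response: str, aspects: dict) -> float:
    """aspects maps aspect name -> list of key concepts (concept-level scoring)."""
    total = sum(len(concepts) for concepts in aspects.values())
    covered = sum(
        1
        for concepts in aspects.values()
        for concept in concepts
        if concept.lower() in response.lower()
    )
    return covered / total if total else 0.0

aspects = {
    "Bean freshness": ["fresh beans"],
    "Water temperature": ["200"],
    "Brew time": ["4 minutes"],
}
response = "Use fresh beans and water at 200°F."
print(concept_coverage(response, aspects))  # 2 of 3 concepts covered -> 0.666...
```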
Scoring Strategies¶
Choose how to calculate the final score based on aspect and concept coverage.

- **`concept`** (default): `Score = total_concepts_covered / total_concepts`. Granular concept-level coverage.
- **`aspect`**: `Score = covered_aspects / total_aspects`. Binary per-aspect (all-or-nothing).
- **`weighted`**: `Score = 0.7 × concept_score + 0.3 × aspect_score`. Blend of both approaches.
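The three strategies differ only in how the aspect- and concept-level counts are combined. A minimal sketch, assuming the counts have already been determined (`concept_weight` mirrors the `weighted_concept_score_weight` parameter):

```python
def final_score(concepts_covered: int, total_concepts: int,
                aspects_covered: int, total_aspects: int,
                strategy: str = 'concept', concept_weight: float = 0.7) -> float:
    # Concept-level and aspect-level coverage ratios
    concept_score = concepts_covered / total_concepts
    aspect_score = aspects_covered / total_aspects
    if strategy == 'concept':
        return concept_score
    if strategy == 'aspect':
        return aspect_score
    # 'weighted': blend of both
    return concept_weight * concept_score + (1 - concept_weight) * aspect_score

# Example: 5 of 7 concepts and 3 of 4 aspects covered
print(round(final_score(5, 7, 3, 4, 'concept'), 3))   # 0.714
print(round(final_score(5, 7, 3, 4, 'aspect'), 3))    # 0.75
print(round(final_score(5, 7, 3, 4, 'weighted'), 3))  # 0.725
```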
Configuration¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `criteria_key` | `str` | `'Complete'` | Key to look up criteria |
| `scoring_strategy` | `'concept' \| 'aspect' \| 'weighted'` | `'concept'` | How to calculate the score |
| `check_for_contradictions` | `bool` | `False` | Check if the response contradicts the criteria |
| `weighted_concept_score_weight` | `float` | `0.7` | Weight for the concept score in the weighted strategy |
| `multi_turn_strategy` | `'last_turn' \| 'all_turns'` | `'last_turn'` | How to evaluate conversations |
| `multi_turn_aggregation` | `'cumulative' \| 'average'` | `'cumulative'` | How to aggregate multi-turn scores |
```python
from axion.metrics import AnswerCriteria

# Concept-level (default, most granular)
metric = AnswerCriteria(scoring_strategy='concept')

# Aspect-level (binary per aspect)
metric = AnswerCriteria(scoring_strategy='aspect')

# Weighted blend
metric = AnswerCriteria(
    scoring_strategy='weighted',
    weighted_concept_score_weight=0.7,  # 70% concept, 30% aspect
)
```
Code Examples¶
```python
from axion.metrics import AnswerCriteria
from axion.dataset import DatasetItem

metric = AnswerCriteria()

item = DatasetItem(
    query="Explain how to make a good cup of coffee",
    actual_output="Use fresh beans, grind just before brewing, use water at 200°F, and brew for 4 minutes.",
    acceptance_criteria="Must mention: bean freshness, grind timing, water temperature, brew time",
)

result = await metric.execute(item)
print(result.pretty())
```
```python
from axion.metrics import AnswerCriteria
from axion.dataset import DatasetItem

# Strict aspect-level scoring
metric = AnswerCriteria(
    scoring_strategy='aspect',
    check_for_contradictions=True,
)

item = DatasetItem(
    query="What's your return policy?",
    actual_output="You can return items within 30 days with receipt.",
    acceptance_criteria="""
    Must cover:
    1. Return window (30 days)
    2. Receipt requirement
    3. Condition of items
    4. Refund method
    """,
)

result = await metric.execute(item)
print(result.pretty())
```
Metric Diagnostics¶
Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given; no black boxes.
```python
result = await metric.execute(item)
print(result.pretty())  # Human-readable summary
result.signals          # Full diagnostic breakdown
```
AnswerCriteriaResult Structure
```
AnswerCriteriaResult(
    {
        "scoring_strategy": "concept",
        "covered_aspects_count": 3,
        "total_aspects_count": 4,
        "total_concepts_covered": 5,
        "total_concepts": 7,
        "concept_coverage_score": 0.71,
        "aspect_breakdown": [
            {
                "aspect": "Bean freshness",
                "covered": true,
                "concepts_covered": ["fresh beans", "quality"],
                "concepts_missing": [],
                "reason": "Response mentions using fresh beans"
            },
            {
                "aspect": "Water temperature",
                "covered": true,
                "concepts_covered": ["200°F"],
                "concepts_missing": ["optimal range"],
                "reason": "Specific temperature provided"
            }
        ],
        "evaluated_turns_count": 1
    }
)
```
Signal Fields¶
| Field | Type | Description |
|---|---|---|
| `scoring_strategy` | `str` | Strategy used (`concept`/`aspect`/`weighted`) |
| `covered_aspects_count` | `int` | Aspects fully covered |
| `total_aspects_count` | `int` | Total aspects in criteria |
| `total_concepts_covered` | `int` | Concepts found in the response |
| `total_concepts` | `int` | Total concepts across all aspects |
| `concept_coverage_score` | `float` | Concept-level coverage ratio |
| `aspect_breakdown` | `List` | Per-aspect coverage details |
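The diagnostics are easy to post-process. A hedged sketch, assuming the signals payload can be read as a plain dict shaped like the `AnswerCriteriaResult` example above (field names taken from the Signal Fields table):

```python
# Stand-in for result.signals, shaped like the diagnostics example above.
signals = {
    "covered_aspects_count": 3,
    "total_aspects_count": 4,
    "aspect_breakdown": [
        {"aspect": "Bean freshness", "covered": True, "concepts_missing": []},
        {"aspect": "Brew time", "covered": False, "concepts_missing": ["4 minutes"]},
    ],
}

# Surface the aspects (and their missing concepts) that dragged the score down.
gaps = {
    entry["aspect"]: entry["concepts_missing"]
    for entry in signals["aspect_breakdown"]
    if not entry["covered"]
}
print(gaps)  # {'Brew time': ['4 minutes']}
```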
Example Scenarios¶
✅ Scenario 1: Full Coverage (Score: 1.0)
All Criteria Met
Criteria:
"Must mention: greeting, issue acknowledgment, solution, follow-up offer"
AI Response:
"Hello! I understand you're having trouble with your order. I've issued a full refund which will appear in 3-5 days. Is there anything else I can help with?"
Analysis:
| Aspect | Covered | Concepts |
|---|---|---|
| Greeting | ✅ | "Hello" |
| Issue acknowledgment | ✅ | "trouble with your order" |
| Solution | ✅ | "full refund", "3-5 days" |
| Follow-up offer | ✅ | "anything else I can help" |
Final Score: 4 / 4 = 1.0
⚠️ Scenario 2: Partial Coverage (Score: 0.75)
Some Criteria Missing
Criteria:
"Must include: product name, price, availability, shipping info"
AI Response:
"The Widget Pro costs $49.99 and is currently in stock."
Analysis:
| Aspect | Covered | Concepts |
|---|---|---|
| Product name | ✅ | "Widget Pro" |
| Price | ✅ | "$49.99" |
| Availability | ✅ | "in stock" |
| Shipping info | ❌ | missing |
Final Score (aspect): 3 / 4 = 0.75
No shipping information provided.
❌ Scenario 3: Poor Coverage (Score: 0.25)
Most Criteria Not Met
Criteria:
"Must cover: apology, explanation, compensation, prevention steps"
AI Response:
"We apologize for the inconvenience."
Analysis:
| Aspect | Covered | Concepts |
|---|---|---|
| Apology | ✅ | "apologize" |
| Explanation | ❌ | missing |
| Compensation | ❌ | missing |
| Prevention steps | ❌ | missing |
Final Score: 1 / 4 = 0.25
Why It Matters¶
- Define exactly what a good response looks like for your specific use case.
- Ensure AI responses follow company guidelines, scripts, or regulatory requirements.
- Evaluate customer service agents against expected response patterns.
Quick Reference¶
TL;DR
Answer Criteria = Does the response meet your custom acceptance criteria?
- Use it when: You have specific requirements beyond standard metrics
- Score interpretation: Higher = more criteria aspects covered
- Key config: Choose `scoring_strategy` based on granularity needs
- API Reference
- Related Metrics: Answer Completeness · Answer Relevancy · Factual Accuracy