Answer Criteria¶
`LLM-Powered` · `Knowledge` · `Single Turn` · `Multi-Turn`
At a Glance¶
- **Score Range:** 0.0 – 1.0 (criteria coverage ratio)
- **Default Threshold:** 0.5 (pass/fail cutoff)
- **Required Inputs:** `query`, `actual_output` · Optional: `acceptance_criteria`
What It Measures
Answer Criteria evaluates whether a response meets user-defined acceptance criteria. It decomposes criteria into aspects and concepts, then checks coverage. This is ideal for custom evaluation requirements that don't fit standard metrics.
| Score | Interpretation |
|---|---|
| 1.0 | All criteria aspects fully covered |
| 0.7+ | Most criteria met, minor gaps |
| 0.5 | Half the criteria covered |
| < 0.5 | Significant criteria not met |
**Use when:**

- Custom acceptance criteria exist
- Domain-specific requirements apply
- Multi-aspect evaluation is needed
- Testing conversational agents

**Avoid when:**

- Standard metrics suffice
- No clear acceptance criteria exist
- The evaluation is purely factual
- A simple pass/fail check is enough
See Also: Answer Completeness
Answer Criteria evaluates against custom acceptance criteria. Answer Completeness evaluates against expected output aspects.
Use Criteria for custom requirements; use Completeness when you have a reference answer.
How It Works¶
The metric decomposes acceptance criteria into aspects, identifies key concepts per aspect, then checks if the response covers them.
Step-by-Step Process¶
```mermaid
flowchart TD
    subgraph INPUT["Inputs"]
        A[Query]
        B[AI Response]
        C[Acceptance Criteria]
    end

    subgraph DECOMPOSE["Step 1: Criteria Decomposition"]
        D[Extract Aspects]
        E["Aspects with Key Concepts"]
    end

    subgraph EVALUATE["Step 2: Coverage Check"]
        F[Check Each Aspect]
        G["Covered / Missing Concepts"]
    end

    subgraph SCORE["Step 3: Scoring"]
        H["Apply Scoring Strategy"]
        I["Final Score"]
    end

    A & B & C --> D
    D --> E
    E --> F
    B --> F
    F --> G
    G --> H
    H --> I

    style INPUT stroke:#1E3A5F,stroke-width:2px
    style DECOMPOSE stroke:#3b82f6,stroke-width:2px
    style EVALUATE stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style I fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff
```
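The decompose-then-check flow above can be sketched in plain Python. This is an illustrative toy, not the library's implementation: in the real metric, an LLM extracts aspects and concepts and judges coverage, while here the `aspects` mapping is hand-written and coverage is approximated by substring matching.

```python
# Toy sketch of the decompose-then-check flow. In the actual metric, an LLM
# performs the extraction and coverage judgment; both are simplified here.
def concept_coverage(response: str, aspects: dict) -> float:
    """aspects maps aspect name -> list of key concepts (concept-level scoring)."""
    total = sum(len(concepts) for concepts in aspects.values())
    covered = sum(
        1
        for concepts in aspects.values()
        for concept in concepts
        if concept.lower() in response.lower()
    )
    return covered / total if total else 0.0

aspects = {
    "Bean freshness": ["fresh beans"],
    "Water temperature": ["200"],
    "Brew time": ["4 minutes"],
}
response = "Use fresh beans and water at 200°F."
print(concept_coverage(response, aspects))  # 2 of 3 concepts covered -> 0.666...
```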
Scoring Strategies¶
Choose how to calculate the final score based on aspect and concept coverage.

- **`concept`** (default): `Score = total_concepts_covered / total_concepts`. Granular concept-level coverage.
- **`aspect`**: `Score = covered_aspects / total_aspects`. Binary per-aspect (all-or-nothing).
- **`weighted`**: `Score = 0.7 × concept_score + 0.3 × aspect_score`. Blend of both approaches.
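The three strategies differ only in how the aspect- and concept-level counts are combined. A minimal sketch, assuming the counts have already been determined (`concept_weight` mirrors the `weighted_concept_score_weight` parameter):

```python
def final_score(concepts_covered: int, total_concepts: int,
                aspects_covered: int, total_aspects: int,
                strategy: str = 'concept', concept_weight: float = 0.7) -> float:
    # Concept-level and aspect-level coverage ratios
    concept_score = concepts_covered / total_concepts
    aspect_score = aspects_covered / total_aspects
    if strategy == 'concept':
        return concept_score
    if strategy == 'aspect':
        return aspect_score
    # 'weighted': blend of both
    return concept_weight * concept_score + (1 - concept_weight) * aspect_score

# Example: 5 of 7 concepts and 3 of 4 aspects covered
print(round(final_score(5, 7, 3, 4, 'concept'), 3))   # 0.714
print(round(final_score(5, 7, 3, 4, 'aspect'), 3))    # 0.75
print(round(final_score(5, 7, 3, 4, 'weighted'), 3))  # 0.725
```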
Configuration¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `criteria_key` | `str` | `'Complete'` | Key to look up criteria |
| `scoring_strategy` | `'concept' \| 'aspect' \| 'weighted'` | `'concept'` | How to calculate the score |
| `check_for_contradictions` | `bool` | `False` | Check if the response contradicts the criteria |
| `weighted_concept_score_weight` | `float` | `0.7` | Weight for the concept score in the weighted strategy |
| `multi_turn_strategy` | `'last_turn' \| 'all_turns'` | `'last_turn'` | How to evaluate conversations |
| `multi_turn_aggregation` | `'cumulative' \| 'average'` | `'cumulative'` | How to aggregate multi-turn scores |
```python
from axion.metrics import AnswerCriteria

# Concept-level (default, most granular)
metric = AnswerCriteria(scoring_strategy='concept')

# Aspect-level (binary per aspect)
metric = AnswerCriteria(scoring_strategy='aspect')

# Weighted blend
metric = AnswerCriteria(
    scoring_strategy='weighted',
    weighted_concept_score_weight=0.7,  # 70% concept, 30% aspect
)
```
Code Examples¶
```python
from axion.metrics import AnswerCriteria
from axion.dataset import DatasetItem

metric = AnswerCriteria()

item = DatasetItem(
    query="Explain how to make a good cup of coffee",
    actual_output="Use fresh beans, grind just before brewing, use water at 200°F, and brew for 4 minutes.",
    acceptance_criteria="Must mention: bean freshness, grind timing, water temperature, brew time",
)

result = await metric.execute(item)
print(result.pretty())
```
```python
from axion.metrics import AnswerCriteria
from axion.dataset import DatasetItem

# Strict aspect-level scoring
metric = AnswerCriteria(
    scoring_strategy='aspect',
    check_for_contradictions=True,
)

item = DatasetItem(
    query="What's your return policy?",
    actual_output="You can return items within 30 days with receipt.",
    acceptance_criteria="""
    Must cover:
    1. Return window (30 days)
    2. Receipt requirement
    3. Condition of items
    4. Refund method
    """,
)

result = await metric.execute(item)
print(result.pretty())
```
Metric Diagnostics¶
Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given; no black boxes.
```python
result = await metric.execute(item)
print(result.pretty())  # Human-readable summary
result.signals          # Full diagnostic breakdown
```
AnswerCriteriaResult Structure
```
AnswerCriteriaResult(
    {
        "scoring_strategy": "concept",
        "covered_aspects_count": 3,
        "total_aspects_count": 4,
        "total_concepts_covered": 5,
        "total_concepts": 7,
        "concept_coverage_score": 0.71,
        "aspect_breakdown": [
            {
                "aspect": "Bean freshness",
                "covered": true,
                "concepts_covered": ["fresh beans", "quality"],
                "concepts_missing": [],
                "reason": "Response mentions using fresh beans"
            },
            {
                "aspect": "Water temperature",
                "covered": true,
                "concepts_covered": ["200°F"],
                "concepts_missing": ["optimal range"],
                "reason": "Specific temperature provided"
            }
        ],
        "evaluated_turns_count": 1
    }
)
```
Signal Fields¶
| Field | Type | Description |
|---|---|---|
| `scoring_strategy` | `str` | Strategy used (`concept`/`aspect`/`weighted`) |
| `covered_aspects_count` | `int` | Aspects fully covered |
| `total_aspects_count` | `int` | Total aspects in criteria |
| `total_concepts_covered` | `int` | Concepts found in the response |
| `total_concepts` | `int` | Total concepts across all aspects |
| `concept_coverage_score` | `float` | Concept-level coverage ratio |
| `aspect_breakdown` | `List` | Per-aspect coverage details |
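The diagnostics are easy to post-process. A hedged sketch, assuming the signals payload can be read as a plain dict shaped like the `AnswerCriteriaResult` example above (field names taken from the Signal Fields table):

```python
# Stand-in for result.signals, shaped like the diagnostics example above.
signals = {
    "covered_aspects_count": 3,
    "total_aspects_count": 4,
    "aspect_breakdown": [
        {"aspect": "Bean freshness", "covered": True, "concepts_missing": []},
        {"aspect": "Brew time", "covered": False, "concepts_missing": ["4 minutes"]},
    ],
}

# Surface the aspects (and their missing concepts) that dragged the score down.
gaps = {
    entry["aspect"]: entry["concepts_missing"]
    for entry in signals["aspect_breakdown"]
    if not entry["covered"]
}
print(gaps)  # {'Brew time': ['4 minutes']}
```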
Example Scenarios¶
✅ Scenario 1: Full Coverage (Score: 1.0)
All Criteria Met
Criteria:
"Must mention: greeting, issue acknowledgment, solution, follow-up offer"
AI Response:
"Hello! I understand you're having trouble with your order. I've issued a full refund which will appear in 3-5 days. Is there anything else I can help with?"
Analysis:
| Aspect | Covered | Concepts |
|---|---|---|
| Greeting | ✅ | "Hello" |
| Issue acknowledgment | ✅ | "trouble with your order" |
| Solution | ✅ | "full refund", "3-5 days" |
| Follow-up offer | ✅ | "anything else I can help" |
Final Score: 4 / 4 = 1.0
⚠️ Scenario 2: Partial Coverage (Score: 0.75)
Some Criteria Missing
Criteria:
"Must include: product name, price, availability, shipping info"
AI Response:
"The Widget Pro costs $49.99 and is currently in stock."
Analysis:
| Aspect | Covered | Concepts |
|---|---|---|
| Product name | ✅ | "Widget Pro" |
| Price | ✅ | "$49.99" |
| Availability | ✅ | "in stock" |
| Shipping info | ❌ | missing |
Final Score (aspect): 3 / 4 = 0.75
No shipping information provided.
❌ Scenario 3: Poor Coverage (Score: 0.25)
Most Criteria Not Met
Criteria:
"Must cover: apology, explanation, compensation, prevention steps"
AI Response:
"We apologize for the inconvenience."
Analysis:
| Aspect | Covered | Concepts |
|---|---|---|
| Apology | ✅ | "apologize" |
| Explanation | ❌ | missing |
| Compensation | ❌ | missing |
| Prevention steps | ❌ | missing |
Final Score: 1 / 4 = 0.25
Why It Matters¶
- Define exactly what a good response looks like for your specific use case.
- Ensure AI responses follow company guidelines, scripts, or regulatory requirements.
- Evaluate customer service agents against expected response patterns.
Quick Reference¶
TL;DR
Answer Criteria = Does the response meet your custom acceptance criteria?
- Use it when: You have specific requirements beyond standard metrics
- Score interpretation: Higher = more criteria aspects covered
- Key config: Choose `scoring_strategy` based on granularity needs
- API Reference
- Related Metrics: Answer Completeness · Answer Relevancy · Factual Accuracy