
Answer Criteria

Evaluate responses against user-defined acceptance criteria
LLM-Powered · Knowledge · Single Turn · Multi-Turn

At a Glance

🎯
Score Range
0.0 ──────── 1.0
Criteria coverage ratio
⚑
Default Threshold
0.5
Pass/fail cutoff
📋
Required Inputs
query, actual_output
Optional: acceptance_criteria

What It Measures

Answer Criteria evaluates whether a response meets user-defined acceptance criteria. It decomposes criteria into aspects and concepts, then checks coverage. This is ideal for custom evaluation requirements that don't fit standard metrics.

Score Interpretation

| Score | Meaning |
| --- | --- |
| 1.0 | All criteria aspects fully covered |
| 0.7+ | Most criteria met, minor gaps |
| 0.5 | Half the criteria covered |
| < 0.5 | Significant criteria not met |
✅ Use When
  • Custom acceptance criteria exist
  • Domain-specific requirements
  • Multi-aspect evaluation needed
  • Testing conversational agents
❌ Don't Use When
  • Standard metrics suffice
  • No clear acceptance criteria
  • Purely factual evaluation
  • Simple pass/fail needed

See Also: Answer Completeness

Answer Criteria evaluates against custom acceptance criteria. Answer Completeness evaluates against expected output aspects.

Use Criteria for custom requirements; use Completeness when you have a reference answer.


How It Works

The metric decomposes acceptance criteria into aspects, identifies key concepts per aspect, then checks if the response covers them.

Step-by-Step Process

flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[AI Response]
        C[Acceptance Criteria]
    end

    subgraph DECOMPOSE["🔍 Step 1: Criteria Decomposition"]
        D[Extract Aspects]
        E["Aspects with Key Concepts"]
    end

    subgraph EVALUATE["⚖️ Step 2: Coverage Check"]
        F[Check Each Aspect]
        G["Covered / Missing Concepts"]
    end

    subgraph SCORE["📊 Step 3: Scoring"]
        H["Apply Scoring Strategy"]
        I["Final Score"]
    end

    A & B & C --> D
    D --> E
    E --> F
    B --> F
    F --> G
    G --> H
    H --> I

    style INPUT stroke:#1E3A5F,stroke-width:2px
    style DECOMPOSE stroke:#3b82f6,stroke-width:2px
    style EVALUATE stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style I fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff
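
To make the flow concrete, here is a deterministic sketch of the decompose-then-check shape. The real metric uses an LLM to extract aspects and key concepts and to judge coverage; the hard-coded `aspects` dict and the naive substring matching below are purely illustrative.

```python
criteria = "Must mention: bean freshness, grind timing, water temperature, brew time"
response = ("Use fresh beans, grind just before brewing, "
            "use water at 200°F, and brew for 4 minutes.")

# Step 1: decompose criteria into aspects, each with key concepts
# (a hypothetical output of the LLM decomposition step).
aspects = {
    "Bean freshness": ["fresh beans"],
    "Grind timing": ["grind just before"],
    "Water temperature": ["200°F"],
    "Brew time": ["4 minutes"],
}

# Step 2: check which concepts the response covers (naive matching
# here; the metric itself uses LLM judgment, not substrings).
coverage = {
    aspect: [c for c in concepts if c.lower() in response.lower()]
    for aspect, concepts in aspects.items()
}

# Step 3: an aspect counts as covered when all its concepts are found.
covered_aspects = sum(
    1 for a, found in coverage.items() if len(found) == len(aspects[a])
)
print(covered_aspects, "/", len(aspects), "aspects covered")
```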

Scoring Strategies

Choose how to calculate the final score based on aspect and concept coverage.

📊 CONCEPT
Score = total_concepts_covered / total_concepts
Default. Granular concept-level coverage.

📋 ASPECT
Score = covered_aspects / total_aspects
Binary per-aspect (all-or-nothing).

⚖️ WEIGHTED
Score = 0.7 × concept_score + 0.3 × aspect_score
Blend of both approaches.
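
The three formulas reduce to simple arithmetic. A minimal sketch, using the coverage counts from the diagnostics example later on this page (5 of 7 concepts, 3 of 4 aspects):

```python
def concept_score(covered_concepts, total_concepts):
    # CONCEPT: granular concept-level coverage
    return covered_concepts / total_concepts

def aspect_score(covered_aspects, total_aspects):
    # ASPECT: binary per-aspect coverage
    return covered_aspects / total_aspects

def weighted_score(covered_concepts, total_concepts,
                   covered_aspects, total_aspects, w=0.7):
    # WEIGHTED: blend; w corresponds to weighted_concept_score_weight
    return (w * concept_score(covered_concepts, total_concepts)
            + (1 - w) * aspect_score(covered_aspects, total_aspects))

print(round(concept_score(5, 7), 2))   # 0.71
print(round(aspect_score(3, 4), 2))    # 0.75
print(round(weighted_score(5, 7, 3, 4), 2))
```

Note how the weighted score (0.7 × 5/7 + 0.3 × 3/4 = 0.725) lands between the two pure strategies.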


Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| criteria_key | str | 'Complete' | Key to look up criteria |
| scoring_strategy | 'concept' \| 'aspect' \| 'weighted' | 'concept' | How to calculate score |
| check_for_contradictions | bool | False | Check if response contradicts criteria |
| weighted_concept_score_weight | float | 0.7 | Weight for concept score in weighted strategy |
| multi_turn_strategy | 'last_turn' \| 'all_turns' | 'last_turn' | How to evaluate conversations |
| multi_turn_aggregation | 'cumulative' \| 'average' | 'cumulative' | How to aggregate multi-turn scores |

from axion.metrics import AnswerCriteria

# Concept-level (default, most granular)
metric = AnswerCriteria(scoring_strategy='concept')

# Aspect-level (binary per aspect)
metric = AnswerCriteria(scoring_strategy='aspect')

# Weighted blend
metric = AnswerCriteria(
    scoring_strategy='weighted',
    weighted_concept_score_weight=0.7  # 70% concept, 30% aspect
)

Code Examples

from axion.metrics import AnswerCriteria
from axion.dataset import DatasetItem

metric = AnswerCriteria()

item = DatasetItem(
    query="Explain how to make a good cup of coffee",
    actual_output="Use fresh beans, grind just before brewing, use water at 200°F, and brew for 4 minutes.",
    acceptance_criteria="Must mention: bean freshness, grind timing, water temperature, brew time",
)

result = await metric.execute(item)
print(result.pretty())

from axion.metrics import AnswerCriteria
from axion.dataset import DatasetItem

# Strict aspect-level scoring
metric = AnswerCriteria(
    scoring_strategy='aspect',
    check_for_contradictions=True
)

item = DatasetItem(
    query="What's your return policy?",
    actual_output="You can return items within 30 days with receipt.",
    acceptance_criteria="""
    Must cover:
    1. Return window (30 days)
    2. Receipt requirement
    3. Condition of items
    4. Refund method
    """,
)

result = await metric.execute(item)

from axion.metrics import AnswerCriteria

metric = AnswerCriteria(
    multi_turn_strategy='all_turns',
    multi_turn_aggregation='cumulative'  # Criteria can be met across turns
)
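
The difference between the two aggregation modes can be sketched with plain set arithmetic. This illustrates the documented behavior, not the library's internals; the turn data is hypothetical.

```python
# Hypothetical 3-turn conversation over 4 required concepts.
required = {"greeting", "acknowledgment", "solution", "follow-up"}
per_turn = [
    {"greeting"},                    # turn 1
    {"acknowledgment", "solution"},  # turn 2
    {"follow-up"},                   # turn 3
]

# 'cumulative': criteria can be met across turns, so coverage is the
# union of everything covered so far.
covered = set().union(*per_turn) & required
cumulative = len(covered) / len(required)  # 4/4 = 1.0

# 'average': each turn is scored independently, then averaged.
average = sum(len(t & required) / len(required) for t in per_turn) / len(per_turn)
print(cumulative, round(average, 2))
```

Under `cumulative` the conversation as a whole passes even though no single turn does, which is usually what you want for multi-turn agents.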

Metric Diagnostics

Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given: no black boxes.

result = await metric.execute(item)
print(result.pretty())      # Human-readable summary
result.signals              # Full diagnostic breakdown
📊 AnswerCriteriaResult Structure
AnswerCriteriaResult(
{
    "scoring_strategy": "concept",
    "covered_aspects_count": 3,
    "total_aspects_count": 4,
    "total_concepts_covered": 5,
    "total_concepts": 7,
    "concept_coverage_score": 0.71,
    "aspect_breakdown": [
        {
            "aspect": "Bean freshness",
            "covered": true,
            "concepts_covered": ["fresh beans", "quality"],
            "concepts_missing": [],
            "reason": "Response mentions using fresh beans"
        },
        {
            "aspect": "Water temperature",
            "covered": true,
            "concepts_covered": ["200°F"],
            "concepts_missing": ["optimal range"],
            "reason": "Specific temperature provided"
        }
    ],
    "evaluated_turns_count": 1
}
)

Signal Fields

| Field | Type | Description |
| --- | --- | --- |
| scoring_strategy | str | Strategy used (concept/aspect/weighted) |
| covered_aspects_count | int | Aspects fully covered |
| total_aspects_count | int | Total aspects in criteria |
| total_concepts_covered | int | Concepts found in response |
| total_concepts | int | Total concepts across all aspects |
| concept_coverage_score | float | Concept-level coverage ratio |
| aspect_breakdown | List | Per-aspect coverage details |
| evaluated_turns_count | int | Number of turns evaluated |
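
A common diagnostic pattern is surfacing the missing concepts per aspect. The snippet below uses a hard-coded dict in the shape shown above as a stand-in for `result.signals`; whether `signals` is dict-like or attribute-based depends on the library, so treat the access style as an assumption.

```python
# Hard-coded stand-in for result.signals, in the shape shown above.
signals = {
    "scoring_strategy": "concept",
    "aspect_breakdown": [
        {"aspect": "Bean freshness", "covered": True,
         "concepts_covered": ["fresh beans", "quality"],
         "concepts_missing": []},
        {"aspect": "Water temperature", "covered": True,
         "concepts_covered": ["200°F"],
         "concepts_missing": ["optimal range"]},
    ],
}

# Report every aspect that still has missing concepts.
gaps = {e["aspect"]: e["concepts_missing"]
        for e in signals["aspect_breakdown"] if e["concepts_missing"]}
for aspect, missing in gaps.items():
    print(f"{aspect}: missing {missing}")
```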

Example Scenarios

✅ Scenario 1: Full Coverage (Score: 1.0)

All Criteria Met

Criteria:

"Must mention: greeting, issue acknowledgment, solution, follow-up offer"

AI Response:

"Hello! I understand you're having trouble with your order. I've issued a full refund which will appear in 3-5 days. Is there anything else I can help with?"

Analysis:

| Aspect | Covered | Concepts |
| --- | --- | --- |
| Greeting | ✅ | "Hello" |
| Issue acknowledgment | ✅ | "trouble with your order" |
| Solution | ✅ | "full refund", "3-5 days" |
| Follow-up offer | ✅ | "anything else I can help" |

Final Score: 4 / 4 = 1.0

⚠️ Scenario 2: Partial Coverage (Score: 0.75)

Some Criteria Missing

Criteria:

"Must include: product name, price, availability, shipping info"

AI Response:

"The Widget Pro costs $49.99 and is currently in stock."

Analysis:

| Aspect | Covered | Concepts |
| --- | --- | --- |
| Product name | ✅ | "Widget Pro" |
| Price | ✅ | "$49.99" |
| Availability | ✅ | "in stock" |
| Shipping info | ❌ | missing |

Final Score (aspect): 3 / 4 = 0.75

No shipping information provided.

❌ Scenario 3: Poor Coverage (Score: 0.25)

Most Criteria Not Met

Criteria:

"Must cover: apology, explanation, compensation, prevention steps"

AI Response:

"We apologize for the inconvenience."

Analysis:

| Aspect | Covered | Concepts |
| --- | --- | --- |
| Apology | ✅ | "apologize" |
| Explanation | ❌ | missing |
| Compensation | ❌ | missing |
| Prevention steps | ❌ | missing |

Final Score: 1 / 4 = 0.25


Why It Matters

🎯 Custom Requirements

Define exactly what a good response looks like for your specific use case.

📋 Policy Compliance

Ensure AI responses follow company guidelines, scripts, or regulatory requirements.

💬 Agent Quality

Evaluate customer service agents against expected response patterns.


Quick Reference

TL;DR

Answer Criteria = Does the response meet your custom acceptance criteria?

  • Use it when: You have specific requirements beyond standard metrics
  • Score interpretation: Higher = more criteria aspects covered
  • Key config: Choose scoring_strategy based on granularity needs