Tone & Style Consistency

Evaluates whether responses match the expected tone, persona, and formatting style.
LLM-Powered · Knowledge · Single Turn

At a Glance

🎯
Score Range
0.0 ──────── 1.0
Style alignment score

⚡
Default Threshold
0.8
Higher bar for consistency

📋
Required Inputs
actual_output, expected_output
Optional: persona_description
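
As a quick illustration of the input contract above, the sketch below filters plain dictionaries down to the records that carry both required fields. The helper name `has_required_inputs` and the use of dicts instead of axion's `DatasetItem` are assumptions made for this example, not part of the library.

```python
from typing import Any, Mapping

# Field names taken from the "Required Inputs" card above.
REQUIRED_FIELDS = ("actual_output", "expected_output")


def has_required_inputs(record: Mapping[str, Any]) -> bool:
    """Return True when both required fields are non-empty strings.

    Hypothetical helper: it only inspects plain mappings and does not
    call into axion.
    """
    return all(
        isinstance(record.get(name), str) and bool(record[name].strip())
        for name in REQUIRED_FIELDS
    )


records = [
    {
        "actual_output": "Your order has been shipped.",
        "expected_output": "Great news! Your order is on its way!",
    },
    {"actual_output": "Your order has been shipped."},  # no reference response
]

# Keep only the records this metric could evaluate.
evaluable = [r for r in records if has_required_inputs(r)]
```

Records without a reference response are dropped up front, which matches the guidance not to use the metric when no expected_output is available.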

What It Measures

Tone & Style Consistency evaluates whether a response matches the emotional tone and writing style of an expected answer. For customer service agents, "Voice" is as important as "Fact": this metric checks that your AI maintains the right persona.

Score	Interpretation
1.0	Perfect match—exact emotion, enthusiasm, formatting
0.8	Minor drift—generally correct but slightly off
0.5	Significant mismatch—wrong tone or style
0.0	Complete failure—robotic, rude, or ignores persona

✅ Use When

Building customer service agents
Persona consistency matters
Brand voice guidelines exist
Comparing against reference responses

❌ Don't Use When

No expected_output available
Tone flexibility is acceptable
Only factual accuracy matters
Creative writing tasks

Step-by-Step Process

flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Actual Output]
        B[Expected Output]
        C[Persona Description]
    end

    subgraph ANALYZE["🔍 Step 1: Style Analysis"]
        D[ToneJudge LLM]
        E["Tone & Style Comparison"]
    end

    subgraph EVALUATE["⚖️ Step 2: Dimension Scoring"]
        F[Evaluate Tone Match]
        G[Evaluate Style Match]
        H["Identify Differences"]
    end

    subgraph SCORE["📊 Step 3: Final Score"]
        I["Combine Dimensions"]
        J["Final Score"]
    end

    A & B & C --> D
    D --> E
    E --> F & G
    F & G --> H
    H --> I
    I --> J

    style INPUT stroke:#1E3A5F,stroke-width:2px
    style ANALYZE stroke:#3b82f6,stroke-width:2px
    style EVALUATE stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style J fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff

The metric scores each response against a four-level rubric:

✅ PERFECT MATCH

1.0

Exact emotion, enthusiasm level, and formatting style.

📊 MINOR DRIFT

0.8

Generally correct but slightly less enthusiastic or formal.

⚠️ SIGNIFICANT MISMATCH

0.5

Neutral when it should be excited, or a completely different style.

❌ COMPLETE FAILURE

0.0

Robotic, rude, or completely ignores persona.
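
To make the rubric easier to apply to scores that fall between the four benchmarks, here is a minimal sketch that maps a score to its closest rubric level. The nearest-anchor rule and the name `nearest_rubric_label` are assumptions for illustration; the source does not say how intermediate scores should be labelled.

```python
# Rubric benchmarks copied from the four levels above.
RUBRIC_ANCHORS = {
    1.0: "Perfect match",
    0.8: "Minor drift",
    0.5: "Significant mismatch",
    0.0: "Complete failure",
}


def nearest_rubric_label(score: float) -> str:
    """Return the label of the rubric benchmark closest to `score`.

    Assumption: an intermediate score is read as its nearest benchmark.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0.0, 1.0], got {score}")
    anchor = min(RUBRIC_ANCHORS, key=lambda a: abs(a - score))
    return RUBRIC_ANCHORS[anchor]


label = nearest_rubric_label(0.55)
```

A score exactly halfway between two benchmarks resolves by dictionary order here, so treat such cases as ambiguous rather than meaningful.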

Two Dimensions

Tone Match: Emotional alignment (enthusiasm, empathy, formality)
Style Match: Formatting, length, vocabulary, structure

Configuration

Parameters

Parameter	Type	Default	Description
`persona_description`	`str`	`None`	Optional persona to enforce (e.g., "Helpful, excited, professional")
`mode`	`EvaluationMode`	`GRANULAR`	Evaluation detail level

Persona Description

When provided, the persona description guides the judge on expected tone characteristics, making evaluation more precise for specific brand voices.
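
One way to picture how a persona description can steer an LLM judge is as an extra section in the judging prompt. The sketch below is purely illustrative: `build_judge_prompt` and its wording are hypothetical and are not axion's actual prompt or API.

```python
from typing import Optional


def build_judge_prompt(
    actual_output: str,
    expected_output: str,
    persona_description: Optional[str] = None,
) -> str:
    """Assemble an illustrative judging prompt (hypothetical, not axion's)."""
    sections = [
        "Compare the tone and style of the actual response "
        "against the expected response.",
        f"Expected response:\n{expected_output}",
        f"Actual response:\n{actual_output}",
    ]
    # The persona section is only added when a description is supplied.
    if persona_description:
        sections.append(f"Target persona:\n{persona_description}")
    return "\n\n".join(sections)


prompt = build_judge_prompt(
    actual_output="I apologize for the inconvenience.",
    expected_output="I'm truly sorry this happened.",
    persona_description="Empathetic, warm, solution-oriented",
)
```

Without a persona the judge has only the reference response to infer the intended voice from; with one, the target characteristics are stated explicitly.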

Code Examples

Basic Usage

from axion.metrics import ToneStyleConsistency
from axion.dataset import DatasetItem

metric = ToneStyleConsistency()

item = DatasetItem(
    actual_output="Your order has been shipped. It will arrive in 3-5 business days.",
    expected_output="Great news! 🎉 Your order is on its way! You can expect it within 3-5 business days. We're so excited for you!",
)

# Top-level await works in a notebook or inside an async function
result = await metric.execute(item)
print(result.pretty())
# Score: 0.5 (tone mismatch - neutral vs enthusiastic)

With Persona

from axion.metrics import ToneStyleConsistency
from axion.dataset import DatasetItem

metric = ToneStyleConsistency()

# The expected persona is supplied through persona_description
item = DatasetItem(
    actual_output="I apologize for the inconvenience. Let me help resolve this.",
    expected_output="I'm truly sorry this happened. I completely understand your frustration, and I'm here to make things right!",
    persona_description="Empathetic, warm, solution-oriented customer service agent",
)

result = await metric.execute(item)
print(result.pretty())

With Runner

from axion.metrics import ToneStyleConsistency
from axion.runners import MetricRunner

metric = ToneStyleConsistency()
runner = MetricRunner(metrics=[metric])
# `dataset` is assumed to be defined earlier (for example, a collection of DatasetItem objects)
results = await runner.run(dataset)

for item_result in results:
    print(f"Score: {item_result.score}")
    print(f"Tone Match: {item_result.signals.tone_match}")
    print(f"Style Match: {item_result.signals.style_match}")
    # Each difference exposes the fields listed under "Difference Fields"
    for diff in item_result.signals.differences:
        print(f"  - {diff.dimension}: expected {diff.expected!r}, got {diff.actual!r}")
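
After a batch run, it can help to summarize scores against the default threshold of 0.8. The following sketch works on a plain list of floats; the name `summarize_scores` and the rule that a score passes when it is greater than or equal to the threshold are assumptions, since the source does not spell out the comparison.

```python
from statistics import mean
from typing import Dict, Iterable


def summarize_scores(scores: Iterable[float], threshold: float = 0.8) -> Dict[str, float]:
    """Summarize a batch of scores (hypothetical helper, independent of axion).

    Assumption: a score passes when it is >= threshold.
    """
    values = list(scores)
    if not values:
        raise ValueError("no scores to summarize")
    passed = sum(1 for value in values if value >= threshold)
    return {
        "mean_score": mean(values),
        "pass_rate": passed / len(values),
        "failed": len(values) - passed,
    }


# Scores as they might be collected from a run, e.g. one per item result.
summary = summarize_scores([1.0, 0.5, 0.0, 0.8])
```

With the runner example, the input list could be built from each result's score attribute before calling the helper.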

Metric Diagnostics

Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given—no black boxes.

result = await metric.execute(item)
print(result.pretty())      # Human-readable summary
result.signals              # Full diagnostic breakdown

📊 ToneStyleResult Structure

ToneStyleResult(
{
    "final_score": 0.5,
    "tone_match": false,
    "style_match": true,
    "differences": [
        {
            "dimension": "Enthusiasm Level",
            "expected": "Excited, celebratory with emoji",
            "actual": "Neutral, matter-of-fact",
            "impact": "Major - missed opportunity to delight customer"
        },
        {
            "dimension": "Emotional Warmth",
            "expected": "Personal, caring language",
            "actual": "Formal, impersonal",
            "impact": "Moderate - feels robotic"
        }
    ]
}
)

Signal Fields

Field	Type	Description
`final_score`	`float`	Overall tone & style alignment score
`tone_match`	`bool`	Whether emotional tone matches expected
`style_match`	`bool`	Whether formatting/writing style matches
`differences`	`List`	Specific differences identified

Difference Fields

Field	Type	Description
`dimension`	`str`	Aspect that differs (e.g., "Enthusiasm Level")
`expected`	`str`	What was expected
`actual`	`str`	What was observed
`impact`	`str`	Severity of the mismatch
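
The two field tables above can be mirrored locally to make diagnostics easier to filter. This is a sketch under stated assumptions: the `Difference` and `ToneStyleSignals` dataclasses are stand-ins that copy the documented field names, not axion's own classes, and the filter assumes `impact` strings begin with a severity word such as "Major" or "Moderate", as in the sample result.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Difference:
    """Local stand-in mirroring the documented difference fields."""
    dimension: str
    expected: str
    actual: str
    impact: str


@dataclass
class ToneStyleSignals:
    """Local stand-in mirroring the documented signal fields."""
    final_score: float
    tone_match: bool
    style_match: bool
    differences: List[Difference] = field(default_factory=list)


def differences_by_severity(signals: ToneStyleSignals, severity: str) -> List[Difference]:
    """Return differences whose impact text starts with `severity` (case-insensitive)."""
    prefix = severity.lower()
    return [d for d in signals.differences if d.impact.lower().startswith(prefix)]


# Data copied from the sample ToneStyleResult shown earlier.
signals = ToneStyleSignals(
    final_score=0.5,
    tone_match=False,
    style_match=True,
    differences=[
        Difference(
            "Enthusiasm Level",
            "Excited, celebratory with emoji",
            "Neutral, matter-of-fact",
            "Major - missed opportunity to delight customer",
        ),
        Difference(
            "Emotional Warmth",
            "Personal, caring language",
            "Formal, impersonal",
            "Moderate - feels robotic",
        ),
    ],
)

major = differences_by_severity(signals, "major")
```

Because `impact` is free text produced by the judge, prefix matching is a convenience for triage and should not be relied on as a stable contract.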

Example Scenarios

✅ Scenario 1: Perfect Match (Score: 1.0)

Tone & Style Aligned

Expected Output:

"Hi there! 👋 Thanks for reaching out! I'd be happy to help you with your question about returns. Our policy allows full refunds within 30 days!"

AI Response:

"Hello! 😊 Thanks so much for contacting us! I'm thrilled to assist with your returns question. You can get a full refund within 30 days—no problem at all!"

Analysis:

Dimension	Match
Enthusiasm	✅ Both excited and welcoming
Emoji usage	✅ Appropriate friendly emoji
Formality	✅ Casual, approachable
Helpfulness	✅ Eager to assist

Final Score: 1.0

⚠️ Scenario 2: Significant Mismatch (Score: 0.5)

Tone Mismatch

Expected Output:

"Great news! 🎉 Your order is on its way! You can expect delivery within 3-5 business days. We're so excited for you!"

AI Response:

"Your order has been shipped. Estimated delivery: 3-5 business days."

Analysis:

Dimension	Match
Information	✅ Same facts conveyed
Enthusiasm	❌ Neutral vs celebratory
Emoji usage	❌ None vs appropriate celebration
Warmth	❌ Impersonal vs personal

Final Score: 0.5

Content is correct but voice is completely wrong.

❌ Scenario 3: Complete Mismatch (Score: 0.0)

Persona Ignored

Expected Output:

"I'm so sorry to hear about this issue! 😔 That's definitely not the experience we want for you. Let me personally look into this right away and make it right!"

AI Response:

"Your complaint has been logged. Reference number: #12345. Allow 5-7 business days for review."

Analysis:

Dimension	Match
Empathy	❌ None vs deeply apologetic
Tone	❌ Cold/bureaucratic vs warm
Personal touch	❌ Ticket number vs personal commitment
Resolution focus	❌ Process vs solution

Final Score: 0.0

Response is robotic when empathy was expected.

Why It Matters

🎭 Brand Voice

Ensures AI maintains your brand's personality across all interactions. Inconsistent tone damages brand perception.

💬 Customer Experience

Customers expect warmth and empathy, not robotic responses. Tone directly impacts satisfaction and loyalty.

🔄 Consistency

Maintain uniform voice across all AI-generated responses, regardless of the underlying model or prompt.

Quick Reference

TL;DR

Tone & Style Consistency = Does the AI response sound like it should?

Use it when: Brand voice and persona consistency matter
Score interpretation: Higher = better alignment with expected tone
Key difference: Measures how something is said, not what is said

API Reference

axion.metrics.ToneStyleConsistency

Related Metrics

Answer Completeness · Answer Relevancy · Answer Criteria