
Tone & Style Consistency

Evaluate if responses match the expected tone, persona, and formatting style
LLM-Powered · Knowledge · Single Turn

At a Glance

  • Score Range: 0.0 to 1.0 (style alignment score)
  • Default Threshold: 0.8 (higher bar for consistency)
  • Required Inputs: actual_output, expected_output
  • Optional Input: persona_description

What It Measures

Tone & Style Consistency evaluates whether a response matches the emotional tone and writing style of an expected answer. For customer service agents, voice matters as much as factual accuracy, and this metric checks that your AI maintains the right persona.

Score Interpretation
| Score | Interpretation |
|-------|----------------|
| 1.0 | Perfect match: exact emotion, enthusiasm, formatting |
| 0.8 | Minor drift: generally correct but slightly off |
| 0.5 | Significant mismatch: wrong tone or style |
| 0.0 | Complete failure: robotic, rude, or ignores persona |
✅ Use When
  • Building customer service agents
  • Persona consistency matters
  • Brand voice guidelines exist
  • Comparing against reference responses
❌ Don't Use When
  • No expected_output available
  • Tone flexibility is acceptable
  • Only factual accuracy matters
  • Creative writing tasks

See Also: Answer Completeness

Tone & Style Consistency evaluates how something is said (voice, formatting). Answer Completeness evaluates what is said (content coverage).

Use both together for comprehensive response quality evaluation.


How It Works

The metric uses an LLM-based judge to evaluate both emotional tone and writing style.

Step-by-Step Process

flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Actual Output]
        B[Expected Output]
        C[Persona Description]
    end

    subgraph ANALYZE["🔍 Step 1: Style Analysis"]
        D[ToneJudge LLM]
        E["Tone & Style Comparison"]
    end

    subgraph EVALUATE["⚖️ Step 2: Dimension Scoring"]
        F[Evaluate Tone Match]
        G[Evaluate Style Match]
        H["Identify Differences"]
    end

    subgraph SCORE["📊 Step 3: Final Score"]
        I["Combine Dimensions"]
        J["Final Score"]
    end

    A & B & C --> D
    D --> E
    E --> F & G
    F & G --> H
    H --> I
    I --> J

    style INPUT stroke:#1E3A5F,stroke-width:2px
    style ANALYZE stroke:#3b82f6,stroke-width:2px
    style EVALUATE stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style J fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff

The judge scores each response against a four-level rubric:

  • ✅ 1.0, Perfect Match: exact emotion, enthusiasm level, and formatting style.
  • 📊 0.8, Minor Drift: generally correct but slightly less enthusiastic or formal.
  • ⚠️ 0.5, Significant Mismatch: neutral when it should be excited, or a completely different style.
  • ❌ 0.0, Complete Failure: robotic, rude, or completely ignores the persona.
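To make the rubric easier to use in reports, a score can be mapped back to its closest rubric level. The sketch below is only an illustration: `RUBRIC` and `nearest_rubric_band` are hypothetical names and not part of the metric's API, and snapping to the nearest anchor is an assumption about how in-between scores might be read.

```python
# Rubric anchors copied from the rubric above.
RUBRIC = {
    1.0: "Perfect match",
    0.8: "Minor drift",
    0.5: "Significant mismatch",
    0.0: "Complete failure",
}


def nearest_rubric_band(score: float) -> str:
    """Return the label of the rubric anchor closest to `score`.

    Hypothetical helper for reporting; the judge itself may reason differently.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0.0, 1.0], got {score}")
    # Pick the anchor with the smallest absolute distance to the score.
    anchor = min(RUBRIC, key=lambda a: abs(a - score))
    return RUBRIC[anchor]


if __name__ == "__main__":
    for s in (0.95, 0.75, 0.45, 0.1):
        print(s, "->", nearest_rubric_band(s))
```

A helper like this is only useful for labeling results in dashboards or logs; pass/fail decisions should still compare the raw score against the configured threshold.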

Two Dimensions

  • Tone Match: Emotional alignment (enthusiasm, empathy, formality)
  • Style Match: Formatting, length, vocabulary, structure

Configuration

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| persona_description | str | None | Optional persona to enforce (e.g., "Helpful, excited, professional") |
| mode | EvaluationMode | GRANULAR | Evaluation detail level |

Persona Description

When provided, the persona description guides the judge on expected tone characteristics, making evaluation more precise for specific brand voices.
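One way to picture how an optional persona could guide an LLM judge is a prompt builder that adds a persona section only when one is supplied. This is a minimal sketch under stated assumptions: `build_judge_prompt` and the prompt wording are hypothetical and do not reflect the metric's actual internal prompt.

```python
from typing import Optional


def build_judge_prompt(
    actual_output: str,
    expected_output: str,
    persona_description: Optional[str] = None,
) -> str:
    """Assemble an illustrative judge prompt (hypothetical, not the library's prompt)."""
    sections = [
        "Compare the tone and style of the actual response against the expected response.",
        f"Expected response:\n{expected_output}",
        f"Actual response:\n{actual_output}",
    ]
    # Only mention the persona when the caller provided one.
    if persona_description:
        sections.insert(1, f"Target persona:\n{persona_description}")
    return "\n\n".join(sections)


if __name__ == "__main__":
    print(
        build_judge_prompt(
            actual_output="I apologize for the inconvenience.",
            expected_output="I'm truly sorry this happened.",
            persona_description="Empathetic, warm, solution-oriented",
        )
    )
```

Keeping the persona optional mirrors the `None` default in the configuration table: without it, the judge has only the expected output as its style reference.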


Code Examples

# Example 1: basic usage
from axion.metrics import ToneStyleConsistency
from axion.dataset import DatasetItem

metric = ToneStyleConsistency()

item = DatasetItem(
    actual_output="Your order has been shipped. It will arrive in 3-5 business days.",
    expected_output="Great news! 🎉 Your order is on its way! You can expect it within 3-5 business days. We're so excited for you!",
)

# `await` must run inside an async context (e.g., a notebook cell or an async function)
result = await metric.execute(item)
print(result.pretty())
# Expected score: around 0.5 (tone mismatch: neutral vs. enthusiastic)

# Example 2: supplying a persona description
from axion.metrics import ToneStyleConsistency
from axion.dataset import DatasetItem

metric = ToneStyleConsistency()

# The expected persona is passed on the dataset item
item = DatasetItem(
    actual_output="I apologize for the inconvenience. Let me help resolve this.",
    expected_output="I'm truly sorry this happened. I completely understand your frustration, and I'm here to make things right!",
    persona_description="Empathetic, warm, solution-oriented customer service agent",
)

result = await metric.execute(item)

# Example 3: running the metric over a dataset
from axion.metrics import ToneStyleConsistency
from axion.runners import MetricRunner

metric = ToneStyleConsistency()
runner = MetricRunner(metrics=[metric])
# `dataset` is assumed to be defined elsewhere
results = await runner.run(dataset)

for item_result in results:
    print(f"Score: {item_result.score}")
    print(f"Tone Match: {item_result.signals.tone_match}")
    print(f"Style Match: {item_result.signals.style_match}")
    for diff in item_result.signals.differences:
        # Difference fields: dimension, expected, actual, impact
        print(f"  - {diff.dimension}: expected {diff.expected!r}, got {diff.actual!r} ({diff.impact})")
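After a batch run, the per-item scores can be summarized against the default threshold of 0.8. The sketch below is a hedged illustration: `summarize_scores` is a hypothetical helper, and treating `score >= threshold` as a pass is an assumption rather than documented behavior.

```python
from statistics import mean
from typing import Dict, Sequence

DEFAULT_THRESHOLD = 0.8  # default threshold stated on this page


def summarize_scores(
    scores: Sequence[float], threshold: float = DEFAULT_THRESHOLD
) -> Dict[str, float]:
    """Summarize a batch of tone/style scores (hypothetical helper).

    Assumes an item passes when its score is greater than or equal to the threshold.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    passed = sum(1 for s in scores if s >= threshold)
    return {
        "mean_score": mean(scores),
        "pass_rate": passed / len(scores),
        "failed": len(scores) - passed,
    }


if __name__ == "__main__":
    print(summarize_scores([1.0, 0.5, 0.8, 0.0]))
```

With the runner example above, the input could be collected as `[r.score for r in results]`, assuming each result exposes `score` as shown there.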

Metric Diagnostics

Every evaluation is interpretable. Access the detailed diagnostic results via result.signals to see exactly why a score was given.

result = await metric.execute(item)
print(result.pretty())      # Human-readable summary
result.signals              # Full diagnostic breakdown
📊 ToneStyleResult Structure
ToneStyleResult(
{
    "final_score": 0.5,
    "tone_match": false,
    "style_match": true,
    "differences": [
        {
            "dimension": "Enthusiasm Level",
            "expected": "Excited, celebratory with emoji",
            "actual": "Neutral, matter-of-fact",
            "impact": "Major - missed opportunity to delight customer"
        },
        {
            "dimension": "Emotional Warmth",
            "expected": "Personal, caring language",
            "actual": "Formal, impersonal",
            "impact": "Moderate - feels robotic"
        }
    ]
}
)

Signal Fields

| Field | Type | Description |
|-------|------|-------------|
| final_score | float | Overall tone & style alignment score |
| tone_match | bool | Whether emotional tone matches expected |
| style_match | bool | Whether formatting/writing style matches |
| differences | List | Specific differences identified |

Difference Fields

| Field | Type | Description |
|-------|------|-------------|
| dimension | str | Aspect that differs (e.g., "Enthusiasm Level") |
| expected | str | What was expected |
| actual | str | What was observed |
| impact | str | Severity of the mismatch |
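The signal and difference fields above can be mirrored with plain dataclasses when you want to store or post-process diagnostics outside the library. This is a sketch, not the library's own classes: `Difference`, `ToneStyleSignals`, `signals_from_dict`, and `major_differences` are hypothetical names, and filtering on an impact string that starts with "Major" assumes the format seen in the example result above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Difference:
    dimension: str  # aspect that differs, e.g. "Enthusiasm Level"
    expected: str   # what was expected
    actual: str     # what was observed
    impact: str     # severity of the mismatch


@dataclass
class ToneStyleSignals:
    final_score: float
    tone_match: bool
    style_match: bool
    differences: List[Difference] = field(default_factory=list)


def signals_from_dict(payload: dict) -> ToneStyleSignals:
    """Build the illustrative dataclasses from a dict shaped like the example result."""
    return ToneStyleSignals(
        final_score=float(payload["final_score"]),
        tone_match=bool(payload["tone_match"]),
        style_match=bool(payload["style_match"]),
        differences=[Difference(**d) for d in payload.get("differences", [])],
    )


def major_differences(signals: ToneStyleSignals) -> List[Difference]:
    """Return differences whose impact text starts with 'Major' (assumed format)."""
    return [d for d in signals.differences if d.impact.lower().startswith("major")]


if __name__ == "__main__":
    example = {
        "final_score": 0.5,
        "tone_match": False,
        "style_match": True,
        "differences": [
            {
                "dimension": "Enthusiasm Level",
                "expected": "Excited, celebratory with emoji",
                "actual": "Neutral, matter-of-fact",
                "impact": "Major - missed opportunity to delight customer",
            }
        ],
    }
    print(major_differences(signals_from_dict(example)))
```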

Example Scenarios

✅ Scenario 1: Perfect Match (Score: 1.0)

Tone & Style Aligned

Expected Output:

"Hi there! 👋 Thanks for reaching out! I'd be happy to help you with your question about returns. Our policy allows full refunds within 30 days!"

AI Response:

"Hello! 😊 Thanks so much for contacting us! I'm thrilled to assist with your returns question. You can get a full refund within 30 days—no problem at all!"

Analysis:

| Dimension | Match |
|-----------|-------|
| Enthusiasm | ✅ Both excited and welcoming |
| Emoji usage | ✅ Appropriate friendly emoji |
| Formality | ✅ Casual, approachable |
| Helpfulness | ✅ Eager to assist |

Final Score: 1.0

⚠️ Scenario 2: Style Drift (Score: 0.5)

Tone Mismatch

Expected Output:

"Great news! 🎉 Your order is on its way! You can expect delivery within 3-5 business days. We're so excited for you!"

AI Response:

"Your order has been shipped. Estimated delivery: 3-5 business days."

Analysis:

| Dimension | Match |
|-----------|-------|
| Information | ✅ Same facts conveyed |
| Enthusiasm | ❌ Neutral vs celebratory |
| Emoji usage | ❌ None vs appropriate celebration |
| Warmth | ❌ Impersonal vs personal |

Final Score: 0.5

Content is correct but voice is completely wrong.

❌ Scenario 3: Complete Mismatch (Score: 0.0)

Persona Ignored

Expected Output:

"I'm so sorry to hear about this issue! 😔 That's definitely not the experience we want for you. Let me personally look into this right away and make it right!"

AI Response:

"Your complaint has been logged. Reference number: #12345. Allow 5-7 business days for review."

Analysis:

| Dimension | Match |
|-----------|-------|
| Empathy | ❌ None vs deeply apologetic |
| Tone | ❌ Cold/bureaucratic vs warm |
| Personal touch | ❌ Ticket number vs personal commitment |
| Resolution focus | ❌ Process vs solution |

Final Score: 0.0

Response is robotic when empathy was expected.


Why It Matters

🎭 Brand Voice

Ensures AI maintains your brand's personality across all interactions. Inconsistent tone damages brand perception.

💬 Customer Experience

Customers expect warmth and empathy, not robotic responses. Tone directly impacts satisfaction and loyalty.

🔄 Consistency

Maintain uniform voice across all AI-generated responses, regardless of the underlying model or prompt.


Quick Reference

TL;DR

Tone & Style Consistency = Does the AI response sound like it should?

  • Use it when: Brand voice and persona consistency matter
  • Score interpretation: Higher = better alignment with expected tone
  • Key difference: Measures how something is said, not what is said