Tone & Style Consistency
LLM-Powered · Knowledge · Single Turn
At a Glance
- Score Range: 0.0 ──────── 1.0 (style alignment score)
- Default Threshold: 0.8 (higher bar for consistency)
- Required Inputs: actual_output, expected_output
- Optional Inputs: persona_description
What It Measures
Tone & Style Consistency evaluates whether a response matches the emotional tone and writing style of an expected answer. For customer service agents, "Voice" is as important as "Fact"—this metric ensures your AI maintains the right persona.
| Score | Interpretation |
|---|---|
| 1.0 | Perfect match—exact emotion, enthusiasm, formatting |
| 0.8 | Minor drift—generally correct but slightly off |
| 0.5 | Significant mismatch—wrong tone or style |
| 0.0 | Complete failure—robotic, rude, or ignores persona |
Use when:
- Building customer service agents
- Persona consistency matters
- Brand voice guidelines exist
- Comparing against reference responses

Avoid when:
- No expected_output available
- Tone flexibility is acceptable
- Only factual accuracy matters
- Creative writing tasks
See Also: Answer Completeness
Tone & Style Consistency evaluates how something is said (voice, formatting). Answer Completeness evaluates what is said (content coverage).
Use both together for comprehensive response quality evaluation.
How It Works
The metric uses an LLM-based judge to evaluate both emotional tone and writing style.
Step-by-Step Process
flowchart TD
subgraph INPUT["📥 Inputs"]
A[Actual Output]
B[Expected Output]
C[Persona Description]
end
subgraph ANALYZE["🔍 Step 1: Style Analysis"]
D[ToneJudge LLM]
E["Tone & Style Comparison"]
end
subgraph EVALUATE["⚖️ Step 2: Dimension Scoring"]
F[Evaluate Tone Match]
G[Evaluate Style Match]
H["Identify Differences"]
end
subgraph SCORE["📊 Step 3: Final Score"]
I["Combine Dimensions"]
J["Final Score"]
end
A & B & C --> D
D --> E
E --> F & G
F & G --> H
H --> I
I --> J
style INPUT stroke:#1E3A5F,stroke-width:2px
style ANALYZE stroke:#3b82f6,stroke-width:2px
style EVALUATE stroke:#f59e0b,stroke-width:2px
style SCORE stroke:#10b981,stroke-width:2px
style J fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff
The judge scores each response against a four-level rubric:
- 1.0: Exact emotion, enthusiasm level, and formatting style.
- 0.8: Generally correct but slightly less enthusiastic or formal.
- 0.5: Neutral when it should be excited, or a completely different style.
- 0.0: Robotic, rude, or completely ignores the persona.
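To make the rubric concrete, here is a minimal sketch of how a score could be read back into a rubric level and checked against the default threshold of 0.8. The function name `interpret_score`, the short labels, and the rule that in-between scores fall into the nearest lower band are assumptions made for illustration; they are not part of the library, and passing at `score >= threshold` is likewise assumed.

```python
# Anchor points copied from the rubric above. The floor-banding rule and
# the ">= threshold means pass" rule are illustrative assumptions.
RUBRIC = [
    (1.0, "Perfect match"),
    (0.8, "Minor drift"),
    (0.5, "Significant mismatch"),
    (0.0, "Complete failure"),
]


def interpret_score(score: float, threshold: float = 0.8) -> tuple[str, bool]:
    """Return (rubric label, passed) for a score in [0.0, 1.0]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0.0, 1.0], got {score}")
    # RUBRIC is sorted from highest to lowest anchor, so the first anchor
    # that the score reaches is the nearest lower band.
    label = next(name for anchor, name in RUBRIC if score >= anchor)
    return label, score >= threshold


print(interpret_score(0.5))  # ('Significant mismatch', False)
```

A helper like this is mainly useful for reports, where a label is easier to scan than a raw float.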
Two Dimensions
- Tone Match: Emotional alignment (enthusiasm, empathy, formality)
- Style Match: Formatting, length, vocabulary, structure
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| persona_description | str | None | Optional persona to enforce (e.g., "Helpful, excited, professional") |
| mode | EvaluationMode | GRANULAR | Evaluation detail level |
Persona Description
When provided, the persona description guides the judge on expected tone characteristics, making evaluation more precise for specific brand voices.
Code Examples
# Example: basic usage
from axion.metrics import ToneStyleConsistency
from axion.dataset import DatasetItem

metric = ToneStyleConsistency()
item = DatasetItem(
    actual_output="Your order has been shipped. It will arrive in 3-5 business days.",
    expected_output="Great news! 🎉 Your order is on its way! You can expect it within 3-5 business days. We're so excited for you!",
)
# `await` must run inside an async context (async function or notebook)
result = await metric.execute(item)
print(result.pretty())
# Score: 0.5 (tone mismatch - neutral vs enthusiastic)

# Example: evaluating against a persona
from axion.metrics import ToneStyleConsistency
from axion.dataset import DatasetItem

metric = ToneStyleConsistency()
# Define the expected persona on the item
item = DatasetItem(
    actual_output="I apologize for the inconvenience. Let me help resolve this.",
    expected_output="I'm truly sorry this happened. I completely understand your frustration, and I'm here to make things right!",
    persona_description="Empathetic, warm, solution-oriented customer service agent",
)
result = await metric.execute(item)

# Example: running over a dataset
from axion.metrics import ToneStyleConsistency
from axion.runners import MetricRunner

metric = ToneStyleConsistency()
runner = MetricRunner(metrics=[metric])
# `dataset` is assumed to be defined elsewhere
results = await runner.run(dataset)
for item_result in results:
    print(f"Score: {item_result.score}")
    print(f"Tone Match: {item_result.signals.tone_match}")
    print(f"Style Match: {item_result.signals.style_match}")
    for diff in item_result.signals.differences:
        # Fields as listed under Difference Fields
        print(f"  - {diff.dimension}: expected {diff.expected!r}, got {diff.actual!r} ({diff.impact})")
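
When a run covers many items, an aggregate view is often more useful than per-item prints. The sketch below is a standalone helper, not part of axion: `summarize_scores` is a hypothetical name, it works on plain floats (for example, scores collected from the loop above), and it assumes an item passes when its score is at or above the 0.8 default threshold.

```python
from statistics import mean


def summarize_scores(scores: list[float], threshold: float = 0.8) -> dict:
    """Aggregate per-item scores into a mean, a pass rate, and failing indices."""
    if not scores:
        raise ValueError("scores must not be empty")
    # Assumption: an item fails when its score is strictly below the threshold.
    failing = [i for i, s in enumerate(scores) if s < threshold]
    return {
        "mean_score": mean(scores),
        "pass_rate": (len(scores) - len(failing)) / len(scores),
        "failing_indices": failing,
    }


# Hypothetical scores for four items
summary = summarize_scores([1.0, 0.5, 0.0, 0.8])
print(summary["pass_rate"])        # 0.5
print(summary["failing_indices"])  # [1, 2]
```

The failing indices point back to the items whose differences are worth reading first.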
Metric Diagnostics
Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given—no black boxes.
result = await metric.execute(item)
print(result.pretty()) # Human-readable summary
result.signals # Full diagnostic breakdown
📊 ToneStyleResult Structure
ToneStyleResult(
{
"final_score": 0.5,
"tone_match": false,
"style_match": true,
"differences": [
{
"dimension": "Enthusiasm Level",
"expected": "Excited, celebratory with emoji",
"actual": "Neutral, matter-of-fact",
"impact": "Major - missed opportunity to delight customer"
},
{
"dimension": "Emotional Warmth",
"expected": "Personal, caring language",
"actual": "Formal, impersonal",
"impact": "Moderate - feels robotic"
}
]
}
)
Signal Fields
| Field | Type | Description |
|---|---|---|
| final_score | float | Overall tone & style alignment score |
| tone_match | bool | Whether emotional tone matches expected |
| style_match | bool | Whether formatting/writing style matches |
| differences | List | Specific differences identified |
Difference Fields
| Field | Type | Description |
|---|---|---|
| dimension | str | Aspect that differs (e.g., "Enthusiasm Level") |
| expected | str | What was expected |
| actual | str | What was observed |
| impact | str | Severity of the mismatch |
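
The field tables above can be mirrored with plain dataclasses to experiment with diagnostics offline. This is only a sketch: `Difference` and `ToneStyleSignals` here are stand-ins that copy the documented field names, not the library's own classes, and the "Severity - explanation" shape of `impact` is inferred from the single example result shown earlier.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Difference:
    """Stand-in for one entry of `differences` (not the library class)."""
    dimension: str
    expected: str
    actual: str
    impact: str


@dataclass
class ToneStyleSignals:
    """Stand-in mirroring the documented signal fields."""
    final_score: float
    tone_match: bool
    style_match: bool
    differences: list[Difference] = field(default_factory=list)


def impact_counts(signals: ToneStyleSignals) -> Counter:
    """Count differences by the severity word that leads each impact string."""
    # Assumption: impact strings look like "Major - explanation".
    return Counter(d.impact.split(" - ", 1)[0].strip() for d in signals.differences)


signals = ToneStyleSignals(
    final_score=0.5,
    tone_match=False,
    style_match=True,
    differences=[
        Difference("Enthusiasm Level", "Excited, celebratory with emoji",
                   "Neutral, matter-of-fact",
                   "Major - missed opportunity to delight customer"),
        Difference("Emotional Warmth", "Personal, caring language",
                   "Formal, impersonal", "Moderate - feels robotic"),
    ],
)
print(impact_counts(signals))  # Counter({'Major': 1, 'Moderate': 1})
```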
Example Scenarios
✅ Scenario 1: Perfect Match (Score: 1.0)
Tone & Style Aligned
Expected Output:
"Hi there! 👋 Thanks for reaching out! I'd be happy to help you with your question about returns. Our policy allows full refunds within 30 days!"
AI Response:
"Hello! 😊 Thanks so much for contacting us! I'm thrilled to assist with your returns question. You can get a full refund within 30 days—no problem at all!"
Analysis:
| Dimension | Match |
|---|---|
| Enthusiasm | ✅ Both excited and welcoming |
| Emoji usage | ✅ Appropriate friendly emoji |
| Formality | ✅ Casual, approachable |
| Helpfulness | ✅ Eager to assist |
Final Score: 1.0
⚠️ Scenario 2: Style Drift (Score: 0.5)
Tone Mismatch
Expected Output:
"Great news! 🎉 Your order is on its way! You can expect delivery within 3-5 business days. We're so excited for you!"
AI Response:
"Your order has been shipped. Estimated delivery: 3-5 business days."
Analysis:
| Dimension | Match |
|---|---|
| Information | ✅ Same facts conveyed |
| Enthusiasm | ❌ Neutral vs celebratory |
| Emoji usage | ❌ None vs appropriate celebration |
| Warmth | ❌ Impersonal vs personal |
Final Score: 0.5
Content is correct but voice is completely wrong.
❌ Scenario 3: Complete Mismatch (Score: 0.0)
Persona Ignored
Expected Output:
"I'm so sorry to hear about this issue! 😔 That's definitely not the experience we want for you. Let me personally look into this right away and make it right!"
AI Response:
"Your complaint has been logged. Reference number: #12345. Allow 5-7 business days for review."
Analysis:
| Dimension | Match |
|---|---|
| Empathy | ❌ None vs deeply apologetic |
| Tone | ❌ Cold/bureaucratic vs warm |
| Personal touch | ❌ Ticket number vs personal commitment |
| Resolution focus | ❌ Process vs solution |
Final Score: 0.0
Response is robotic when empathy was expected.
Why It Matters
- Ensures the AI maintains your brand's personality across all interactions. Inconsistent tone damages brand perception.
- Customers expect warmth and empathy, not robotic responses. Tone directly impacts satisfaction and loyalty.
- Maintains a uniform voice across all AI-generated responses, regardless of the underlying model or prompt.
Quick Reference
TL;DR
Tone & Style Consistency = Does the AI response sound like it should?
- Use it when: Brand voice and persona consistency matter
- Score interpretation: Higher = better alignment with expected tone
- Key difference: Measures how something is said, not what is said
- API Reference
- Related Metrics: Answer Completeness · Answer Relevancy · Answer Criteria