Sentence BLEU¶
Heuristic · Single Turn · Fast
At a Glance¶
Score Range
0.0 ──────── 1.0
N-gram precision score
Default Threshold
0.5
Pass/fail cutoff
Required Inputs
`actual_output`, `expected_output`
Reference text required
What It Measures¶
Sentence BLEU (Bilingual Evaluation Understudy) computes the similarity between a candidate text and reference text using modified n-gram precision with a brevity penalty. Originally designed for machine translation, it's useful for any task where textual similarity to a reference matters.
| Score | Interpretation |
|---|---|
| 1.0 | Perfect n-gram match with reference |
| 0.7+ | High similarity, minor differences |
| 0.3-0.7 | Moderate similarity |
| < 0.3 | Low similarity to reference |
Use it when:

- Comparing text to reference translations
- Evaluating summarization quality
- Fast, deterministic evaluation is needed
- N-gram overlap is meaningful

Avoid it when:

- Semantic similarity matters more than exact wording
- Multiple valid phrasings exist
- Evaluating creative/generative tasks
- Word order flexibility is expected
See Also: Levenshtein Ratio
Sentence BLEU measures n-gram precision (word sequences). Levenshtein Ratio measures character-level edit distance.
Use BLEU for word-level comparison; use Levenshtein for character-level.
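The contrast is easy to see on a single pair of strings. The sketch below compares word-level unigram precision (the n=1 piece of BLEU) against a character-level similarity from the standard library; note that `difflib.SequenceMatcher.ratio()` uses Ratcliff/Obershelp matching rather than true Levenshtein distance, but it serves as an illustrative character-level stand-in:

```python
import difflib
from collections import Counter

ref = "the cat sat on the mat"
cand = "the cats sat on the mats"

# Character-level similarity: only two inserted letters, so the ratio is high
char_ratio = difflib.SequenceMatcher(None, cand, ref).ratio()

# Word-level unigram precision: "cats" and "mats" are entirely different words
cand_counts, ref_counts = Counter(cand.split()), Counter(ref.split())
overlap = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
unigram_precision = overlap / sum(cand_counts.values())

print(f"char ratio:        {char_ratio:.2f}")         # ~0.96
print(f"unigram precision: {unigram_precision:.2f}")  # ~0.67
```

A pluralized word barely moves the character-level score but costs a full word (and every n-gram containing it) at the word level.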
How It Works¶
BLEU calculates n-gram precision with clipping and applies a brevity penalty.
Step-by-Step Process¶
```mermaid
flowchart TD
    subgraph INPUT["Inputs"]
        A[Candidate Text]
        B[Reference Text]
    end
    subgraph NGRAM["Step 1: N-gram Extraction"]
        C[Extract 1-grams to n-grams]
        D1["1-gram counts"]
        D2["2-gram counts"]
        D3["3-gram counts"]
        DN["n-gram counts"]
    end
    subgraph PRECISION["Step 2: Clipped Precision"]
        E[Clip counts to reference max]
        F["Calculate precision per n"]
    end
    subgraph SCORE["Step 3: Final Score"]
        G["Geometric mean of precisions"]
        H["Apply brevity penalty"]
        I["Final BLEU Score"]
    end
    A & B --> C
    C --> D1 & D2 & D3 & DN
    D1 & D2 & D3 & DN --> E
    E --> F
    F --> G
    G --> H
    H --> I
    style INPUT stroke:#f59e0b,stroke-width:2px
    style NGRAM stroke:#3b82f6,stroke-width:2px
    style PRECISION stroke:#8b5cf6,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style I fill:#f59e0b,stroke:#d97706,stroke-width:3px,color:#fff
```
Modified Precision:

$$p_n = \frac{\sum_{g \in \text{n-grams}(C)} \min\big(\mathrm{Count}_C(g),\ \mathrm{Count}_R(g)\big)}{\sum_{g \in \text{n-grams}(C)} \mathrm{Count}_C(g)}$$

This prevents gaming the score by repeating words: each n-gram is counted at most as many times as it appears in the reference.

Brevity Penalty (BP):

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$

where c is the candidate length and r is the reference length. This penalizes outputs shorter than the reference, preventing gaming by emitting only high-confidence words.

Final Score:

$$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad w_n = \frac{1}{N}$$
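The whole pipeline fits in a few lines of plain Python. This is a minimal, unsmoothed sketch for illustration; axion's `SentenceBleu` may differ in tokenization, case handling, and smoothing:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(cand, ref, n):
    """Modified n-gram precision: candidate counts clipped to reference counts."""
    cand_counts = Counter(ngrams(cand, n))
    ref_counts = Counter(ngrams(ref, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def sentence_bleu(candidate, reference, max_n=4):
    cand, ref = candidate.lower().split(), reference.lower().split()
    precisions = [clipped_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        # Any zero precision collapses the geometric mean (why smoothing exists)
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: 1 if candidate is longer than reference, else e^(1 - r/c)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * geo_mean
```

An identical candidate and reference score exactly 1.0; a candidate with no unigram overlap scores 0.0.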
Configuration¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `n_grams` | `int` | `4` | Maximum n-gram length (e.g., 4 for BLEU-4) |
| `case_sensitive` | `bool` | `False` | Whether comparison is case-sensitive |
| `smoothing` | `bool` | `True` | Apply smoothing for sentence-level BLEU |
Smoothing
Sentence-level BLEU often has zero counts for higher n-grams. Smoothing (add-one) prevents the entire score from becoming zero.
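The effect is easy to demonstrate. Below is a small sketch of add-one smoothing applied to each precision's numerator and denominator; this is illustrative only, not necessarily axion's exact smoothing scheme:

```python
from collections import Counter

def precisions(cand, ref, max_n=4, smooth=False):
    """Clipped n-gram precisions, optionally with add-one smoothing."""
    out = []
    for n in range(1, max_n + 1):
        cgrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        rgrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        clipped = sum(min(c, rgrams[g]) for g, c in cgrams.items())
        total = sum(cgrams.values())
        if smooth:
            clipped, total = clipped + 1, total + 1
        out.append(clipped / total if total else 0.0)
    return out

cand = "the cat sat".split()
ref = "the dog sat down".split()
print(precisions(cand, ref))               # higher-order precisions are 0 -> BLEU would be 0
print(precisions(cand, ref, smooth=True))  # every precision stays positive
```

Without smoothing, one zero precision zeroes the geometric mean and hence the whole score, no matter how good the unigram overlap is.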
Code Examples¶
```python
import asyncio

from axion.dataset import DatasetItem
from axion.metrics import SentenceBleu

async def main():
    metric = SentenceBleu()
    item = DatasetItem(
        actual_output="The cat sat on the mat.",
        expected_output="The cat is sitting on the mat.",
    )
    result = await metric.execute(item)
    print(result.pretty())
    # Score: ~0.6 (good n-gram overlap with minor differences)

asyncio.run(main())
```
Example Scenarios¶
✅ Scenario 1: High BLEU Score (~0.6)
Near-Perfect Match
Reference:
"The quick brown fox jumps over the lazy dog."
Candidate:
"The quick brown fox jumped over the lazy dog."
Analysis:
- 1-grams: 8/9 match ("jumped" vs "jumps")
- 2-grams: 6/8 match (every bigram containing "jumped" misses)
- 3-grams: 4/7 match
- 4-grams: 2/6 match
- Brevity penalty: 1.0 (same length)
Final Score: ~0.6
A single changed word breaks every higher-order n-gram that spans it, so even near-identical sentences lose noticeable score.
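The per-order counts in this scenario can be recomputed directly (whitespace tokenization with punctuation left attached, case-folded; actual implementations may tokenize differently):

```python
from collections import Counter

ref = "The quick brown fox jumps over the lazy dog.".lower().split()
cand = "The quick brown fox jumped over the lazy dog.".lower().split()

rows = []
for n in range(1, 5):
    cg = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    rg = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    clipped = sum(min(c, rg[g]) for g, c in cg.items())
    rows.append((clipped, sum(cg.values())))
    print(f"{n}-grams: {clipped}/{sum(cg.values())}")
```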
⚠️ Scenario 2: Moderate BLEU Score (~0.3)
Partial Overlap
Reference:
"Machine learning models require large datasets for training."
Candidate:
"Deep learning needs big data to train properly."
Analysis:
- Same meaning, different words
- Few exact n-gram matches
- "learning" is the only shared unigram; "train" vs "training" do not match exactly
Final Score: ~0.3
Semantic similarity is high, but n-gram overlap is low.
❌ Scenario 3: Low BLEU Score (~0.1)
Minimal Overlap
Reference:
"Paris is the capital of France."
Candidate:
"The Eiffel Tower is located in the French capital city."
Analysis:
- Related topic, completely different wording
- Almost no n-gram matches
Final Score: ~0.1
Semantically related but lexically different.
Why It Matters¶
- No LLM calls needed: instant, reproducible results, ideal for CI/CD pipelines.
- Widely used in NLP research for translation and summarization evaluation.
- Captures phrase-level similarity, not just word overlap.
Quick Reference¶
TL;DR
Sentence BLEU = How much does the candidate text overlap with the reference at the n-gram level?
- Use it when: Fast, deterministic text similarity is needed
- Score interpretation: Higher = more n-gram overlap with reference
- Key config: `n_grams` controls phrase length (default 4)
- API Reference
- Related Metrics: Levenshtein Ratio · Exact String Match · Contains Match