
Sentence BLEU

Compute n-gram precision similarity between candidate and reference text
Heuristic · Single Turn · Fast

At a Glance

🎯
Score Range
0.0 ──────── 1.0
N-gram precision score
⚡
Default Threshold
0.5
Pass/fail cutoff
📋
Required Inputs
actual_output expected_output
Reference text required

What It Measures

Sentence BLEU (Bilingual Evaluation Understudy) computes the similarity between a candidate text and reference text using modified n-gram precision with a brevity penalty. Originally designed for machine translation, it's useful for any task where textual similarity to a reference matters.

Score Interpretation
1.0 Perfect n-gram match with reference
0.7+ High similarity, minor differences
0.3-0.7 Moderate similarity
< 0.3 Low similarity to reference
✅ Use When
  • Comparing text to reference translations
  • Evaluating summarization quality
  • Fast, deterministic evaluation needed
  • N-gram overlap is meaningful
โŒ Don't Use When
  • Semantic similarity matters more than wording
  • Multiple valid phrasings exist
  • Evaluating creative/generative tasks
  • Word order flexibility is expected

See Also: Levenshtein Ratio

Sentence BLEU measures n-gram precision (word sequences). Levenshtein Ratio measures character-level edit distance.

Use BLEU for word-level comparison; use Levenshtein for character-level.
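The distinction can be seen with Python's standard library: difflib's SequenceMatcher gives a character-level ratio, while a simple token-overlap count stands in as a rough word-level proxy (an illustration only, not either metric's actual implementation):

```python
# Character-level vs word-level similarity on a near-identical pair.
# difflib is stdlib; the token-overlap score below is an illustrative
# stand-in for word-level matching, not the library's BLEU code.
from difflib import SequenceMatcher

reference = "The cat sat on the mat."
candidate = "The cats sat on the mats."

# Character-level similarity (Levenshtein-style ratio)
char_ratio = SequenceMatcher(None, reference, candidate).ratio()

# Word-level overlap: the plural forms break exact token matches
ref_tokens = reference.lower().split()
cand_tokens = candidate.lower().split()
word_overlap = sum(t in ref_tokens for t in cand_tokens) / len(cand_tokens)

print(f"character-level: {char_ratio:.2f}")   # high: only two chars differ
print(f"word-level:      {word_overlap:.2f}")  # lower: 'cats.'/'mats.' miss
```

Two inserted characters barely move the character-level ratio but knock out two of six tokens at the word level, which is why the two metrics suit different tasks.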


How It Works

BLEU calculates n-gram precision with clipping and applies a brevity penalty.

Step-by-Step Process

flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Candidate Text]
        B[Reference Text]
    end

    subgraph NGRAM["🔍 Step 1: N-gram Extraction"]
        C[Extract 1-grams to n-grams]
        D1["1-gram counts"]
        D2["2-gram counts"]
        D3["3-gram counts"]
        DN["n-gram counts"]
    end

    subgraph PRECISION["⚖️ Step 2: Clipped Precision"]
        E[Clip counts to reference max]
        F["Calculate precision per n"]
    end

    subgraph SCORE["📊 Step 3: Final Score"]
        G["Geometric mean of precisions"]
        H["Apply brevity penalty"]
        I["Final BLEU Score"]
    end

    A & B --> C
    C --> D1 & D2 & D3 & DN
    D1 & D2 & D3 & DN --> E
    E --> F
    F --> G
    G --> H
    H --> I

    style INPUT stroke:#f59e0b,stroke-width:2px
    style NGRAM stroke:#3b82f6,stroke-width:2px
    style PRECISION stroke:#8b5cf6,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style I fill:#f59e0b,stroke:#d97706,stroke-width:3px,color:#fff

Modified Precision:

p_n = Σ min(count(ngram), max_ref_count(ngram)) / Σ count(ngram)

Brevity Penalty (BP):

BP = 1                    if c > r
BP = exp(1 - r/c)         if c ≤ r

where c = candidate length, r = reference length

Final Score:

BLEU = BP × exp(Σ w_n × log(p_n))

where w_n = 1/N (uniform weights)

๐Ÿ“ Clipping
Prevents gaming by repeating words. Each n-gram counted at most as many times as it appears in reference.
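The clipping rule can be sketched in a few lines (a stdlib illustration of the idea, not the library's code):

```python
# Count clipping: without it, repeating a common reference word inflates
# unigram precision; clipping caps each count at the reference maximum.
from collections import Counter

reference = "the cat sat on the mat".split()
candidate = "the the the the the the".split()  # degenerate gaming attempt

cand_counts = Counter(candidate)  # {'the': 6}
ref_counts = Counter(reference)   # 'the' appears at most 2 times

# Unclipped: every candidate token occurs somewhere in the reference
unclipped = sum(c for w, c in cand_counts.items() if w in ref_counts) / len(candidate)
# Clipped: 'the' is counted at most twice, matching the reference
clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items()) / len(candidate)

print(unclipped)  # 1.0
print(clipped)    # 0.333... (2/6)
```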

๐Ÿ“ Brevity Penalty
Penalizes outputs shorter than reference. Prevents gaming by outputting only high-confidence words.
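The three steps above can be combined into a minimal sentence-BLEU sketch, assuming whitespace tokenization and no smoothing. This is an illustration of the formulas, not the library's internal implementation:

```python
# Minimal BLEU-4 from the formulas above: clipped n-gram precision,
# geometric mean with uniform weights, brevity penalty. No smoothing,
# so any zero precision drives the whole score to 0.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    cand = candidate.lower().split()
    ref = reference.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each n-gram count to its maximum count in the reference
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # unsmoothed: a zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: BP = 1 if c > r, else exp(1 - r/c)
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)

score = sentence_bleu(
    "The quick brown fox jumped over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",
)
print(f"{score:.3f}")  # ~0.597: one changed word breaks every spanning n-gram
```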


Configuration

| Parameter      | Type | Default | Description                                |
|----------------|------|---------|--------------------------------------------|
| n_grams        | int  | 4       | Maximum n-gram length (e.g., 4 for BLEU-4) |
| case_sensitive | bool | False   | Whether comparison is case-sensitive       |
| smoothing      | bool | True    | Apply smoothing for sentence-level BLEU    |

Smoothing

Sentence-level BLEU often has zero counts for higher n-grams. Smoothing (add-one) prevents the entire score from becoming zero.
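A short sketch shows the effect: the pair below shares many lower-order n-grams but no 4-grams, so the unsmoothed geometric mean collapses to zero while add-one smoothing keeps it nonzero. The helper name is hypothetical and stdlib-only, not the library's implementation; the brevity penalty is omitted since both sentences have equal length (BP = 1):

```python
# Why sentence-level BLEU needs smoothing: with no 4-gram matches,
# unsmoothed BLEU-4 is 0 despite strong lower-order overlap.
# clipped_precision is a hypothetical helper, not the library's API.
import math
from collections import Counter

def clipped_precision(cand, ref, n, smooth=False):
    cgrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    rgrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    match = sum(min(c, rgrams[g]) for g, c in cgrams.items())
    total = sum(cgrams.values())
    if smooth:
        return (match + 1) / (total + 1)  # add-one smoothing
    return match / total if total else 0.0

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()

for smooth in (False, True):
    ps = [clipped_precision(cand, ref, n, smooth) for n in range(1, 5)]
    score = 0.0 if 0 in ps else math.exp(sum(map(math.log, ps)) / 4)
    print(smooth, [round(p, 2) for p in ps], round(score, 3))
# smooth=False: p4 = 0/3 -> score 0.0
# smooth=True:  [6/7, 4/6, 2/5, 1/4] -> score ~0.489
```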


Code Examples

from axion.metrics import SentenceBleu
from axion.dataset import DatasetItem

metric = SentenceBleu()

item = DatasetItem(
    actual_output="The cat sat on the mat.",
    expected_output="The cat is sitting on the mat.",
)

result = await metric.execute(item)
print(result.pretty())
# Score: ~0.6 (good n-gram overlap with minor differences)

Configuration options:

from axion.metrics import SentenceBleu

# BLEU-2 for shorter sequences
metric = SentenceBleu(n_grams=2)

# Case-sensitive comparison
metric = SentenceBleu(case_sensitive=True)

# Without smoothing (corpus-level style)
metric = SentenceBleu(smoothing=False)

Batch evaluation:

from axion.metrics import SentenceBleu
from axion.runners import MetricRunner

metric = SentenceBleu(n_grams=4)
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)

for item_result in results:
    print(f"BLEU-4: {item_result.score:.3f}")

Example Scenarios

✅ Scenario 1: High BLEU Score (~0.6)

Near-Perfect Match

Reference:

"The quick brown fox jumps over the lazy dog."

Candidate:

"The quick brown fox jumped over the lazy dog."

Analysis:

  • 1-grams: 8/9 match (jumped vs jumps)
  • 2-grams: 6/8 match (both bigrams containing "jumped" miss)
  • 3-grams: 4/7 match
  • 4-grams: 2/6 match
  • Brevity penalty: 1.0 (same length)

Final Score: ~0.6

A single changed word breaks every n-gram that spans it, so even a near-perfect match scores well below 1.0.
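Counting n-gram matches by hand is error-prone; a short stdlib script (not part of the library) can check the tallies for this pair:

```python
# Verify the clipped n-gram match counts for the scenario above.
from collections import Counter

ref = "The quick brown fox jumps over the lazy dog.".lower().split()
cand = "The quick brown fox jumped over the lazy dog.".lower().split()

counts = {}
for n in range(1, 5):
    cgrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    rgrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matched = sum(min(c, rgrams[g]) for g, c in cgrams.items())
    counts[n] = (matched, sum(cgrams.values()))
    print(f"{n}-grams: {counts[n][0]}/{counts[n][1]}")
# 1-grams: 8/9, 2-grams: 6/8, 3-grams: 4/7, 4-grams: 2/6
```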

โš ๏ธ Scenario 2: Moderate BLEU Score (~0.5)

Partial Overlap

Reference:

"Machine learning models require large datasets for training."

Candidate:

"Deep learning needs big data to train properly."

Analysis:

  • Same meaning, different words
  • Few exact n-gram matches
  • "learning" and "train" overlap

Final Score: ~0.3

Semantic similarity high, but n-gram overlap low.

โŒ Scenario 3: Low BLEU Score (~0.1)

Minimal Overlap

Reference:

"Paris is the capital of France."

Candidate:

"The Eiffel Tower is located in the French capital city."

Analysis:

  • Related topic, completely different wording
  • Almost no n-gram matches

Final Score: ~0.1

Semantically related but lexically different.


Why It Matters

⚡ Fast & Deterministic

No LLM calls needed. Instant, reproducible results ideal for CI/CD pipelines.

📊 Industry Standard

Widely used in NLP research for translation and summarization evaluation.

🔢 N-gram Precision

Captures phrase-level similarity, not just word overlap.


Quick Reference

TL;DR

Sentence BLEU = How much does the candidate text overlap with the reference at the n-gram level?

  • Use it when: Fast, deterministic text similarity is needed
  • Score interpretation: Higher = more n-gram overlap with reference
  • Key config: n_grams controls phrase length (default 4)