
Sentence BLEU

Compute n-gram precision similarity between candidate and reference text
Heuristic · Single Turn · Fast

At a Glance

🎯
Score Range
0.0 ──────── 1.0
N-gram precision score
⚡
Default Threshold
0.5
Pass/fail cutoff
📋
Required Inputs
actual_output expected_output
Reference text required

What It Measures

Sentence BLEU (Bilingual Evaluation Understudy) computes the similarity between a candidate text and reference text using modified n-gram precision with a brevity penalty. Originally designed for machine translation, it's useful for any task where textual similarity to a reference matters.

Score Interpretation
1.0 Perfect n-gram match with reference
0.7+ High similarity, minor differences
0.3-0.7 Moderate similarity
< 0.3 Low similarity to reference
✅ Use When
  • Comparing text to reference translations
  • Evaluating summarization quality
  • Fast, deterministic evaluation needed
  • N-gram overlap is meaningful
โŒ Don't Use When
  • Semantic similarity matters more than wording
  • Multiple valid phrasings exist
  • Evaluating creative/generative tasks
  • Word order flexibility is expected

See Also: Levenshtein Ratio

Sentence BLEU measures n-gram precision (word sequences). Levenshtein Ratio measures character-level edit distance.

Use BLEU for word-level comparison; use Levenshtein for character-level.
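The distinction can be seen with Python's standard library: difflib's SequenceMatcher gives a character-level ratio, while a simple token-overlap count stands in as a rough word-level proxy (an illustration only, not either metric's actual implementation):

```python
# Character-level vs word-level similarity on a near-identical pair.
# difflib is stdlib; the token-overlap score below is an illustrative
# stand-in for word-level matching, not the library's BLEU code.
from difflib import SequenceMatcher

reference = "The cat sat on the mat."
candidate = "The cats sat on the mats."

# Character-level similarity (Levenshtein-style ratio)
char_ratio = SequenceMatcher(None, reference, candidate).ratio()

# Word-level overlap: the plural forms break exact token matches
ref_tokens = reference.lower().split()
cand_tokens = candidate.lower().split()
word_overlap = sum(t in ref_tokens for t in cand_tokens) / len(cand_tokens)

print(f"character-level: {char_ratio:.2f}")   # high: only two chars differ
print(f"word-level:      {word_overlap:.2f}")  # lower: 'cats.'/'mats.' miss
```

Two inserted characters barely move the character-level ratio but knock out two of six tokens at the word level, which is why the two metrics suit different tasks.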


How It Works

BLEU calculates n-gram precision with clipping and applies a brevity penalty.

Step-by-Step Process

flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Candidate Text]
        B[Reference Text]
    end

    subgraph NGRAM["🔍 Step 1: N-gram Extraction"]
        C[Extract 1-grams to n-grams]
        D1["1-gram counts"]
        D2["2-gram counts"]
        D3["3-gram counts"]
        DN["n-gram counts"]
    end

    subgraph PRECISION["⚖️ Step 2: Clipped Precision"]
        E[Clip counts to reference max]
        F["Calculate precision per n"]
    end

    subgraph SCORE["📊 Step 3: Final Score"]
        G["Geometric mean of precisions"]
        H["Apply brevity penalty"]
        I["Final BLEU Score"]
    end

    A & B --> C
    C --> D1 & D2 & D3 & DN
    D1 & D2 & D3 & DN --> E
    E --> F
    F --> G
    G --> H
    H --> I

    style INPUT stroke:#f59e0b,stroke-width:2px
    style NGRAM stroke:#3b82f6,stroke-width:2px
    style PRECISION stroke:#8b5cf6,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style I fill:#f59e0b,stroke:#d97706,stroke-width:3px,color:#fff

Modified Precision:

p_n = Σ min(count(ngram), max_ref_count(ngram)) / Σ count(ngram)

Brevity Penalty (BP):

BP = 1                    if c > r
BP = exp(1 - r/c)         if c ≤ r

where c = candidate length, r = reference length

Final Score:

BLEU = BP × exp(Σ w_n × log(p_n))

where w_n = 1/N (uniform weights)

๐Ÿ“ Clipping
Prevents gaming by repeating words. Each n-gram counted at most as many times as it appears in reference.
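The clipping rule can be sketched in a few lines (a stdlib illustration of the idea, not the library's code):

```python
# Count clipping: without it, repeating a common reference word inflates
# unigram precision; clipping caps each count at the reference maximum.
from collections import Counter

reference = "the cat sat on the mat".split()
candidate = "the the the the the the".split()  # degenerate gaming attempt

cand_counts = Counter(candidate)  # {'the': 6}
ref_counts = Counter(reference)   # 'the' appears at most 2 times

# Unclipped: every candidate token occurs somewhere in the reference
unclipped = sum(c for w, c in cand_counts.items() if w in ref_counts) / len(candidate)
# Clipped: 'the' is counted at most twice, matching the reference
clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items()) / len(candidate)

print(unclipped)  # 1.0
print(clipped)    # 0.333... (2/6)
```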

๐Ÿ“ Brevity Penalty
Penalizes outputs shorter than reference. Prevents gaming by outputting only high-confidence words.
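The three steps above can be combined into a minimal sentence-BLEU sketch, assuming whitespace tokenization and no smoothing. This is an illustration of the formulas, not the library's internal implementation:

```python
# Minimal BLEU-4 from the formulas above: clipped n-gram precision,
# geometric mean with uniform weights, brevity penalty. No smoothing,
# so any zero precision drives the whole score to 0.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    cand = candidate.lower().split()
    ref = reference.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each n-gram count to its maximum count in the reference
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # unsmoothed: a zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: BP = 1 if c > r, else exp(1 - r/c)
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)

score = sentence_bleu(
    "The quick brown fox jumped over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",
)
print(f"{score:.3f}")  # ~0.597: one changed word breaks every spanning n-gram
```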


Configuration

| Parameter      | Type | Default | Description                                |
|----------------|------|---------|--------------------------------------------|
| n_grams        | int  | 4       | Maximum n-gram length (e.g., 4 for BLEU-4) |
| case_sensitive | bool | False   | Whether comparison is case-sensitive       |
| smoothing      | bool | True    | Apply smoothing for sentence-level BLEU    |

Smoothing

Sentence-level BLEU often has zero counts for higher n-grams. Smoothing (add-one) prevents the entire score from becoming zero.
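A short sketch shows the effect: the pair below shares many lower-order n-grams but no 4-grams, so the unsmoothed geometric mean collapses to zero while add-one smoothing keeps it nonzero. The helper name is hypothetical and stdlib-only, not the library's implementation; the brevity penalty is omitted since both sentences have equal length (BP = 1):

```python
# Why sentence-level BLEU needs smoothing: with no 4-gram matches,
# unsmoothed BLEU-4 is 0 despite strong lower-order overlap.
# clipped_precision is a hypothetical helper, not the library's API.
import math
from collections import Counter

def clipped_precision(cand, ref, n, smooth=False):
    cgrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    rgrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    match = sum(min(c, rgrams[g]) for g, c in cgrams.items())
    total = sum(cgrams.values())
    if smooth:
        return (match + 1) / (total + 1)  # add-one smoothing
    return match / total if total else 0.0

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()

for smooth in (False, True):
    ps = [clipped_precision(cand, ref, n, smooth) for n in range(1, 5)]
    score = 0.0 if 0 in ps else math.exp(sum(map(math.log, ps)) / 4)
    print(smooth, [round(p, 2) for p in ps], round(score, 3))
# smooth=False: p4 = 0/3 -> score 0.0
# smooth=True:  [6/7, 4/6, 2/5, 1/4] -> score ~0.489
```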


Code Examples

from axion.metrics import SentenceBleu
from axion.dataset import DatasetItem

metric = SentenceBleu()

item = DatasetItem(
    actual_output="The cat sat on the mat.",
    expected_output="The cat is sitting on the mat.",
)

result = await metric.execute(item)
print(result.pretty())
# Score: ~0.6 (good n-gram overlap with minor differences)

Configuration options:

from axion.metrics import SentenceBleu

# BLEU-2 for shorter sequences
metric = SentenceBleu(n_grams=2)

# Case-sensitive comparison
metric = SentenceBleu(case_sensitive=True)

# Without smoothing (corpus-level style)
metric = SentenceBleu(smoothing=False)

Batch evaluation:

from axion.metrics import SentenceBleu
from axion.runners import MetricRunner

metric = SentenceBleu(n_grams=4)
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)

for item_result in results:
    print(f"BLEU-4: {item_result.score:.3f}")

Example Scenarios

✅ Scenario 1: High BLEU Score (~0.6)

Near-Perfect Match

Reference:

"The quick brown fox jumps over the lazy dog."

Candidate:

"The quick brown fox jumped over the lazy dog."

Analysis:

  • 1-grams: 8/9 match (jumped vs jumps)
  • 2-grams: 6/8 match (both bigrams containing "jumped" miss)
  • 3-grams: 4/7 match
  • 4-grams: 2/6 match
  • Brevity penalty: 1.0 (same length)

Final Score: ~0.6

A single changed word breaks every n-gram that spans it, so even a near-perfect match scores well below 1.0.
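Counting n-gram matches by hand is error-prone; a short stdlib script (not part of the library) can check the tallies for this pair:

```python
# Verify the clipped n-gram match counts for the scenario above.
from collections import Counter

ref = "The quick brown fox jumps over the lazy dog.".lower().split()
cand = "The quick brown fox jumped over the lazy dog.".lower().split()

counts = {}
for n in range(1, 5):
    cgrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    rgrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matched = sum(min(c, rgrams[g]) for g, c in cgrams.items())
    counts[n] = (matched, sum(cgrams.values()))
    print(f"{n}-grams: {counts[n][0]}/{counts[n][1]}")
# 1-grams: 8/9, 2-grams: 6/8, 3-grams: 4/7, 4-grams: 2/6
```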

โš ๏ธ Scenario 2: Moderate BLEU Score (~0.5)

Partial Overlap

Reference:

"Machine learning models require large datasets for training."

Candidate:

"Deep learning needs big data to train properly."

Analysis:

  • Same meaning, different words
  • Few exact n-gram matches
  • "learning" and "train" overlap

Final Score: ~0.3

Semantic similarity high, but n-gram overlap low.

โŒ Scenario 3: Low BLEU Score (~0.1)

Minimal Overlap

Reference:

"Paris is the capital of France."

Candidate:

"The Eiffel Tower is located in the French capital city."

Analysis:

  • Related topic, completely different wording
  • Almost no n-gram matches

Final Score: ~0.1

Semantically related but lexically different.


Why It Matters

⚡ Fast & Deterministic

No LLM calls needed. Instant, reproducible results ideal for CI/CD pipelines.

📊 Industry Standard

Widely used in NLP research for translation and summarization evaluation.

🔢 N-gram Precision

Captures phrase-level similarity, not just word overlap.


Quick Reference

TL;DR

Sentence BLEU = How much does the candidate text overlap with the reference at the n-gram level?

  • Use it when: Fast, deterministic text similarity is needed
  • Score interpretation: Higher = more n-gram overlap with reference
  • Key config: n_grams controls phrase length (default 4)