Heuristic Metrics¶
13 metrics · No LLM required
Heuristic metrics are rule-based evaluation metrics that don't require LLM calls. They're ideal for production environments where speed, cost efficiency, and deterministic results are critical. These metrics use pattern matching, statistical analysis, and algorithmic comparisons.
String Matching Metrics¶
Compare actual outputs against expected outputs using various matching strategies.
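To give a rough sense of what these metrics compute, here is a minimal standard-library sketch. The function names are illustrative, not axion's API, and `difflib`'s ratio is closely related to, but not identical to, a true Levenshtein ratio:

```python
import difflib

def exact_string_match(actual: str, expected: str) -> float:
    """Binary score: 1.0 only when the strings are identical."""
    return 1.0 if actual == expected else 0.0

def contains_match(actual: str, expected: str) -> float:
    """Binary score: 1.0 when the expected text appears anywhere in the output."""
    return 1.0 if expected in actual else 0.0

def similarity_ratio(actual: str, expected: str) -> float:
    """Graded similarity in [0.0, 1.0] via difflib; a true Levenshtein
    ratio can differ slightly on some inputs."""
    return difflib.SequenceMatcher(None, actual, expected).ratio()

print(exact_string_match("abc", "abc"))          # 1.0
print(contains_match("the answer is 42", "42"))  # 1.0
print(similarity_ratio("color", "colour"))       # ≈ 0.909
```

Note how the first two are binary pass/fail while the ratio degrades gracefully, which is why the Quick Reference below gives them different thresholds.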
Safety & Compliance Metrics¶
Evaluate outputs for privacy, citations, and policy compliance.
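A heuristic PII check typically scans for known identifier patterns with regular expressions. The sketch below is illustrative only: the patterns are a tiny, non-exhaustive sample, and the scoring rule (fraction of PII categories not detected) is an assumption, not axion's implementation:

```python
import re

# Common PII patterns (illustrative, not exhaustive)
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def pii_safety_score(text: str) -> float:
    """Return 1.0 for PII-free text; lower as more PII categories are found."""
    hits = sum(1 for pattern in PII_PATTERNS.values() if pattern.search(text))
    return 1.0 - hits / len(PII_PATTERNS)

print(pii_safety_score("Contact support for help."))     # 1.0
print(pii_safety_score("Email me at jane@example.com"))  # ≈ 0.67
```

As in the table below, 1.0 means privacy-safe, so you assert `score >= threshold` rather than hunting for leaks manually.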
Performance Metrics¶
Monitor execution time and operational performance.
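A latency metric is just elapsed wall-clock time compared against a threshold. A minimal sketch, assuming the 5.0s default from the Quick Reference table (the helper name is hypothetical):

```python
import time

LATENCY_THRESHOLD_S = 5.0  # default threshold from the Quick Reference table

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds, passed)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= LATENCY_THRESHOLD_S

result, elapsed, passed = timed_call(sum, range(1000))
print(result, passed)  # 499500 True
```

Unlike the other metrics here, latency is unbounded (0.0 – ∞), so lower is better and the pass/fail decision comes from the threshold, not the raw score.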
Output Constraints¶
Enforce length and format requirements on outputs.
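The check is binary: the output passes only if it satisfies both the character limit and the sentence-count range. A sketch under the same parameter names used in the usage example below (`max_chars`, `sentence_range`), with a deliberately naive sentence splitter:

```python
import re

def length_constraint(text: str, max_chars: int = 1000,
                      sentence_range: tuple = (1, 5)) -> float:
    """Binary score: 1.0 only if both the char and sentence limits hold."""
    # Naive sentence splitting on ., ! and ? — a real implementation
    # would also handle abbreviations, decimals, ellipses, etc.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lo, hi = sentence_range
    ok = len(text) <= max_chars and lo <= len(sentences) <= hi
    return 1.0 if ok else 0.0

print(length_constraint("Short and sweet."))  # 1.0
print(length_constraint("x" * 2000))          # 0.0 (too many characters)
```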
Retrieval Metrics (IR)¶
Standard information retrieval metrics for evaluating search and ranking quality.
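These five metrics have compact closed forms. A binary-relevance sketch (graded-relevance NDCG generalizes the gain term; function names are illustrative, not axion's API):

```python
import math

def hit_rate_at_k(relevant: set, ranked: list, k: int) -> float:
    """1.0 if any relevant doc appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0

def mrr(relevant: set, ranked: list) -> float:
    """Reciprocal rank of the first relevant doc (1-indexed)."""
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def precision_at_k(relevant: set, ranked: list, k: int) -> float:
    return sum(d in relevant for d in ranked[:k]) / k

def recall_at_k(relevant: set, ranked: list, k: int) -> float:
    return sum(d in relevant for d in ranked[:k]) / len(relevant)

def ndcg_at_k(relevant: set, ranked: list, k: int) -> float:
    """Binary-relevance NDCG: discounted gain vs. the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

relevant = {"d1", "d4"}
ranked = ["d2", "d1", "d3", "d4", "d5"]
print(hit_rate_at_k(relevant, ranked, 3))   # 1.0 (d1 is in the top 3)
print(mrr(relevant, ranked))                # 0.5 (first hit at rank 2)
print(precision_at_k(relevant, ranked, 5))  # 0.4 (2 of 5 are relevant)
print(recall_at_k(relevant, ranked, 5))     # 1.0 (both relevant docs found)
```

Precision and recall trade off against each other as k grows, while NDCG additionally rewards placing relevant documents earlier in the ranking.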
Quick Reference¶
| Metric | Score Range | Threshold | Key Question |
|---|---|---|---|
| Exact String Match | 0.0 or 1.0 | 0.5 | Are strings identical? |
| Contains Match | 0.0 or 1.0 | 0.5 | Is expected text in output? |
| Levenshtein Ratio | 0.0 – 1.0 | 0.2 | How similar are the strings? |
| Sentence BLEU | 0.0 – 1.0 | 0.5 | How much n-gram overlap? |
| PII Leakage (Heuristic) | 0.0 – 1.0 | 0.8 | Is output privacy-safe? (1.0 = safe) |
| Citation Presence | 0.0 or 1.0 | 0.5 | Are citations included? |
| Latency | 0.0 – ∞ | 5.0s | How fast was the response? |
| Length Constraint | 0.0 or 1.0 | 1.0 | Within char/sentence limits? |
| Hit Rate @ K | 0.0 or 1.0 | - | Any relevant in top K? |
| MRR | 0.0 – 1.0 | - | How early is first relevant? |
| NDCG @ K | 0.0 – 1.0 | - | Is ranking optimal? |
| Precision @ K | 0.0 – 1.0 | - | Are results mostly relevant? |
| Recall @ K | 0.0 – 1.0 | - | Did we find all relevant? |
Usage Example¶
```python
from axion.metrics import (
    ExactStringMatch,
    LevenshteinRatio,
    LengthConstraint,
    PIILeakageHeuristic,
    HitRateAtK,
)
from axion.runners import MetricRunner
from axion.dataset import Dataset

# Initialize metrics
metrics = [
    ExactStringMatch(),
    LevenshteinRatio(case_sensitive=False),
    LengthConstraint(max_chars=1000, sentence_range=(1, 5)),
    PIILeakageHeuristic(confidence_threshold=0.7),
    HitRateAtK(k=10),
]

# Run evaluation (inside an async context, with `dataset` a Dataset instance)
runner = MetricRunner(metrics=metrics)
results = await runner.run(dataset)

# Analyze results (default 0.0 keeps the :.2f format spec safe when a score is missing)
for item in results:
    print(f"Exact Match: {item.scores.get('exact_string_match', 'N/A')}")
    print(f"Similarity: {item.scores.get('levenshtein_ratio', 0.0):.2f}")
    print(f"Privacy Safe: {item.scores.get('pii_leakage_heuristic', 0.0):.2f}")
```
Choosing the Right Metrics¶
Evaluation Strategy
For Exact Outputs (Code, JSON, IDs):
- Use Exact String Match for strict equality
- Add Contains Match for partial verification
For Natural Language:
- Use Levenshtein Ratio for typo/variation tolerance
- Use Sentence BLEU for paraphrase comparison
For Privacy & Compliance:
- Use PII Leakage (Heuristic) for fast screening
- Add Citation Presence for source attribution
For Output Length/Format:
- Use Length Constraint for character and sentence limits
For Search/Retrieval:
- Use Hit Rate for quick sanity checks
- Use NDCG for comprehensive ranking evaluation
- Use Precision/Recall for classic IR metrics
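For the paraphrase-comparison case above, Sentence BLEU can be sketched in pure Python. This is a single-reference, uniform-weight BLEU-4 with brevity penalty and no smoothing (so any zero n-gram precision yields 0.0); it is an illustration of the standard formula, not axion's implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Single-reference BLEU with uniform weights and brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_ngrams, r_ngrams = ngrams(cand, n), ngrams(ref, n)
        total = sum(c_ngrams.values())
        if total == 0:
            return 0.0
        clipped = sum(min(count, r_ngrams[g]) for g, count in c_ngrams.items())
        if clipped == 0:
            return 0.0  # no smoothing: any zero precision gives BLEU 0
        precisions.append(clipped / total)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

same = "the cat sat on the mat"
print(sentence_bleu(same, same))    # 1.0
print(sentence_bleu("a b", "c d"))  # 0.0 (no overlap)
```

Because unsmoothed BLEU collapses to 0.0 on short or low-overlap sentences, prefer Levenshtein Ratio when outputs are only a few words long.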
Why Heuristic Metrics?¶
- **Speed**: No LLM calls needed; microsecond latency.
- **Cost**: No API costs or token usage.
- **Determinism**: The same input always produces the same output.
- **Scale**: Evaluate millions of items without limits.