Heuristic Metrics¶
13 metrics · No LLM required
Heuristic metrics are rule-based evaluation metrics that don't require LLM calls. They're ideal for production environments where speed, cost efficiency, and deterministic results are critical. These metrics use pattern matching, statistical analysis, and algorithmic comparisons.
String Matching Metrics¶
Compare actual outputs against expected outputs using various matching strategies.
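To give a rough sense of what these metrics compute, here is a minimal standard-library sketch. The function names are illustrative, not axion's API, and `difflib`'s ratio is closely related to, but not identical to, a true Levenshtein ratio:

```python
import difflib

def exact_string_match(actual: str, expected: str) -> float:
    """Binary score: 1.0 only when the strings are identical."""
    return 1.0 if actual == expected else 0.0

def contains_match(actual: str, expected: str) -> float:
    """Binary score: 1.0 when the expected text appears anywhere in the output."""
    return 1.0 if expected in actual else 0.0

def similarity_ratio(actual: str, expected: str) -> float:
    """Graded similarity in [0.0, 1.0] via difflib; a true Levenshtein
    ratio can differ slightly on some inputs."""
    return difflib.SequenceMatcher(None, actual, expected).ratio()

print(exact_string_match("abc", "abc"))          # 1.0
print(contains_match("the answer is 42", "42"))  # 1.0
print(similarity_ratio("color", "colour"))       # ≈ 0.909
```

Note how the first two are binary pass/fail while the ratio degrades gracefully, which is why the Quick Reference below gives them different thresholds.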
Safety & Compliance Metrics¶
Evaluate outputs for privacy, citations, and policy compliance.
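A heuristic PII check typically scans for known identifier patterns with regular expressions. The sketch below is illustrative only: the patterns are a tiny, non-exhaustive sample, and the scoring rule (fraction of PII categories not detected) is an assumption, not axion's implementation:

```python
import re

# Common PII patterns (illustrative, not exhaustive)
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def pii_safety_score(text: str) -> float:
    """Return 1.0 for PII-free text; lower as more PII categories are found."""
    hits = sum(1 for pattern in PII_PATTERNS.values() if pattern.search(text))
    return 1.0 - hits / len(PII_PATTERNS)

print(pii_safety_score("Contact support for help."))     # 1.0
print(pii_safety_score("Email me at jane@example.com"))  # ≈ 0.67
```

As in the table below, 1.0 means privacy-safe, so you assert `score >= threshold` rather than hunting for leaks manually.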
Performance Metrics¶
Monitor execution time and operational performance.
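A latency metric is just elapsed wall-clock time compared against a threshold. A minimal sketch, assuming the 5.0s default from the Quick Reference table (the helper name is hypothetical):

```python
import time

LATENCY_THRESHOLD_S = 5.0  # default threshold from the Quick Reference table

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds, passed)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= LATENCY_THRESHOLD_S

result, elapsed, passed = timed_call(sum, range(1000))
print(result, passed)  # 499500 True
```

Unlike the other metrics here, latency is unbounded (0.0 – ∞), so lower is better and the pass/fail decision comes from the threshold, not the raw score.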
Output Constraints¶
Enforce length and format requirements on outputs.
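The check is binary: the output passes only if it satisfies both the character limit and the sentence-count range. A sketch under the same parameter names used in the usage example below (`max_chars`, `sentence_range`), with a deliberately naive sentence splitter:

```python
import re

def length_constraint(text: str, max_chars: int = 1000,
                      sentence_range: tuple = (1, 5)) -> float:
    """Binary score: 1.0 only if both the char and sentence limits hold."""
    # Naive sentence splitting on ., ! and ? — a real implementation
    # would also handle abbreviations, decimals, ellipses, etc.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lo, hi = sentence_range
    ok = len(text) <= max_chars and lo <= len(sentences) <= hi
    return 1.0 if ok else 0.0

print(length_constraint("Short and sweet."))  # 1.0
print(length_constraint("x" * 2000))          # 0.0 (too many characters)
```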
Retrieval Metrics (IR)¶
Standard information retrieval metrics for evaluating search and ranking quality.
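These five metrics have compact closed forms. A binary-relevance sketch (graded-relevance NDCG generalizes the gain term; function names are illustrative, not axion's API):

```python
import math

def hit_rate_at_k(relevant: set, ranked: list, k: int) -> float:
    """1.0 if any relevant doc appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0

def mrr(relevant: set, ranked: list) -> float:
    """Reciprocal rank of the first relevant doc (1-indexed)."""
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def precision_at_k(relevant: set, ranked: list, k: int) -> float:
    return sum(d in relevant for d in ranked[:k]) / k

def recall_at_k(relevant: set, ranked: list, k: int) -> float:
    return sum(d in relevant for d in ranked[:k]) / len(relevant)

def ndcg_at_k(relevant: set, ranked: list, k: int) -> float:
    """Binary-relevance NDCG: discounted gain vs. the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

relevant = {"d1", "d4"}
ranked = ["d2", "d1", "d3", "d4", "d5"]
print(hit_rate_at_k(relevant, ranked, 3))   # 1.0 (d1 is in the top 3)
print(mrr(relevant, ranked))                # 0.5 (first hit at rank 2)
print(precision_at_k(relevant, ranked, 5))  # 0.4 (2 of 5 are relevant)
print(recall_at_k(relevant, ranked, 5))     # 1.0 (both relevant docs found)
```

Precision and recall trade off against each other as k grows, while NDCG additionally rewards placing relevant documents earlier in the ranking.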
Quick Reference¶
| Metric | Score Range | Threshold | Key Question |
|---|---|---|---|
| Exact String Match | 0.0 or 1.0 | 0.5 | Are strings identical? |
| Contains Match | 0.0 or 1.0 | 0.5 | Is expected text in output? |
| Levenshtein Ratio | 0.0 – 1.0 | 0.2 | How similar are the strings? |
| Sentence BLEU | 0.0 – 1.0 | 0.5 | How much n-gram overlap? |
| PII Leakage (Heuristic) | 0.0 – 1.0 | 0.8 | Is output privacy-safe? (1.0 = safe) |
| Citation Presence | 0.0 or 1.0 | 0.5 | Are citations included? |
| Latency | 0.0 – ∞ | 5.0s | How fast was the response? |
| Length Constraint | 0.0 or 1.0 | 1.0 | Within char/sentence limits? |
| Hit Rate @ K | 0.0 or 1.0 | - | Any relevant in top K? |
| MRR | 0.0 – 1.0 | - | How early is first relevant? |
| NDCG @ K | 0.0 – 1.0 | - | Is ranking optimal? |
| Precision @ K | 0.0 – 1.0 | - | Are results mostly relevant? |
| Recall @ K | 0.0 – 1.0 | - | Did we find all relevant? |
Usage Example¶
```python
from axion.metrics import (
    ExactStringMatch,
    LevenshteinRatio,
    LengthConstraint,
    PIILeakageHeuristic,
    HitRateAtK,
)
from axion.runners import MetricRunner
from axion.dataset import Dataset

# Initialize metrics
metrics = [
    ExactStringMatch(),
    LevenshteinRatio(case_sensitive=False),
    LengthConstraint(max_chars=1000, sentence_range=(1, 5)),
    PIILeakageHeuristic(confidence_threshold=0.7),
    HitRateAtK(k=10),
]

# Run evaluation (inside an async context, with `dataset` a Dataset instance)
runner = MetricRunner(metrics=metrics)
results = await runner.run(dataset)

# Analyze results (default 0.0 keeps the :.2f format spec safe when a score is missing)
for item in results:
    print(f"Exact Match: {item.scores.get('exact_string_match', 'N/A')}")
    print(f"Similarity: {item.scores.get('levenshtein_ratio', 0.0):.2f}")
    print(f"Privacy Safe: {item.scores.get('pii_leakage_heuristic', 0.0):.2f}")
```
Choosing the Right Metrics¶
Evaluation Strategy
For Exact Outputs (Code, JSON, IDs):
- Use Exact String Match for strict equality
- Add Contains Match for partial verification
For Natural Language:
- Use Levenshtein Ratio for typo/variation tolerance
- Use Sentence BLEU for paraphrase comparison
For Privacy & Compliance:
- Use PII Leakage (Heuristic) for fast screening
- Add Citation Presence for source attribution
For Output Length/Format:
- Use Length Constraint for character and sentence limits
For Search/Retrieval:
- Use Hit Rate for quick sanity checks
- Use NDCG for comprehensive ranking evaluation
- Use Precision/Recall for classic IR metrics
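For the paraphrase-comparison case above, Sentence BLEU can be sketched in pure Python. This is a single-reference, uniform-weight BLEU-4 with brevity penalty and no smoothing (so any zero n-gram precision yields 0.0); it is an illustration of the standard formula, not axion's implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Single-reference BLEU with uniform weights and brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_ngrams, r_ngrams = ngrams(cand, n), ngrams(ref, n)
        total = sum(c_ngrams.values())
        if total == 0:
            return 0.0
        clipped = sum(min(count, r_ngrams[g]) for g, count in c_ngrams.items())
        if clipped == 0:
            return 0.0  # no smoothing: any zero precision gives BLEU 0
        precisions.append(clipped / total)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

same = "the cat sat on the mat"
print(sentence_bleu(same, same))    # 1.0
print(sentence_bleu("a b", "c d"))  # 0.0 (no overlap)
```

Because unsmoothed BLEU collapses to 0.0 on short or low-overlap sentences, prefer Levenshtein Ratio when outputs are only a few words long.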
Why Heuristic Metrics?¶
- **Speed**: No LLM calls needed; microsecond latency.
- **Cost**: No API costs or token usage.
- **Determinism**: The same input always produces the same output.
- **Scale**: Evaluate millions of items without limits.