
Heuristic Metrics

Fast, deterministic evaluation metrics using rule-based and statistical methods
13 Metrics · No LLM Required

Heuristic metrics are rule-based evaluation metrics that don't require LLM calls. They're ideal for production environments where speed, cost efficiency, and deterministic results are critical. These metrics use pattern matching, statistical analysis, and algorithmic comparisons.


String Matching Metrics

Compare actual outputs against expected outputs using various matching strategies.

  • Exact String Match: checks for identical strings (fields: actual_output, expected_output)
  • Contains Match: checks whether the output contains the expected text (fields: actual_output, expected_output)
  • Levenshtein Ratio: character-level string similarity (fields: actual_output, expected_output)
  • Sentence BLEU: n-gram precision similarity (fields: actual_output, expected_output)
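To make the string-similarity mechanics concrete, here is a minimal sketch of a Levenshtein Ratio computation. The function names are illustrative, not axion's API; the library's own implementation may normalize differently.

```python
def levenshtein_distance(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over two rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_ratio(actual: str, expected: str) -> float:
    # Similarity in [0, 1]: 1.0 means the strings are identical.
    if not actual and not expected:
        return 1.0
    dist = levenshtein_distance(actual, expected)
    return 1.0 - dist / max(len(actual), len(expected))
```

A score of 1.0 on the classic "kitten"/"sitting" pair would require zero edits; with three edits over seven characters the ratio comes out around 0.57, which a 0.2 threshold would accept.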

Safety & Compliance Metrics

Evaluate outputs for privacy, citations, and policy compliance.

  • PII Leakage (Heuristic): detects PII using regex patterns (fields: query, actual_output)
  • Citation Presence: verifies that responses include citations (fields: actual_output)
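A regex-based PII check can be sketched as follows. The patterns and the scoring rule here are illustrative assumptions, not axion's actual detector, which would use a much broader pattern set.

```python
import re

# Illustrative patterns only; a production detector covers many more PII types.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def pii_leakage_score(actual_output: str) -> float:
    # 1.0 = privacy-safe (no matches); each PII type found lowers the score.
    hits = sum(1 for pattern in PII_PATTERNS.values()
               if pattern.search(actual_output))
    return max(0.0, 1.0 - hits / len(PII_PATTERNS))
```

Under this scheme a clean response scores 1.0, and an output leaking one of the three pattern types scores about 0.67, below the 0.8 threshold in the reference table.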

Performance Metrics

Monitor execution time and operational performance.

  • Latency: measures and evaluates execution time (fields: latency)
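The latency metric reduces to timing the system under test and comparing against a threshold. A minimal sketch, assuming a simple pass/fail check at the 5-second default from the reference table (the function name is illustrative):

```python
import time

def latency_passes(latency_seconds: float, threshold_seconds: float = 5.0) -> bool:
    # The raw latency is the score; the check passes at or under the threshold.
    return latency_seconds <= threshold_seconds

# Measure a call to the system under test:
start = time.perf_counter()
result = sum(range(1000))  # stand-in for the real call
elapsed = time.perf_counter() - start
```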

Output Constraints

Enforce length and format requirements on outputs.

  • Length Constraint: verifies character and sentence limits (fields: actual_output)
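A length check like this can be sketched with a character count plus a naive sentence split. The splitting regex and function name are assumptions for illustration; real sentence segmentation is more robust.

```python
import re

def length_constraint_score(text: str, max_chars: int = 1000,
                            sentence_range: tuple[int, int] = (1, 5)) -> float:
    # Binary score: 1.0 only if both the character and sentence limits hold.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lo, hi = sentence_range
    ok = len(text) <= max_chars and lo <= len(sentences) <= hi
    return 1.0 if ok else 0.0
```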

Retrieval Metrics (IR)

Standard information retrieval metrics for evaluating search and ranking quality.

  • Hit Rate @ K: is any relevant result in the top K? (fields: actual_ranking, expected_reference)
  • MRR: rank of the first relevant result (fields: actual_ranking, expected_reference)
  • NDCG @ K: graded relevance with rank discounting (fields: actual_ranking, expected_reference)
  • Precision @ K: fraction of the top K that is relevant (fields: actual_ranking, expected_reference)
  • Recall @ K: coverage of the relevant documents (fields: actual_ranking, expected_reference)
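These are standard IR formulas, so they can be sketched directly. The function names and input shapes (document IDs, a relevance set, graded gains) are illustrative assumptions rather than axion's signatures.

```python
import math

def hit_rate_at_k(ranking: list[str], relevant: set[str], k: int) -> float:
    # 1.0 if any relevant document appears in the top k results.
    return 1.0 if any(doc in relevant for doc in ranking[:k]) else 0.0

def mrr(ranking: list[str], relevant: set[str]) -> float:
    # Reciprocal rank of the first relevant result (0.0 if none found).
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranking: list[str], gains: dict[str, float], k: int) -> float:
    # Discounted cumulative gain, normalized by the ideal ordering's DCG.
    dcg = sum(gains.get(doc, 0.0) / math.log2(i + 1)
              for i, doc in enumerate(ranking[:k], start=1))
    ideal = sum(g / math.log2(i + 1)
                for i, g in enumerate(
                    sorted(gains.values(), reverse=True)[:k], start=1))
    return dcg / ideal if ideal > 0 else 0.0
```

For example, if the first relevant document appears at rank 2, MRR is 0.5 and Hit Rate @ 1 is 0.0 while Hit Rate @ 2 is 1.0.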

Quick Reference

| Metric | Score Range | Threshold | Key Question |
|---|---|---|---|
| Exact String Match | 0.0 or 1.0 | 0.5 | Are strings identical? |
| Contains Match | 0.0 or 1.0 | 0.5 | Is expected text in output? |
| Levenshtein Ratio | 0.0 – 1.0 | 0.2 | How similar are the strings? |
| Sentence BLEU | 0.0 – 1.0 | 0.5 | How much n-gram overlap? |
| PII Leakage (Heuristic) | 0.0 – 1.0 | 0.8 | Is output privacy-safe? (1.0 = safe) |
| Citation Presence | 0.0 or 1.0 | 0.5 | Are citations included? |
| Latency | 0.0 – ∞ | 5.0s | How fast was the response? |
| Length Constraint | 0.0 or 1.0 | 1.0 | Within char/sentence limits? |
| Hit Rate @ K | 0.0 or 1.0 | – | Any relevant in top K? |
| MRR | 0.0 – 1.0 | – | How early is first relevant? |
| NDCG @ K | 0.0 – 1.0 | – | Is ranking optimal? |
| Precision @ K | 0.0 – 1.0 | – | Are results mostly relevant? |
| Recall @ K | 0.0 – 1.0 | – | Did we find all relevant? |

Usage Example

import asyncio

from axion.metrics import (
    ExactStringMatch,
    LevenshteinRatio,
    LengthConstraint,
    PIILeakageHeuristic,
    HitRateAtK,
)
from axion.runners import MetricRunner
from axion.dataset import Dataset

# Initialize metrics
metrics = [
    ExactStringMatch(),
    LevenshteinRatio(case_sensitive=False),
    LengthConstraint(max_chars=1000, sentence_range=(1, 5)),
    PIILeakageHeuristic(confidence_threshold=0.7),
    HitRateAtK(k=10),
]

async def evaluate(dataset: Dataset) -> None:
    # Run evaluation
    runner = MetricRunner(metrics=metrics)
    results = await runner.run(dataset)

    # Analyze results (use numeric defaults so the :.2f format never fails)
    for item in results:
        print(f"Exact Match: {item.scores.get('exact_string_match', 'N/A')}")
        print(f"Similarity: {item.scores.get('levenshtein_ratio', 0.0):.2f}")
        print(f"Privacy Safe: {item.scores.get('pii_leakage_heuristic', 0.0):.2f}")

# Run with a prepared Dataset instance:
# asyncio.run(evaluate(dataset))

Choosing the Right Metrics

Evaluation Strategy

For Exact Outputs (Code, JSON, IDs):

  • Use Exact String Match for strict equality
  • Add Contains Match for partial verification

For Natural Language:

  • Use Levenshtein Ratio for typo/variation tolerance
  • Use Sentence BLEU for paraphrase comparison

For Privacy & Compliance:

  • Use PII Leakage (Heuristic) for fast screening
  • Add Citation Presence for source attribution

For Output Length/Format:

  • Use Length Constraint for character and sentence limits

For Search/Retrieval:

  • Use Hit Rate for quick sanity checks
  • Use NDCG for comprehensive ranking evaluation
  • Use Precision/Recall for classic IR metrics

Why Heuristic Metrics?

⚡ Instant Results

No LLM calls needed—microsecond latency.

💰 Zero Cost

No API costs or token usage.

🔄 Deterministic

Same input always produces same output.

📈 Scalable

Evaluate millions of items without limits.