Retrieval Metrics¶
Evaluate information retrieval quality with standard IR metrics
Overview¶
Axion provides a comprehensive suite of Information Retrieval (IR) metrics for evaluating search and retrieval systems. These metrics compare retrieved document rankings against ground truth relevance judgments.
| Metric | What It Measures | Use Case |
|---|---|---|
| Hit Rate @ K | Any relevant result in top K? | Quick relevance check |
| MRR | Rank of first relevant result | First-result quality |
| NDCG @ K | Graded relevance with position discount | Ranking quality |
| Precision @ K | Fraction of top K that's relevant | Result purity |
| Recall @ K | Fraction of relevant docs in top K | Coverage |
Multi-K Support
All @K metrics support evaluating at multiple K values simultaneously (e.g., k=[5, 10, 20]). This allows comparing retrieval quality at different cutoffs in a single evaluation pass.
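The multi-K pattern can be sketched in plain Python. This is an illustrative helper (`hit_at_ks` is not part of the axion API): it computes the "any relevant doc in top K?" check at several cutoffs in a single pass over the ranking.

```python
# Illustrative sketch of multi-K evaluation (not the axion API):
# evaluate one @K check at several cutoffs in a single pass.
def hit_at_ks(ranking, relevant_ids, ks):
    ids = [doc["id"] for doc in ranking]
    return {
        k: float(any(doc_id in relevant_ids for doc_id in ids[:k]))
        for k in ks
    }

ranking = [{"id": "doc1"}, {"id": "doc2"}, {"id": "doc3"}]
print(hit_at_ks(ranking, {"doc3"}, ks=[1, 2, 3]))  # {1: 0.0, 2: 0.0, 3: 1.0}
```

The same single-pass idea applies to Precision, Recall, and NDCG: compute per-position statistics once, then read them off at each cutoff.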
Required Inputs¶
All retrieval metrics require the same input structure:
| Field | Type | Description |
|---|---|---|
| `actual_ranking` | `List[Dict]` | Retrieved documents in order, each with an `id` key |
| `expected_reference` | `List[Dict]` | Ground truth with `id` and optional `relevance` score |
```python
from axion.dataset import DatasetItem

item = DatasetItem(
    actual_ranking=[
        {"id": "doc1"},  # Position 1
        {"id": "doc2"},  # Position 2
        {"id": "doc3"},  # Position 3
    ],
    expected_reference=[
        {"id": "doc1", "relevance": 1.0},  # Relevant
        {"id": "doc5", "relevance": 1.0},  # Relevant but not retrieved
    ],
)
```
Hit Rate @ K¶
Binary check: Was ANY relevant document retrieved in the top K?
At a Glance¶
- **Score Range**: 0.0 or 1.0 (binary pass/fail)
- **Default K**: 10 (top results to check)
How It Works¶
```mermaid
flowchart LR
    A[Retrieved Docs] --> B{Any relevant<br>in top K?}
    B -->|Yes| C["Score: 1.0"]
    B -->|No| D["Score: 0.0"]
    style C fill:#10b981,stroke:#059669,color:#fff
    style D fill:#ef4444,stroke:#dc2626,color:#fff
```
Usage¶
```python
from axion.metrics import HitRateAtK

# Single K
metric = HitRateAtK(k=10)

# Multiple K values
metric = HitRateAtK(k=[5, 10, 20], main_k=10)

result = await metric.execute(item)
print(result.score)  # 1.0 if hit, 0.0 if miss
```
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `k` | `int \| List[int]` | `10` | Cutoff(s) for evaluation |
| `main_k` | `int` | `max(k)` | K value used for primary score |
Mean Reciprocal Rank (MRR)¶
How early does the first relevant result appear?
At a Glance¶
- **Score Range**: 0.0 to 1.0 (1 / rank of first relevant result)
- **K-Independent**: evaluates the full ranking
How It Works¶
```
MRR = 1 / rank_of_first_relevant_document
```

Examples:

- First relevant at position 1 → MRR = 1.0
- First relevant at position 2 → MRR = 0.5
- First relevant at position 4 → MRR = 0.25
- No relevant found → MRR = 0.0
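The examples above follow directly from the formula. A minimal sketch of the computation (the function name is illustrative, not the axion implementation):

```python
def mean_reciprocal_rank(ranking, relevant_ids):
    """Return 1 / position of the first relevant doc, or 0.0 if none found."""
    for position, doc in enumerate(ranking, start=1):
        if doc["id"] in relevant_ids:
            return 1.0 / position
    return 0.0

ranking = [{"id": "doc4"}, {"id": "doc1"}, {"id": "doc2"}]
print(mean_reciprocal_rank(ranking, {"doc1", "doc2"}))  # 0.5 (first hit at position 2)
```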
Usage¶
from axion.metrics import MeanReciprocalRank
metric = MeanReciprocalRank()
result = await metric.execute(item)
print(result.score) # 1/rank or 0.0
print(result.signals.rank_of_first_relevant) # e.g., 3
NDCG @ K¶
Normalized Discounted Cumulative Gain: handles graded relevance with position discounting.
At a Glance¶
- **Score Range**: 0.0 to 1.0 (normalized ranking quality)
- **Default K**: 10 (top results to evaluate)
How It Works¶
```mermaid
flowchart TD
    subgraph DCG["DCG Calculation"]
        A["For each position i:"]
        B["relevance[i] / log₂(i + 1)"]
        C["Sum all values"]
    end
    subgraph IDCG["IDCG (Ideal)"]
        D["Sort by relevance desc"]
        E["Calculate DCG on ideal order"]
    end
    subgraph NDCG["Final Score"]
        F["NDCG = DCG / IDCG"]
    end
    DCG --> F
    IDCG --> F
    style NDCG stroke:#f59e0b,stroke-width:2px
```
Formula:

```
DCG@K  = Σ relevance_i / log₂(i + 1), summed over positions i = 1..K
NDCG@K = DCG@K / IDCG@K
```

IDCG is the DCG of the ideal (relevance-descending) ordering, so a perfect ranking scores 1.0.
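A worked sketch of the DCG/IDCG computation using the `relevance / log₂(i + 1)` gain shown in the diagram above (illustrative only; axion's internals may differ, e.g. some NDCG variants use an exponential gain):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain over relevances in ranked order."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

retrieved = [3.0, 1.0, 2.0]              # relevance of docs in retrieved order
ideal = sorted(retrieved, reverse=True)  # [3.0, 2.0, 1.0]
ndcg = dcg(retrieved) / dcg(ideal)
print(round(ndcg, 3))  # ~0.973: near-perfect order, positions 2 and 3 swapped
```

The swap at low positions costs little because the log discount shrinks the penalty further down the ranking.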
Usage¶
```python
from axion.metrics import NDCGAtK
from axion.dataset import DatasetItem

# With graded relevance
item = DatasetItem(
    actual_ranking=[{"id": "doc1"}, {"id": "doc2"}, {"id": "doc3"}],
    expected_reference=[
        {"id": "doc1", "relevance": 3.0},  # Highly relevant
        {"id": "doc2", "relevance": 1.0},  # Marginally relevant
        {"id": "doc3", "relevance": 2.0},  # Relevant
    ],
)

metric = NDCGAtK(k=[5, 10])
result = await metric.execute(item)

print(f"NDCG@10: {result.score:.3f}")
print(f"DCG: {result.signals.results_by_k[10].dcg:.3f}")
print(f"IDCG: {result.signals.results_by_k[10].idcg:.3f}")
```
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `k` | `int \| List[int]` | `10` | Cutoff(s) for evaluation |
| `main_k` | `int` | `max(k)` | K value used for primary score |
Precision @ K¶
What fraction of the top K results are relevant?
At a Glance¶
- **Score Range**: 0.0 to 1.0 (relevant / retrieved)
- **Default K**: 10 (top results to evaluate)
How It Works¶
```
Precision@K = (Relevant docs in top K) / K
```

Examples (K=5):

- 5 relevant in top 5 → Precision = 1.0
- 3 relevant in top 5 → Precision = 0.6
- 0 relevant in top 5 → Precision = 0.0
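The computation behind these examples can be sketched in a few lines (hypothetical helper, not the axion implementation):

```python
def precision_at_k(ranking, relevant_ids, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    top_k = [doc["id"] for doc in ranking[:k]]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

ranking = [{"id": f"doc{i}"} for i in range(1, 6)]  # doc1..doc5
print(precision_at_k(ranking, {"doc1", "doc3", "doc5"}, k=5))  # 0.6
```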
Usage¶
```python
from axion.metrics import PrecisionAtK

metric = PrecisionAtK(k=10)
result = await metric.execute(item)

print(f"Precision@10: {result.score:.2%}")
print(f"Hits: {result.signals.results_by_k[10].hits_in_top_k}")
```
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `k` | `int \| List[int]` | `10` | Cutoff(s) for evaluation |
| `main_k` | `int` | `max(k)` | K value used for primary score |
Recall @ K¶
What fraction of ALL relevant documents appear in the top K?
At a Glance¶
- **Score Range**: 0.0 to 1.0 (found / total relevant)
- **Default K**: 10 (top results to evaluate)
How It Works¶
```
Recall@K = (Relevant docs in top K) / (Total relevant docs)
```

Examples (10 total relevant):

- 10 relevant in top K → Recall = 1.0
- 5 relevant in top K → Recall = 0.5
- 0 relevant in top K → Recall = 0.0
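The denominator here is the total number of relevant documents, not K. A minimal sketch (hypothetical helper, not the axion implementation):

```python
def recall_at_k(ranking, relevant_ids, k):
    """Fraction of all relevant docs that appear in the top k."""
    if not relevant_ids:
        return 0.0
    top_k = {doc["id"] for doc in ranking[:k]}
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

ranking = [{"id": "doc1"}, {"id": "doc2"}, {"id": "doc3"}]
print(recall_at_k(ranking, {"doc1", "doc5"}, k=3))  # 0.5: doc5 was never retrieved
```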
Usage¶
```python
from axion.metrics import RecallAtK

metric = RecallAtK(k=[5, 10, 20])
result = await metric.execute(item)

print(f"Recall@10: {result.score:.2%}")
print(f"Found: {result.signals.results_by_k[10].hits_in_top_k}")
print(f"Total relevant: {result.signals.results_by_k[10].total_relevant}")
```
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `k` | `int \| List[int]` | `10` | Cutoff(s) for evaluation |
| `main_k` | `int` | `max(k)` | K value used for primary score |
Comparison Guide¶
When to Use Each Metric¶
| Metric | Best For | Key Question |
|---|---|---|
| Hit Rate | Quick sanity check | "Did we find anything relevant?" |
| MRR | First-result systems | "How fast do users find what they need?" |
| NDCG | Graded relevance | "Is the ranking order optimal?" |
| Precision | Result quality | "Are results mostly relevant?" |
| Recall | Coverage | "Did we miss relevant docs?" |
Metric Relationships¶
```mermaid
flowchart TB
    subgraph COVERAGE["Coverage Metrics"]
        A[Hit Rate @ K]
        B[Recall @ K]
    end
    subgraph QUALITY["Quality Metrics"]
        C[Precision @ K]
        D[MRR]
        E[NDCG @ K]
    end
    A -->|"Binary version of"| B
    C -->|"Complementary to"| B
    D -->|"Position-aware"| E
    style COVERAGE stroke:#3b82f6,stroke-width:2px
    style QUALITY stroke:#10b981,stroke-width:2px
```
Complete Example¶
```python
from axion.metrics import (
    HitRateAtK,
    MeanReciprocalRank,
    NDCGAtK,
    PrecisionAtK,
    RecallAtK,
)
from axion.runners import MetricRunner
from axion.dataset import DatasetItem

# Create test item
item = DatasetItem(
    actual_ranking=[
        {"id": "doc1"},  # Relevant (relevance: 3)
        {"id": "doc4"},  # Not relevant
        {"id": "doc2"},  # Relevant (relevance: 2)
        {"id": "doc5"},  # Not relevant
        {"id": "doc3"},  # Relevant (relevance: 1)
    ],
    expected_reference=[
        {"id": "doc1", "relevance": 3.0},
        {"id": "doc2", "relevance": 2.0},
        {"id": "doc3", "relevance": 1.0},
    ],
)

# Evaluate with all metrics
metrics = [
    HitRateAtK(k=5),
    MeanReciprocalRank(),
    NDCGAtK(k=5),
    PrecisionAtK(k=5),
    RecallAtK(k=5),
]

runner = MetricRunner(metrics=metrics)
results = await runner.run([item])

for result in results:
    print(f"{result.metric_name}: {result.score:.3f}")

# Output:
# Hit Rate @ K: 1.000
# Mean Reciprocal Rank (MRR): 1.000
# NDCG @ K: 0.876
# Precision @ K: 0.600
# Recall @ K: 1.000
```
Quick Reference¶
TL;DR
| Metric | Formula | Score |
|---|---|---|
| Hit Rate | 1 if any relevant in K | 0 or 1 |
| MRR | 1 / first_relevant_rank | 0 to 1 |
| NDCG | DCG / IDCG | 0 to 1 |
| Precision | relevant_in_K / K | 0 to 1 |
| Recall | relevant_in_K / total_relevant | 0 to 1 |
- **API Reference**: `axion.metrics.HitRateAtK` · `axion.metrics.MeanReciprocalRank` · `axion.metrics.NDCGAtK` · `axion.metrics.PrecisionAtK` · `axion.metrics.RecallAtK`
- **Related Metrics**: Contextual Precision · Contextual Recall · Contextual Ranking