Example Selector¶
The ExampleSelector provides intelligent selection of few-shot examples for LLM-as-judge calibration. Instead of naive slicing (examples[:n]), it offers strategies that improve calibration quality by balancing accept/reject cases, prioritizing misaligned examples, or covering discovered patterns.
What You'll Learn¶
- Balanced Selection: 50/50 accept/reject sampling for unbiased baselines (the default strategy).
- Misalignment-Guided: prioritize false positives and false negatives from prior evaluation runs to focus on hard cases.
- Pattern-Aware: sample from discovered failure patterns to ensure coverage across all failure categories.
- Metric Integration: convert selected examples to the CaliberHQ dict format or a Pydantic-based metric format.
Quick Start¶
from axion.caliber import ExampleSelector, SelectionStrategy
# Initialize with seed for reproducibility
selector = ExampleSelector(seed=42)
# Basic balanced selection
result = selector.select(records, annotations, count=6)
print(f"Selected {len(result.examples)} examples")
print(f"Strategy: {result.strategy_used}")
print(f"Metadata: {result.metadata}")
Selection Strategies¶
Balanced (Default)¶
Selects a 50/50 mix of accept (score=1) and reject (score=0) cases with random sampling. This prevents the LLM judge from being biased toward one outcome.
result = selector.select(
records=records,
annotations=annotations, # {record_id: 0 or 1}
count=6,
strategy=SelectionStrategy.BALANCED
)
# Metadata includes counts
# {'accepts': 3, 'rejects': 3}
When to use:
- Initial calibration with no prior evaluation data
- General-purpose few-shot selection
- When you want unbiased baseline examples
Behavior with imbalanced data:
- If all annotations are accepts, returns only accepts
- If all annotations are rejects, returns only rejects
- Fills remaining slots from the larger pool when one side is exhausted
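The fallback behavior above can be sketched in plain Python. This is a hypothetical helper for illustration, not the library's implementation; the `balanced_select` name and signature are assumptions:

```python
import random

def balanced_select(records, annotations, count, seed=42):
    """Sketch of BALANCED selection: aim for a 50/50 accept/reject
    split, then top up from the larger pool when one side runs out.
    Illustrative only, not ExampleSelector's actual code."""
    rng = random.Random(seed)
    accepts = [r for r in records if annotations.get(r['id']) == 1]
    rejects = [r for r in records if annotations.get(r['id']) == 0]
    rng.shuffle(accepts)
    rng.shuffle(rejects)
    half = count // 2
    picked = accepts[:half] + rejects[:count - half]
    # One side exhausted: fill the remaining slots from the other pool
    leftover = accepts[half:] + rejects[count - half:]
    picked += leftover[:count - len(picked)]
    return picked
```

With five accepts and five rejects and `count=6`, this yields three of each; with only accepts available, it returns six accepts, matching the imbalanced-data behavior described above.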
Misalignment-Guided¶
Prioritizes cases where the LLM judge disagreed with human annotations. This focuses calibration on the hardest cases: the examples the judge currently gets wrong.
# Requires evaluation results from a prior run
eval_results = [
{'id': 'rec_1', 'score': 1}, # LLM predicted 1
{'id': 'rec_2', 'score': 0}, # LLM predicted 0
# ...
]
result = selector.select(
records=records,
annotations=annotations,
count=6,
strategy=SelectionStrategy.MISALIGNMENT_GUIDED,
eval_results=eval_results
)
# Metadata shows misalignment stats
# {
# 'false_positives_selected': 2,
# 'false_negatives_selected': 1,
# 'total_fp_available': 5,
# 'total_fn_available': 3,
# }
When to use:
- After running an initial evaluation and seeing low agreement
- To improve calibration on specific failure modes
- Iterative refinement workflow
How it works:
- Identifies false positives (LLM=1, Human=0) and false negatives (LLM=0, Human=1)
- Allocates ~1/3 of slots each to FP and FN cases
- Fills remaining slots with balanced aligned examples
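The disagreement bookkeeping can be sketched as follows. This is a hypothetical helper (the field names follow the `eval_results` shape shown above), not the library's implementation:

```python
def split_misalignments(annotations, eval_results):
    """Sketch of the FP/FN split used by MISALIGNMENT_GUIDED:
    FP = LLM scored 1 but the human scored 0; FN = the reverse.
    Illustrative only."""
    fps, fns, aligned = [], [], []
    for res in eval_results:
        human = annotations.get(res['id'])
        if human is None:
            continue  # never annotated by a human; skip
        if res['score'] == 1 and human == 0:
            fps.append(res['id'])
        elif res['score'] == 0 and human == 1:
            fns.append(res['id'])
        else:
            aligned.append(res['id'])
    return fps, fns, aligned
```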
Pattern-Aware¶
Samples from discovered patterns to ensure coverage across failure categories. Requires results from Pattern Discovery.
from axion.caliber import PatternDiscovery
# First, discover patterns in your annotations
discovery = PatternDiscovery(model_name='gpt-4o')
patterns_result = await discovery.discover(annotated_items)
# Then select examples covering those patterns
result = selector.select(
records=records,
annotations=annotations,
count=6,
strategy=SelectionStrategy.PATTERN_AWARE,
patterns=patterns_result.patterns
)
# Metadata shows pattern coverage
# {
# 'patterns_covered': ['Missing Context', 'Factual Errors', 'Too Brief'],
# 'total_patterns': 5,
# }
When to use:
- After Pattern Discovery reveals distinct failure categories
- To ensure few-shot examples represent all failure types
- When certain patterns are underrepresented in random selection
How it works:
- Takes one example from each discovered pattern (up to count)
- Fills remaining slots with balanced selection
- Handles patterns with unknown record IDs gracefully
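A rough sketch of the one-per-pattern pass. The `record_ids` and `name` fields on each pattern are assumptions made for illustration; check the actual shape of `patterns_result.patterns` in your version:

```python
def cover_patterns(patterns, known_ids, count):
    """Sketch of PATTERN_AWARE coverage: take the first record ID
    from each pattern that exists in our data, stopping at `count`.
    Unknown record IDs are skipped rather than raising."""
    picked, covered = [], []
    for pattern in patterns:
        if len(picked) >= count:
            break
        for rid in pattern.get('record_ids', []):
            if rid in known_ids and rid not in picked:
                picked.append(rid)
                covered.append(pattern['name'])
                break  # one example per pattern
    return picked, covered
```

Remaining slots (here, `count - len(picked)`) would then be filled with balanced selection, as described above.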
Data Format¶
Records should be a list of dicts, each with an id (or record_id) field plus whatever fields your metric needs.
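For example (field values are illustrative; only the ID field is required by the selector, the rest depend on your metric):

```python
records = [
    {
        'id': 'rec_1',
        'query': 'What is the capital of France?',
        'actual_output': 'Paris is the capital of France.',
        'expected_output': 'Paris',
    },
    {
        'id': 'rec_2',
        'query': 'Who wrote Hamlet?',
        'actual_output': 'Charles Dickens wrote Hamlet.',
        'expected_output': 'William Shakespeare',
    },
]
```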
Annotations map record IDs to binary scores (1 for accept, 0 for reject).
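For example, matching the records above:

```python
annotations = {
    'rec_1': 1,  # human accepted this output
    'rec_2': 0,  # human rejected this output
}
```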
Metric Integration¶
ExampleSelector returns generic dicts. The caller converts them to the format expected by their metric.
from axion.caliber import ExampleSelector, CaliberMetric
selector = ExampleSelector(seed=42)
result = selector.select(records, annotations, count=6)
# Format for CaliberHQ
examples = [
{
'input': {
'query': r['query'],
'actual_output': r['actual_output'],
'expected_output': r.get('expected_output'),
},
'output': {
'score': annotations[r['id']],
'reason': r.get('human_reasoning', ''),
}
}
for r in result.examples
]
evaluator = CaliberMetric(criteria="...", examples=examples)
from axion.caliber import ExampleSelector
from axion.metrics import Faithfulness
from axion.metrics.faithfulness import FaithfulnessInput
from axion.schema import MetricEvaluationResult
selector = ExampleSelector(seed=42)
result = selector.select(records, annotations, count=6)
# Convert to metric-specific Pydantic format
examples = [
(
FaithfulnessInput(
query=r['query'],
actual_output=r['actual_output'],
retrieval_context=r['retrieval_context'],
),
MetricEvaluationResult(
score=annotations[r['id']],
reason=r.get('human_reasoning', ''),
)
)
for r in result.examples
]
metric = Faithfulness(examples=examples)
Best Practices¶
Strategy Selection¶
Use this decision tree to pick the right strategy:
graph TD
Q1["Do you have evaluation results from a prior run?"]
Q1 -->|"Yes"| Q2["Do you have Pattern Discovery results?"]
Q1 -->|"No"| B["BALANCED<br/><small>Unbiased baseline</small>"]
Q2 -->|"Yes"| P["PATTERN_AWARE<br/><small>Target failure modes</small>"]
Q2 -->|"No"| M["MISALIGNMENT_GUIDED<br/><small>Focus on hard cases</small>"]
Auto-Selection Helper¶
from axion.caliber import SelectionStrategy

def auto_select_strategy(eval_results=None, patterns=None):
"""Automatically choose the best strategy based on available data."""
if patterns:
return SelectionStrategy.PATTERN_AWARE
elif eval_results:
return SelectionStrategy.MISALIGNMENT_GUIDED
else:
return SelectionStrategy.BALANCED
Reproducibility¶
Always set a seed (for example, ExampleSelector(seed=42) as in the Quick Start) so that repeated runs select the same examples.
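The guarantee rests on a seeded random number generator, which the standard library demonstrates directly (a minimal sketch of the principle, not the selector's internals):

```python
import random

# Two generators created with the same seed draw identical samples;
# ExampleSelector(seed=42) relies on the same principle so that
# .select() returns the same few-shot examples on every run.
pool = list(range(100))
first = random.Random(42).sample(pool, 6)
second = random.Random(42).sample(pool, 6)
assert first == second  # identical selections
```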
Example Count¶
- Start with 4–6 examples for initial calibration
- Increase to 8–10 for complex criteria
- More examples increase cost but may improve calibration