
Example Selector

The ExampleSelector provides intelligent selection of few-shot examples for LLM-as-judge calibration. Instead of naive slicing (examples[:n]), it offers strategies that improve calibration quality by balancing accept/reject cases, prioritizing misaligned examples, or covering discovered patterns.

What You'll Learn

  1. Balanced Selection: 50/50 accept/reject sampling for unbiased baselines; this is the default strategy.
  2. Misalignment-Guided: prioritize false positives and false negatives from prior evaluation runs to focus on hard cases.
  3. Pattern-Aware: sample from discovered failure patterns to ensure coverage across all failure categories.
  4. Metric Integration: convert selected examples to CaliberHQ dict format or a Pydantic-based metric format.

Quick Start

from axion.caliber import ExampleSelector, SelectionStrategy

# Initialize with seed for reproducibility
selector = ExampleSelector(seed=42)

# Basic balanced selection
result = selector.select(records, annotations, count=6)

print(f"Selected {len(result.examples)} examples")
print(f"Strategy: {result.strategy_used}")
print(f"Metadata: {result.metadata}")

Selection Strategies

Balanced (default)

Selects a 50/50 mix of accept (score=1) and reject (score=0) cases with random sampling. This prevents the LLM judge from being biased toward one outcome.

result = selector.select(
    records=records,
    annotations=annotations,  # {record_id: 0 or 1}
    count=6,
    strategy=SelectionStrategy.BALANCED
)

# Metadata includes counts
# {'accepts': 3, 'rejects': 3}

When to use:

  • Initial calibration with no prior evaluation data
  • General-purpose few-shot selection
  • When you want unbiased baseline examples

Behavior with imbalanced data (see the sketch below):

  • If all annotations are accepts, returns only accepts
  • If all annotations are rejects, returns only rejects
  • Fills remaining slots from the larger pool when one side is exhausted
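
As a rough illustration of the all-accepts case (hypothetical data, not output from the library):

# Hypothetical skewed dataset: every record is annotated as an accept.
skewed_annotations = {r['id']: 1 for r in records}

result = selector.select(
    records=records,
    annotations=skewed_annotations,
    count=6,
    strategy=SelectionStrategy.BALANCED,
)

# With no rejects available, only accepts are returned, e.g.
# result.metadata == {'accepts': 6, 'rejects': 0}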

Misalignment-Guided

Prioritizes cases where the LLM judge disagreed with human annotations. This focuses calibration on the hardest cases — the examples the judge gets wrong.

# Requires evaluation results from a prior run
eval_results = [
    {'id': 'rec_1', 'score': 1},  # LLM predicted 1
    {'id': 'rec_2', 'score': 0},  # LLM predicted 0
    # ...
]

result = selector.select(
    records=records,
    annotations=annotations,
    count=6,
    strategy=SelectionStrategy.MISALIGNMENT_GUIDED,
    eval_results=eval_results
)

# Metadata shows misalignment stats
# {
#     'false_positives_selected': 2,
#     'false_negatives_selected': 1,
#     'total_fp_available': 5,
#     'total_fn_available': 3,
# }

When to use:

  • After running an initial evaluation and seeing low agreement
  • To improve calibration on specific failure modes
  • Iterative refinement workflow

How it works (see the sketch below):

  1. Identifies false positives (LLM=1, Human=0) and false negatives (LLM=0, Human=1)
  2. Allocates ~1/3 of slots each to FP and FN cases
  3. Fills remaining slots with balanced aligned examples
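
The classification step can be sketched as follows; this illustrates the logic above and is not the selector's internal implementation:

# Map record IDs to LLM scores, accepting either 'score' or 'llm_score'.
llm_scores = {
    e['id']: e.get('score', e.get('llm_score'))
    for e in eval_results
}

# False positives: LLM accepted, human rejected. False negatives: the reverse.
false_positives = [rid for rid, human in annotations.items()
                   if human == 0 and llm_scores.get(rid) == 1]
false_negatives = [rid for rid, human in annotations.items()
                   if human == 1 and llm_scores.get(rid) == 0]

# Roughly a third of the slots each go to FP and FN cases; the remainder is
# filled with balanced, correctly aligned examples.
count = 6
fp_slots = min(len(false_positives), count // 3)
fn_slots = min(len(false_negatives), count // 3)
remaining = count - fp_slots - fn_slots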

Pattern-Aware

Samples from discovered patterns to ensure coverage across failure categories. Requires results from Pattern Discovery.

from axion.caliber import PatternDiscovery

# First, discover patterns in your annotations
discovery = PatternDiscovery(model_name='gpt-4o')
patterns_result = await discovery.discover(annotated_items)

# Then select examples covering those patterns
result = selector.select(
    records=records,
    annotations=annotations,
    count=6,
    strategy=SelectionStrategy.PATTERN_AWARE,
    patterns=patterns_result.patterns
)

# Metadata shows pattern coverage
# {
#     'patterns_covered': ['Missing Context', 'Factual Errors', 'Too Brief'],
#     'total_patterns': 5,
# }

When to use:

  • After Pattern Discovery reveals distinct failure categories
  • To ensure few-shot examples represent all failure types
  • When certain patterns are underrepresented in random selection

How it works (see the sketch below):

  1. Takes one example from each discovered pattern (up to count)
  2. Fills remaining slots with balanced selection
  3. Handles patterns with unknown record IDs gracefully
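
A rough sketch of that allocation, assuming each pattern object exposes a list of example record IDs (the attribute name used here is hypothetical; check PatternDiscovery's output for the real one):

# One example per discovered pattern, skipping record IDs we don't know about.
pattern_ids = []
for pattern in patterns_result.patterns:
    for rid in getattr(pattern, 'example_record_ids', []):  # hypothetical attribute
        if rid in annotations:
            pattern_ids.append(rid)
            break  # take only one example from this pattern
    if len(pattern_ids) >= 6:
        break

# Any slots left over (6 - len(pattern_ids)) are filled with balanced selection.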

Data Format

Records should be a list of dicts with an id or record_id field:

records = [
    {
        'id': 'rec_1',  # or 'record_id'
        'query': 'What is Python?',
        'actual_output': 'A programming language',
        # ... other fields
    },
    # ...
]

Annotations map record IDs to binary scores:

annotations = {
    'rec_1': 1,  # Accept
    'rec_2': 0,  # Reject
    'rec_3': 1,
    # ...
}

For MISALIGNMENT_GUIDED, provide the LLM's scores:

eval_results = [
    {'id': 'rec_1', 'score': 1},      # 'score' field
    {'id': 'rec_2', 'llm_score': 0},  # or 'llm_score' field
    # ...
]

Metric Integration

ExampleSelector returns generic dicts. The caller converts them to the format expected by their metric.

from axion.caliber import ExampleSelector, CaliberMetric

selector = ExampleSelector(seed=42)
result = selector.select(records, annotations, count=6)

# Format for CaliberHQ
examples = [
    {
        'input': {
            'query': r['query'],
            'actual_output': r['actual_output'],
            'expected_output': r.get('expected_output'),
        },
        'output': {
            'score': annotations[r['id']],
            'reason': r.get('human_reasoning', ''),
        }
    }
    for r in result.examples
]

evaluator = CaliberMetric(criteria="...", examples=examples)

For a Pydantic-based metric such as Faithfulness, convert to the metric's own input model instead:

from axion.caliber import ExampleSelector
from axion.metrics import Faithfulness
from axion.metrics.faithfulness import FaithfulnessInput
from axion.schema import MetricEvaluationResult

selector = ExampleSelector(seed=42)
result = selector.select(records, annotations, count=6)

# Convert to metric-specific Pydantic format
examples = [
    (
        FaithfulnessInput(
            query=r['query'],
            actual_output=r['actual_output'],
            retrieval_context=r['retrieval_context'],
        ),
        MetricEvaluationResult(
            score=annotations[r['id']],
            reason=r.get('human_reasoning', ''),
        )
    )
    for r in result.examples
]

metric = Faithfulness(examples=examples)

Best Practices

Strategy Selection

Use this decision tree to pick the right strategy:

graph TD
    Q1["Do you have evaluation results from a prior run?"]
    Q1 -->|"Yes"| Q2["Do you have Pattern Discovery results?"]
    Q1 -->|"No"| B["BALANCED<br/><small>Unbiased baseline</small>"]
    Q2 -->|"Yes"| P["PATTERN_AWARE<br/><small>Target failure modes</small>"]
    Q2 -->|"No"| M["MISALIGNMENT_GUIDED<br/><small>Focus on hard cases</small>"]

Auto-Selection Helper

from axion.caliber import SelectionStrategy

def auto_select_strategy(eval_results=None, patterns=None):
    """Automatically choose the best strategy based on available data."""
    if patterns:
        return SelectionStrategy.PATTERN_AWARE
    elif eval_results:
        return SelectionStrategy.MISALIGNMENT_GUIDED
    else:
        return SelectionStrategy.BALANCED
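
Example usage, reusing the variables from the earlier sections (here assuming you have eval_results from a prior run but no discovered patterns):

strategy = auto_select_strategy(eval_results=eval_results, patterns=None)

result = selector.select(
    records=records,
    annotations=annotations,
    count=6,
    strategy=strategy,
    eval_results=eval_results,
)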

Reproducibility

Always set a seed for reproducible results:

selector = ExampleSelector(seed=42)

Example Count

  • Start with 4–6 examples for initial calibration
  • Increase to 8–10 for complex criteria
  • More examples increase cost but may improve calibration