
Caliber API Reference

LLM-as-judge calibration and alignment tools for improving evaluation quality.

from axion.caliber import (
    CaliberMetric,
    ExampleSelector, SelectionStrategy, SelectionResult,
    PatternDiscovery, DiscoveredPattern, PatternDiscoveryResult,
    EvidenceItem, EvidencePipeline, LearningArtifact, PipelineResult,
    Provenance, MetadataConfig, ClusteringMethod, AnnotatedItem,
    InMemorySink, JsonlSink, InMemoryDeduper, EmbeddingDeduper,
    MisalignmentAnalyzer, MisalignmentAnalysis, MisalignmentPattern,
    PromptOptimizer, OptimizedPrompt, PromptSuggestion,
    CaliberRenderer, NotebookCaliberRenderer,
    ConsoleCaliberRenderer, JsonCaliberRenderer,
)

CaliberMetric

Core metric for measuring LLM judge alignment against human ground truth scores.


Pattern Discovery

Cluster any text evidence into themes and distill actionable learning artifacts via a full pipeline.


Misalignment Analysis

Identify systematic disagreements between human and LLM judges and surface root causes.


Prompt Optimization

Automatically generate improved evaluation prompts that better align with human judgment.


CaliberMetric

axion.caliber.CaliberMetric

CaliberMetric(instruction: str, model_name: Optional[str] = None, llm_provider: Optional[str] = None, examples: Optional[List[Dict]] = None, required_fields: Optional[List[str]] = None, **kwargs)

Bases: BaseMetric

Dynamically configured metric for CaliberHQ evaluation.

This metric uses the provided criteria/instruction for LLM-as-judge evaluation.

Initialize the CaliberMetric.

Parameters:

  • instruction (str) –

    The LLM-as-a-judge prompt/criteria

  • model_name (Optional[str], default: None ) –

    The name of the LLM to use

  • llm_provider (Optional[str], default: None ) –

    The provider of the LLM

  • examples (Optional[List[Dict]], default: None ) –

    Few-shot examples from the UI

  • required_fields (Optional[List[str]], default: None ) –

    Required input fields for the metric


Example Selection

ExampleSelector

axion.caliber.ExampleSelector

ExampleSelector(seed: Optional[int] = None)

Selects few-shot examples for LLM-as-judge calibration.

Example

selector = ExampleSelector()

Simple balanced selection

result = selector.select(records, annotations, count=6)

Misalignment-guided (requires eval results)

result = selector.select(
    records, annotations, count=6,
    strategy=SelectionStrategy.MISALIGNMENT_GUIDED,
    eval_results=results,
)

Pattern-aware (requires Pattern Discovery results)

result = selector.select(
    records, annotations, count=6,
    strategy=SelectionStrategy.PATTERN_AWARE,
    patterns=discovered_patterns,
)

Initialize selector.

Parameters:

  • seed (Optional[int], default: None ) –

    Random seed for reproducibility

select

select(records: List[Dict[str, Any]], annotations: Dict[str, int], count: int = 6, strategy: SelectionStrategy = BALANCED, eval_results: Optional[List[Dict[str, Any]]] = None, patterns: Optional[List[DiscoveredPattern]] = None) -> SelectionResult

Select few-shot examples.

Parameters:

  • records (List[Dict[str, Any]]) –

    List of records with 'id', 'query', 'actual_output', etc.

  • annotations (Dict[str, int]) –

    Dict mapping record_id -> human score (0 or 1)

  • count (int, default: 6 ) –

    Number of examples to select

  • strategy (SelectionStrategy, default: BALANCED ) –

    Selection strategy to use

  • eval_results (Optional[List[Dict[str, Any]]], default: None ) –

    Evaluation results (required for MISALIGNMENT_GUIDED)

  • patterns (Optional[List[DiscoveredPattern]], default: None ) –

    Discovered patterns (required for PATTERN_AWARE)

Returns:

  • SelectionResult –

    The selected examples, the strategy used, and selection metadata

SelectionStrategy

axion.caliber.SelectionStrategy

Bases: str, Enum

Strategies for selecting few-shot examples.

SelectionResult

axion.caliber.SelectionResult dataclass

SelectionResult(examples: List[Dict[str, Any]], strategy_used: SelectionStrategy, metadata: Dict[str, Any])

Result of example selection.
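The records and annotations inputs that drive selection follow a simple shape. A minimal pure-Python illustration (the field values are invented; the field names come from the parameter docs above):

```python
# Records carry at least an 'id' plus the fields the judge sees.
records = [
    {'id': 'rec_1', 'query': 'Capital of France?', 'actual_output': 'Paris.'},
    {'id': 'rec_2', 'query': 'Capital of France?', 'actual_output': 'Lyon.'},
    {'id': 'rec_3', 'query': 'Capital of Japan?', 'actual_output': 'Tokyo.'},
]

# Annotations map record_id -> human score (0 or 1).
annotations = {'rec_1': 1, 'rec_2': 0, 'rec_3': 1}

# A balanced selection needs both score classes represented:
passing = [r for r in records if annotations[r['id']] == 1]
failing = [r for r in records if annotations[r['id']] == 0]
```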


Pattern Discovery

PatternDiscovery

axion.caliber.PatternDiscovery

PatternDiscovery(model_name: Optional[str] = None, llm=None, llm_provider: Optional[str] = None, instruction: Optional[str] = None, max_notes: int = 50, min_category_size: int = 2, bertopic_embedding_model: Any = 'all-MiniLM-L6-v2', metadata_config: Optional[MetadataConfig] = None, excerpt_fn: Optional[ExcerptFn] = None, seed: Optional[int] = None, tracer: Optional[Any] = None)

Discovers patterns in evaluation annotations using LLM-based clustering.

This class leverages LLMHandler for structured output, automatic retries, and consistent LLM configuration with the rest of axion.

Supports both the legacy discover() API (AnnotatedItem dicts) and the new discover_from_evidence() API (EvidenceItem sequences/dicts).

Example

from axion.caliber import PatternDiscovery, AnnotatedItem

annotations = {
    'rec_1': AnnotatedItem(record_id='rec_1', score=0, notes='Missing context'),
    'rec_2': AnnotatedItem(record_id='rec_2', score=0, notes='Lacks detail'),
}
discovery = PatternDiscovery(model_name='gpt-4o', llm_provider='openai')
result = await discovery.discover(annotations)

discover async

discover(annotations: Union[Dict[str, AnnotatedItem], Dict[str, Dict]], method: ClusteringMethod = LLM) -> PatternDiscoveryResult

Backward-compatible entry point.

Normalizes AnnotatedItem dicts into EvidenceItem dicts and delegates to discover_from_evidence().

discover_from_evidence async

discover_from_evidence(evidence: Union[Sequence[EvidenceItem], Dict[str, EvidenceItem]], method: ClusteringMethod = LLM) -> PatternDiscoveryResult

Discover patterns from generic evidence items.

EvidenceItem

axion.caliber.EvidenceItem dataclass

EvidenceItem(id: str, text: str, metadata: Dict[str, Any] = dict(), source_ref: Optional[str] = None)

A single piece of evidence for clustering.

Represents any text source (conversation, bug report, eval note, etc.) with optional structured metadata and provenance.

DiscoveredPattern

axion.caliber.DiscoveredPattern dataclass

DiscoveredPattern(category: str, description: str, count: int, record_ids: List[str], examples: List[str] = list(), confidence: Optional[float] = None)

A discovered pattern/category from clustering.

PatternDiscoveryResult

axion.caliber.PatternDiscoveryResult dataclass

PatternDiscoveryResult(patterns: List[DiscoveredPattern], uncategorized: List[str], total_analyzed: int, method: ClusteringMethod, metadata: Dict[str, Any] = dict())

Complete result from pattern discovery.

ClusteringMethod

axion.caliber.ClusteringMethod

Bases: str, Enum

Available clustering methods.

EvidencePipeline

axion.caliber.EvidencePipeline

EvidencePipeline(model_name: Optional[str] = None, llm=None, llm_provider: Optional[str] = None, clustering_instruction: Optional[str] = None, distillation_instruction: Optional[str] = None, clusterer: Optional[EvidenceClusterer] = None, writer: Optional[ArtifactWriter] = None, method: ClusteringMethod = LLM, recurrence_threshold: int = 2, recurrence_key_fn: Optional[RecurrenceKeyFn] = None, max_learnings_per_cluster: int = 3, max_items: int = 50, min_category_size: int = 2, domain_context: Optional[str] = None, metadata_config: Optional[MetadataConfig] = None, excerpt_fn: Optional[ExcerptFn] = None, seed: Optional[int] = None, max_concurrent_distillations: int = 5, sanitizer: Optional[Sanitizer] = None, sink: Optional[ArtifactSink] = None, deduper: Optional[Deduper] = None, tag_normalizer: Optional[Callable[[List[str]], List[str]]] = None, bertopic_embedding_model: Any = 'all-MiniLM-L6-v2', tracer: Optional[Any] = None)

Orchestrates evidence → clusters → KB-ready learnings.

display

display(result: PipelineResult) -> None

Display a pipeline result (summary + patterns + learnings).

Auto-detects Jupyter notebook vs console environment.

LearningArtifact

axion.caliber.LearningArtifact dataclass

LearningArtifact(title: str, content: str, tags: List[str], confidence: float, supporting_item_ids: List[str], recommended_actions: List[str] = list(), counterexamples: List[str] = list(), scope: Optional[str] = None, when_not_to_apply: Optional[str] = None)

A synthesized insight distilled from a cluster of evidence.

PipelineResult

axion.caliber.PipelineResult dataclass

PipelineResult(clustering_result: PatternDiscoveryResult, learnings: List[LearningArtifact], filtered_count: int = 0, deduplicated_count: int = 0, validation_repairs: int = 0, sink_ids: List[str] = list(), metadata: Dict[str, Any] = dict())

Complete result from the evidence pipeline.

Provenance

axion.caliber.Provenance dataclass

Provenance(source_ref: Optional[str] = None, clustering_method: Optional[str] = None, total_analyzed: int = 0, supporting_count: int = 0, cluster_category: Optional[str] = None, timestamp: Optional[str] = None, metadata: Dict[str, Any] = dict())

Structured provenance attached to learning artifacts for sinks.

MetadataConfig

axion.caliber.MetadataConfig dataclass

MetadataConfig(allowed_keys: Optional[Set[str]] = None, denied_keys: Set[str] = (lambda: set(DEFAULT_DENIED_KEYS))(), max_keys: int = 6, max_value_length: int = 50, max_header_chars: int = 150, include_in_clustering: bool = False, include_in_distillation: bool = True)

Configuration for metadata handling in clustering and distillation.

Sinks & Dedupers

InMemorySink

axion.caliber.InMemorySink

InMemorySink()

Dict-based in-memory sink for testing/prototyping.

JsonlSink

axion.caliber.JsonlSink

JsonlSink(file_path: str)

Appends learning artifacts as JSON lines to a file.

InMemoryDeduper

axion.caliber.InMemoryDeduper

InMemoryDeduper()

Title-based case-insensitive deduplication for testing.

EmbeddingDeduper

axion.caliber.EmbeddingDeduper

EmbeddingDeduper(embed_model=None, embed_model_name: str = 'text-embedding-3-small', similarity_threshold: float = 0.85, max_stored: int = 1000, reset_per_run: bool = True)

Embedding-based cosine similarity deduplication.

Requires the axion[embeddings] extra. Raises ImportError with a clear message if the dependencies are unavailable; there is no silent fallback.

AnnotatedItem (Legacy)

axion.caliber.AnnotatedItem dataclass

AnnotatedItem(record_id: str, score: int, notes: Optional[str] = None, timestamp: Optional[str] = None, query: Optional[str] = None, actual_output: Optional[str] = None)

A single annotated item with optional notes.


Misalignment Analysis

MisalignmentAnalyzer

axion.caliber.MisalignmentAnalyzer

MisalignmentAnalyzer(model_name: Optional[str] = None, llm: Optional[LLMRunnable] = None, llm_provider: Optional[str] = None, instruction: Optional[str] = None, max_examples: int = 10)

Analyzes misalignment between LLM judges and human annotators.

This class leverages LLMHandler for structured output, automatic retries, and consistent LLM configuration with the rest of axion.

Example

from axion.caliber import MisalignmentAnalyzer

results = [
    {'record_id': 'r1', 'human_score': 1, 'llm_score': 0,
     'query': '...', 'actual_output': '...', 'llm_reasoning': '...'},
    {'record_id': 'r2', 'human_score': 0, 'llm_score': 1,
     'query': '...', 'actual_output': '...', 'llm_reasoning': '...'},
]
criteria = "Evaluate whether the response is accurate and helpful."

analyzer = MisalignmentAnalyzer(model_name='gpt-4o', llm_provider='openai')
analysis = await analyzer.analyze(results, criteria)

print(f'Summary: {analysis.summary}')
print(f'Recommendations: {analysis.recommendations}')

Initialize MisalignmentAnalyzer.

Parameters:

  • model_name (Optional[str], default: None ) –

    Name of the LLM model (e.g., 'gpt-4o', 'claude-sonnet-4-20250514')

  • llm (Optional[LLMRunnable], default: None ) –

    Pre-configured LLM instance

  • llm_provider (Optional[str], default: None ) –

    LLM provider ('openai', 'anthropic')

  • instruction (Optional[str], default: None ) –

    Custom instruction to override default analysis prompt

  • max_examples (int, default: 10 ) –

    Max examples per category to include in analysis

analyze async

analyze(results: Union[List[Dict[str, Any]], List[Any]], evaluation_criteria: str) -> MisalignmentAnalysis

Analyze misalignment patterns asynchronously.

Parameters:

  • results (Union[List[Dict[str, Any]], List[Any]]) –

    List of evaluation results with human_score, llm_score, etc.

  • evaluation_criteria (str) –

    The current evaluation criteria being used

Returns:

  • MisalignmentAnalysis –

    Counts of false positives/negatives, discovered patterns, a summary, and recommendations

MisalignmentAnalysis

axion.caliber.MisalignmentAnalysis dataclass

MisalignmentAnalysis(total_misaligned: int, false_positives: int, false_negatives: int, patterns: List[MisalignmentPattern], summary: str, recommendations: List[str], metadata: Dict[str, Any] = dict())

Complete result from misalignment analysis.

MisalignmentPattern

axion.caliber.MisalignmentPattern dataclass

MisalignmentPattern(pattern_type: str, description: str, count: int, example_ids: List[str])

A discovered pattern in misalignment analysis.


Prompt Optimization

PromptOptimizer

axion.caliber.PromptOptimizer

PromptOptimizer(model_name: Optional[str] = None, llm: Optional[LLMRunnable] = None, llm_provider: Optional[str] = None, instruction: Optional[str] = None, max_examples: int = 10)

Optimizes evaluation prompts based on misalignment analysis.

This class leverages LLMHandler for structured output, automatic retries, and consistent LLM configuration with the rest of axion.

Example

from axion.caliber import PromptOptimizer

results = [
    {'record_id': 'r1', 'human_score': 1, 'llm_score': 0,
     'query': '...', 'actual_output': '...', 'llm_reasoning': '...'},
]
criteria = "Evaluate whether the response is accurate."
system_prompt = "You are an evaluator..."

optimizer = PromptOptimizer(model_name='gpt-4o', llm_provider='openai')
optimized = await optimizer.optimize(results, criteria, system_prompt)

print(f'Optimized criteria: {optimized.optimized_criteria}')

Initialize PromptOptimizer.

Parameters:

  • model_name (Optional[str], default: None ) –

    Name of the LLM model (e.g., 'gpt-4o', 'claude-sonnet-4-20250514')

  • llm (Optional[LLMRunnable], default: None ) –

    Pre-configured LLM instance

  • llm_provider (Optional[str], default: None ) –

    LLM provider ('openai', 'anthropic')

  • instruction (Optional[str], default: None ) –

    Custom instruction to override default optimization prompt

  • max_examples (int, default: 10 ) –

    Max examples per category to include in optimization

optimize async

optimize(results: Union[List[Dict[str, Any]], List[Any]], current_criteria: str, system_prompt: str = '') -> OptimizedPrompt

Optimize evaluation criteria asynchronously.

Parameters:

  • results (Union[List[Dict[str, Any]], List[Any]]) –

    List of evaluation results with human_score, llm_score, etc.

  • current_criteria (str) –

    The current evaluation criteria to improve

  • system_prompt (str, default: '' ) –

    The current system prompt (optional)

Returns:

  • OptimizedPrompt

    OptimizedPrompt with improved criteria and suggestions

OptimizedPrompt

axion.caliber.OptimizedPrompt dataclass

OptimizedPrompt(original_criteria: str, optimized_criteria: str, suggestions: List[PromptSuggestion], expected_improvement: str, metadata: Dict[str, Any] = dict())

Complete result from prompt optimization.

PromptSuggestion

axion.caliber.PromptSuggestion dataclass

PromptSuggestion(aspect: str, suggestion: str, rationale: str)

A single suggestion for prompt improvement.


Renderers

CaliberRenderer

axion.caliber.CaliberRenderer

Bases: ABC

Abstract renderer interface for CaliberHQ.

Implementations provide UI rendering for different environments:

  • NotebookCaliberRenderer – Jupyter notebooks with rich styling

  • ConsoleCaliberRenderer – Terminal/CLI output

  • JsonCaliberRenderer – JSON-only (for web APIs)

render_record abstractmethod

render_record(record: UploadedRecord, annotation: Optional[Annotation] = None) -> None

Render a single record for annotation.

Parameters:

  • record (UploadedRecord) –

    The record to display

  • annotation (Optional[Annotation], default: None ) –

    Existing annotation if any

render_annotation_progress abstractmethod

render_annotation_progress(state: AnnotationState) -> None

Render annotation progress.

Parameters:

  • state (AnnotationState) –

    Current annotation state

render_evaluation_result abstractmethod

render_evaluation_result(result: EvaluationResult) -> None

Render evaluation results.

Parameters:

  • result (EvaluationResult) –

    Evaluation result to display

render_misalignment_analysis abstractmethod

render_misalignment_analysis(analysis: 'MisalignmentAnalysis') -> None

Render misalignment analysis.

Parameters:

  • analysis ('MisalignmentAnalysis') –

    Analysis result to display

NotebookCaliberRenderer

axion.caliber.NotebookCaliberRenderer

Bases: CaliberRenderer

Notebook-specific renderer for CaliberHQ.

Provides rich HTML and styled pandas output for Jupyter notebooks.

render_record

render_record(record: UploadedRecord, annotation: Optional[Annotation] = None) -> None

Render a single record for annotation.

render_annotation_progress

render_annotation_progress(state: AnnotationState) -> None

Render annotation progress.

render_evaluation_result

render_evaluation_result(result: EvaluationResult) -> None

Render evaluation results with styled tables.

render_misalignment_analysis

render_misalignment_analysis(analysis: 'MisalignmentAnalysis') -> None

Render misalignment analysis.

ConsoleCaliberRenderer

axion.caliber.ConsoleCaliberRenderer

Bases: CaliberRenderer

Console/terminal renderer for CaliberHQ.

Provides text-based output suitable for CLI environments.

render_record

render_record(record: UploadedRecord, annotation: Optional[Annotation] = None) -> None

Render a single record for annotation.

render_annotation_progress

render_annotation_progress(state: AnnotationState) -> None

Render annotation progress.

render_evaluation_result

render_evaluation_result(result: EvaluationResult) -> None

Render evaluation results.

render_misalignment_analysis

render_misalignment_analysis(analysis: 'MisalignmentAnalysis') -> None

Render misalignment analysis.

JsonCaliberRenderer

axion.caliber.JsonCaliberRenderer

Bases: CaliberRenderer

JSON-first renderer for CaliberHQ.

Provides no-op rendering for API environments where data is consumed as JSON rather than displayed.

render_record

render_record(record: UploadedRecord, annotation: Optional[Annotation] = None) -> None

No-op: JSON renderer does not display records.

render_annotation_progress

render_annotation_progress(state: AnnotationState) -> None

No-op: JSON renderer does not display progress.

render_evaluation_result

render_evaluation_result(result: EvaluationResult) -> None

No-op: JSON renderer does not display results.

render_misalignment_analysis

render_misalignment_analysis(analysis: 'MisalignmentAnalysis') -> None

No-op: JSON renderer does not display analysis.