Caliber API Reference¶
LLM-as-judge calibration and alignment tools for improving evaluation quality.
```python
from axion.caliber import (
    CaliberMetric,
    ExampleSelector, SelectionStrategy, SelectionResult,
    PatternDiscovery, DiscoveredPattern, PatternDiscoveryResult,
    EvidenceItem, EvidencePipeline, LearningArtifact, PipelineResult,
    Provenance, MetadataConfig, ClusteringMethod, AnnotatedItem,
    InMemorySink, JsonlSink, InMemoryDeduper, EmbeddingDeduper,
    MisalignmentAnalyzer, MisalignmentAnalysis, MisalignmentPattern,
    PromptOptimizer, OptimizedPrompt, PromptSuggestion,
    CaliberRenderer, NotebookCaliberRenderer,
    ConsoleCaliberRenderer, JsonCaliberRenderer,
)
```
- CaliberMetric – Core metric for measuring LLM judge alignment against human ground truth scores.
- Pattern Discovery – Cluster any text evidence into themes and distill actionable learning artifacts via a full pipeline.
- Misalignment Analysis – Identify systematic disagreements between human and LLM judges and surface root causes.
- Prompt Optimization – Automatically generate improved evaluation prompts that better align with human judgment.
CaliberMetric¶
axion.caliber.CaliberMetric ¶
CaliberMetric(instruction: str, model_name: Optional[str] = None, llm_provider: Optional[str] = None, examples: Optional[List[Dict]] = None, required_fields: Optional[List[str]] = None, **kwargs)
Bases: BaseMetric
Dynamically configured metric for CaliberHQ evaluation.
This metric uses the provided criteria/instruction for LLM-as-judge evaluation.
Initialize the CaliberMetric.
Parameters:
- instruction (str) – The LLM-as-a-judge prompt/criteria
- model_name (Optional[str], default: None) – The name of the LLM to use
- llm_provider (Optional[str], default: None) – The provider of the LLM
- examples (Optional[List[Dict]], default: None) – Few-shot examples from the UI
- required_fields (Optional[List[str]], default: None) – Required input fields for the metric
Example Selection¶
ExampleSelector¶
axion.caliber.ExampleSelector ¶
Selects few-shot examples for LLM-as-judge calibration.
Example

```python
selector = ExampleSelector()

# Simple balanced selection
result = selector.select(records, annotations, count=6)

# Misalignment-guided (requires eval results)
result = selector.select(
    records, annotations, count=6,
    strategy=SelectionStrategy.MISALIGNMENT_GUIDED,
    eval_results=results,
)

# Pattern-aware (requires pattern discovery results)
result = selector.select(
    records, annotations, count=6,
    strategy=SelectionStrategy.PATTERN_AWARE,
    patterns=discovered_patterns,
)
```
Initialize selector.
Parameters:
- seed (Optional[int], default: None) – Random seed for reproducibility
select ¶
select(records: List[Dict[str, Any]], annotations: Dict[str, int], count: int = 6, strategy: SelectionStrategy = BALANCED, eval_results: Optional[List[Dict[str, Any]]] = None, patterns: Optional[List[DiscoveredPattern]] = None) -> SelectionResult
Select few-shot examples.
Parameters:
- records (List[Dict[str, Any]]) – List of records with 'id', 'query', 'actual_output', etc.
- annotations (Dict[str, int]) – Dict mapping record_id -> human score (0 or 1)
- count (int, default: 6) – Number of examples to select
- strategy (SelectionStrategy, default: BALANCED) – Selection strategy to use
- eval_results (Optional[List[Dict[str, Any]]], default: None) – Evaluation results (required for MISALIGNMENT_GUIDED)
- patterns (Optional[List[DiscoveredPattern]], default: None) – Discovered patterns (required for PATTERN_AWARE)

Returns:

- SelectionResult – SelectionResult with selected examples and metadata
SelectionStrategy¶
axion.caliber.SelectionStrategy ¶
Bases: str, Enum
Strategies for selecting few-shot examples.
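The examples above reference three members: BALANCED, MISALIGNMENT_GUIDED, and PATTERN_AWARE. Because the class mixes in str, members compare equal to plain strings. The sketch below is an illustrative stand-in; the string values chosen here are assumptions, not necessarily the library's actual values.

```python
from enum import Enum

# Illustrative stand-in for axion.caliber.SelectionStrategy.
# Member names come from the examples above; the string values
# are assumptions and may differ in the real enum.
class SelectionStrategy(str, Enum):
    BALANCED = "balanced"                        # default: even mix of scores
    MISALIGNMENT_GUIDED = "misalignment_guided"  # requires eval_results
    PATTERN_AWARE = "pattern_aware"              # requires discovered patterns

# str-backed members compare equal to plain strings.
print(SelectionStrategy.BALANCED == "balanced")  # True
```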
SelectionResult¶
axion.caliber.SelectionResult (dataclass) ¶
SelectionResult(examples: List[Dict[str, Any]], strategy_used: SelectionStrategy, metadata: Dict[str, Any])
Result of example selection.
Pattern Discovery¶
PatternDiscovery¶
axion.caliber.PatternDiscovery ¶
PatternDiscovery(model_name: Optional[str] = None, llm=None, llm_provider: Optional[str] = None, instruction: Optional[str] = None, max_notes: int = 50, min_category_size: int = 2, bertopic_embedding_model: Any = 'all-MiniLM-L6-v2', metadata_config: Optional[MetadataConfig] = None, excerpt_fn: Optional[ExcerptFn] = None, seed: Optional[int] = None, tracer: Optional[Any] = None)
Discovers patterns in evaluation annotations using LLM-based clustering.
This class leverages LLMHandler for structured output, automatic retries, and consistent LLM configuration with the rest of axion.
Supports both the legacy discover() API (AnnotatedItem dicts) and the
new discover_from_evidence() API (EvidenceItem sequences/dicts).
Example

```python
from axion.caliber import PatternDiscovery, AnnotatedItem

annotations = {
    'rec_1': AnnotatedItem(record_id='rec_1', score=0, notes='Missing context'),
    'rec_2': AnnotatedItem(record_id='rec_2', score=0, notes='Lacks detail'),
}
discovery = PatternDiscovery(model_name='gpt-4o', llm_provider='openai')
result = await discovery.discover(annotations)
```
discover (async) ¶
discover(annotations: Union[Dict[str, AnnotatedItem], Dict[str, Dict]], method: ClusteringMethod = LLM) -> PatternDiscoveryResult
Backward-compatible entry point.
Normalizes AnnotatedItem dicts into EvidenceItem dicts and
delegates to discover_from_evidence().
discover_from_evidence (async) ¶
discover_from_evidence(evidence: Union[Sequence[EvidenceItem], Dict[str, EvidenceItem]], method: ClusteringMethod = LLM) -> PatternDiscoveryResult
Discover patterns from generic evidence items.
EvidenceItem¶
axion.caliber.EvidenceItem (dataclass) ¶
EvidenceItem(id: str, text: str, metadata: Dict[str, Any] = dict(), source_ref: Optional[str] = None)
A single piece of evidence for clustering.
Represents any text source (conversation, bug report, eval note, etc.) with optional structured metadata and provenance.
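The signature above maps onto a plain dataclass. The sketch below is an illustrative stand-in to show the shape of the data; in practice, import the real class from axion.caliber.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

# Minimal stand-in mirroring the EvidenceItem signature shown above.
# Illustrative only -- use axion.caliber.EvidenceItem in real code.
@dataclass
class EvidenceItem:
    id: str
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    source_ref: Optional[str] = None

# Any text source can become evidence: an eval note, a bug report, etc.
item = EvidenceItem(
    id="ev_1",
    text="Response omitted the refund deadline the user asked about",
    metadata={"channel": "support_chat"},
    source_ref="rec_42",
)
print(item.metadata["channel"])  # support_chat
```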
DiscoveredPattern¶
axion.caliber.DiscoveredPattern (dataclass) ¶
DiscoveredPattern(category: str, description: str, count: int, record_ids: List[str], examples: List[str] = list(), confidence: Optional[float] = None)
A discovered pattern/category from clustering.
PatternDiscoveryResult¶
axion.caliber.PatternDiscoveryResult (dataclass) ¶
PatternDiscoveryResult(patterns: List[DiscoveredPattern], uncategorized: List[str], total_analyzed: int, method: ClusteringMethod, metadata: Dict[str, Any] = dict())
Complete result from pattern discovery.
ClusteringMethod¶
axion.caliber.ClusteringMethod ¶
Bases: str, Enum
Available clustering methods.
EvidencePipeline¶
axion.caliber.EvidencePipeline ¶
EvidencePipeline(model_name: Optional[str] = None, llm=None, llm_provider: Optional[str] = None, clustering_instruction: Optional[str] = None, distillation_instruction: Optional[str] = None, clusterer: Optional[EvidenceClusterer] = None, writer: Optional[ArtifactWriter] = None, method: ClusteringMethod = LLM, recurrence_threshold: int = 2, recurrence_key_fn: Optional[RecurrenceKeyFn] = None, max_learnings_per_cluster: int = 3, max_items: int = 50, min_category_size: int = 2, domain_context: Optional[str] = None, metadata_config: Optional[MetadataConfig] = None, excerpt_fn: Optional[ExcerptFn] = None, seed: Optional[int] = None, max_concurrent_distillations: int = 5, sanitizer: Optional[Sanitizer] = None, sink: Optional[ArtifactSink] = None, deduper: Optional[Deduper] = None, tag_normalizer: Optional[Callable[[List[str]], List[str]]] = None, bertopic_embedding_model: Any = 'all-MiniLM-L6-v2', tracer: Optional[Any] = None)
Orchestrates evidence → clusters → KB-ready learnings.
display ¶
display(result: PipelineResult) -> None
Display a pipeline result (summary + patterns + learnings).
Auto-detects Jupyter notebook vs console environment.
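One of the pipeline's filters, governed by recurrence_threshold (default 2), keeps only evidence whose recurrence key occurs at least that many times, so one-off items don't get distilled into learnings. A conceptual sketch of that idea in plain Python, not the pipeline's actual implementation:

```python
from collections import Counter
from typing import Callable, Dict, List

# Conceptual sketch of recurrence filtering: evidence whose recurrence
# key occurs fewer than `threshold` times is treated as a one-off and
# dropped. Illustrative only -- EvidencePipeline applies this internally.
def filter_recurrent(
    items: List[Dict],
    key_fn: Callable[[Dict], str],
    threshold: int = 2,
) -> List[Dict]:
    counts = Counter(key_fn(item) for item in items)
    return [item for item in items if counts[key_fn(item)] >= threshold]

items = [
    {"id": "e1", "topic": "missing_context"},
    {"id": "e2", "topic": "missing_context"},
    {"id": "e3", "topic": "formatting"},  # occurs once -> filtered out
]
kept = filter_recurrent(items, key_fn=lambda it: it["topic"])
print([it["id"] for it in kept])  # ['e1', 'e2']
```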
LearningArtifact¶
axion.caliber.LearningArtifact (dataclass) ¶
LearningArtifact(title: str, content: str, tags: List[str], confidence: float, supporting_item_ids: List[str], recommended_actions: List[str] = list(), counterexamples: List[str] = list(), scope: Optional[str] = None, when_not_to_apply: Optional[str] = None)
A synthesized insight distilled from a cluster of evidence.
PipelineResult¶
axion.caliber.PipelineResult (dataclass) ¶
PipelineResult(clustering_result: PatternDiscoveryResult, learnings: List[LearningArtifact], filtered_count: int = 0, deduplicated_count: int = 0, validation_repairs: int = 0, sink_ids: List[str] = list(), metadata: Dict[str, Any] = dict())
Complete result from the evidence pipeline.
Provenance¶
axion.caliber.Provenance (dataclass) ¶
Provenance(source_ref: Optional[str] = None, clustering_method: Optional[str] = None, total_analyzed: int = 0, supporting_count: int = 0, cluster_category: Optional[str] = None, timestamp: Optional[str] = None, metadata: Dict[str, Any] = dict())
Structured provenance attached to learning artifacts for sinks.
MetadataConfig¶
axion.caliber.MetadataConfig (dataclass) ¶
MetadataConfig(allowed_keys: Optional[Set[str]] = None, denied_keys: Set[str] = (lambda: set(DEFAULT_DENIED_KEYS))(), max_keys: int = 6, max_value_length: int = 50, max_header_chars: int = 150, include_in_clustering: bool = False, include_in_distillation: bool = True)
Configuration for metadata handling in clustering and distillation.
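The knobs above (allowed_keys, denied_keys, max_keys, max_value_length) suggest how metadata is trimmed before being passed to the clustering or distillation prompts. A conceptual sketch of such a filter, not axion's actual code:

```python
from typing import Any, Dict, Optional, Set

# Conceptual sketch of applying MetadataConfig-style limits: drop denied
# keys, keep only allowed keys (when given), truncate values, and cap the
# number of keys. Illustrative only -- not axion's implementation.
def sanitize_metadata(
    metadata: Dict[str, Any],
    denied_keys: Set[str],
    allowed_keys: Optional[Set[str]] = None,
    max_keys: int = 6,
    max_value_length: int = 50,
) -> Dict[str, str]:
    out: Dict[str, str] = {}
    for key, value in metadata.items():
        if key in denied_keys:
            continue
        if allowed_keys is not None and key not in allowed_keys:
            continue
        if len(out) >= max_keys:
            break
        out[key] = str(value)[:max_value_length]
    return out

meta = {"user_email": "a@b.com", "channel": "chat", "note": "x" * 100}
clean = sanitize_metadata(meta, denied_keys={"user_email"})
print(sorted(clean))       # ['channel', 'note']
print(len(clean["note"]))  # 50
```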
Sinks & Dedupers¶
InMemorySink¶
JsonlSink¶
axion.caliber.JsonlSink ¶
Appends learning artifacts as JSON lines to a file.
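The JSON Lines pattern behind this sink is easy to sketch: one JSON object per artifact, appended one per line. The field names below are illustrative, not JsonlSink's actual schema:

```python
import json
import tempfile
from pathlib import Path

# Sketch of the JSONL append pattern JsonlSink is described as using.
# Field names here are illustrative.
artifacts = [
    {"title": "Missing context", "confidence": 0.8},
    {"title": "Lacks detail", "confidence": 0.6},
]

path = Path(tempfile.mkdtemp()) / "learnings.jsonl"
with path.open("a", encoding="utf-8") as f:
    for artifact in artifacts:
        f.write(json.dumps(artifact) + "\n")

# Reading back: parse each line independently.
loaded = [json.loads(line) for line in path.read_text().splitlines()]
print(len(loaded))  # 2
```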
InMemoryDeduper¶
axion.caliber.InMemoryDeduper ¶
Title-based case-insensitive deduplication for testing.
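The behavior described can be sketched in a few lines; this illustrates the idea rather than the class's actual implementation:

```python
from typing import Dict, List

# Sketch of title-based, case-insensitive deduplication as described
# above. Illustrative only -- use axion.caliber.InMemoryDeduper directly.
def dedupe_by_title(artifacts: List[Dict]) -> List[Dict]:
    seen = set()
    unique = []
    for artifact in artifacts:
        key = artifact["title"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(artifact)
    return unique

artifacts = [
    {"title": "Missing Context"},
    {"title": "missing context"},  # duplicate differing only in case
    {"title": "Lacks detail"},
]
print(len(dedupe_by_title(artifacts)))  # 2
```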
EmbeddingDeduper¶
axion.caliber.EmbeddingDeduper ¶
EmbeddingDeduper(embed_model=None, embed_model_name: str = 'text-embedding-3-small', similarity_threshold: float = 0.85, max_stored: int = 1000, reset_per_run: bool = True)
Embedding-based cosine similarity deduplication.
Requires the axion[embeddings] extra. Raises ImportError with a clear message if dependencies are unavailable (no silent fallback).
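The core duplicate check is cosine similarity against the configured threshold (default 0.85). A self-contained sketch using plain lists as embeddings; the real class obtains embeddings from a model such as 'text-embedding-3-small':

```python
import math
from typing import List

# Cosine similarity between two embedding vectors: the dot product
# divided by the product of the vector norms.
def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Two artifacts count as duplicates when similarity meets the threshold.
def is_duplicate(a: List[float], b: List[float], threshold: float = 0.85) -> bool:
    return cosine_similarity(a, b) >= threshold

print(is_duplicate([1.0, 0.0], [1.0, 0.1]))  # True  (nearly parallel)
print(is_duplicate([1.0, 0.0], [0.0, 1.0]))  # False (orthogonal)
```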
AnnotatedItem (Legacy)¶
axion.caliber.AnnotatedItem (dataclass) ¶
AnnotatedItem(record_id: str, score: int, notes: Optional[str] = None, timestamp: Optional[str] = None, query: Optional[str] = None, actual_output: Optional[str] = None)
A single annotated item with optional notes.
Misalignment Analysis¶
MisalignmentAnalyzer¶
axion.caliber.MisalignmentAnalyzer ¶
MisalignmentAnalyzer(model_name: Optional[str] = None, llm: Optional[LLMRunnable] = None, llm_provider: Optional[str] = None, instruction: Optional[str] = None, max_examples: int = 10)
Analyzes misalignment between LLM judges and human annotators.
This class leverages LLMHandler for structured output, automatic retries, and consistent LLM configuration with the rest of axion.
Example

```python
from axion.caliber import MisalignmentAnalyzer

results = [
    {'record_id': 'r1', 'human_score': 1, 'llm_score': 0,
     'query': '...', 'actual_output': '...', 'llm_reasoning': '...'},
    {'record_id': 'r2', 'human_score': 0, 'llm_score': 1,
     'query': '...', 'actual_output': '...', 'llm_reasoning': '...'},
]
criteria = "Evaluate whether the response is accurate and helpful."

# Using model_name/provider (recommended)
analyzer = MisalignmentAnalyzer(model_name='gpt-4o', llm_provider='openai')
analysis = await analyzer.analyze(results, criteria)

print(f'Summary: {analysis.summary}')
print(f'Recommendations: {analysis.recommendations}')
```
Initialize MisalignmentAnalyzer.
Parameters:
- model_name (Optional[str], default: None) – Name of the LLM model (e.g., 'gpt-4o', 'claude-sonnet-4-20250514')
- llm (Optional[LLMRunnable], default: None) – Pre-configured LLM instance
- llm_provider (Optional[str], default: None) – LLM provider ('openai', 'anthropic')
- instruction (Optional[str], default: None) – Custom instruction to override default analysis prompt
- max_examples (int, default: 10) – Max examples per category to include in analysis
analyze (async) ¶
analyze(results: Union[List[Dict[str, Any]], List[Any]], evaluation_criteria: str) -> MisalignmentAnalysis
Analyze misalignment patterns asynchronously.
Parameters:
- results (Union[List[Dict[str, Any]], List[Any]]) – List of evaluation results with human_score, llm_score, etc.
- evaluation_criteria (str) – The current evaluation criteria being used

Returns:

- MisalignmentAnalysis – MisalignmentAnalysis with patterns, summary, and recommendations
MisalignmentAnalysis¶
axion.caliber.MisalignmentAnalysis (dataclass) ¶
MisalignmentAnalysis(total_misaligned: int, false_positives: int, false_negatives: int, patterns: List[MisalignmentPattern], summary: str, recommendations: List[str], metadata: Dict[str, Any] = dict())
Complete result from misalignment analysis.
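The false_positives and false_negatives fields count the two disagreement directions. A sketch of how they could be derived from the result dicts shown in the MisalignmentAnalyzer example, assuming the convention that a false positive means the LLM judge scored 1 where the human scored 0:

```python
from typing import Dict, List, Tuple

# Sketch: derive misalignment counts from eval results, using the
# result-dict shape from the MisalignmentAnalyzer example above.
# Convention assumed here: false positive = LLM scored 1, human scored 0.
def count_misalignment(results: List[Dict]) -> Tuple[int, int, int]:
    false_positives = sum(
        1 for r in results if r["llm_score"] == 1 and r["human_score"] == 0
    )
    false_negatives = sum(
        1 for r in results if r["llm_score"] == 0 and r["human_score"] == 1
    )
    return false_positives + false_negatives, false_positives, false_negatives

results = [
    {"record_id": "r1", "human_score": 1, "llm_score": 0},  # false negative
    {"record_id": "r2", "human_score": 0, "llm_score": 1},  # false positive
    {"record_id": "r3", "human_score": 1, "llm_score": 1},  # aligned
]
total, fps, fns = count_misalignment(results)
print(total, fps, fns)  # 2 1 1
```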
MisalignmentPattern¶
axion.caliber.MisalignmentPattern (dataclass) ¶
A discovered pattern in misalignment analysis.
Prompt Optimization¶
PromptOptimizer¶
axion.caliber.PromptOptimizer ¶
PromptOptimizer(model_name: Optional[str] = None, llm: Optional[LLMRunnable] = None, llm_provider: Optional[str] = None, instruction: Optional[str] = None, max_examples: int = 10)
Optimizes evaluation prompts based on misalignment analysis.
This class leverages LLMHandler for structured output, automatic retries, and consistent LLM configuration with the rest of axion.
Example

```python
from axion.caliber import PromptOptimizer

results = [
    {'record_id': 'r1', 'human_score': 1, 'llm_score': 0,
     'query': '...', 'actual_output': '...', 'llm_reasoning': '...'},
]
criteria = "Evaluate whether the response is accurate."
system_prompt = "You are an evaluator..."

optimizer = PromptOptimizer(model_name='gpt-4o', llm_provider='openai')
optimized = await optimizer.optimize(results, criteria, system_prompt)

print(f'Optimized criteria: {optimized.optimized_criteria}')
```
Initialize PromptOptimizer.
Parameters:
- model_name (Optional[str], default: None) – Name of the LLM model (e.g., 'gpt-4o', 'claude-sonnet-4-20250514')
- llm (Optional[LLMRunnable], default: None) – Pre-configured LLM instance
- llm_provider (Optional[str], default: None) – LLM provider ('openai', 'anthropic')
- instruction (Optional[str], default: None) – Custom instruction to override default optimization prompt
- max_examples (int, default: 10) – Max examples per category to include in optimization
optimize (async) ¶
optimize(results: Union[List[Dict[str, Any]], List[Any]], current_criteria: str, system_prompt: str = '') -> OptimizedPrompt
Optimize evaluation criteria asynchronously.
Parameters:
- results (Union[List[Dict[str, Any]], List[Any]]) – List of evaluation results with human_score, llm_score, etc.
- current_criteria (str) – The current evaluation criteria to improve
- system_prompt (str, default: '') – The current system prompt (optional)

Returns:

- OptimizedPrompt – OptimizedPrompt with improved criteria and suggestions
OptimizedPrompt¶
axion.caliber.OptimizedPrompt (dataclass) ¶
OptimizedPrompt(original_criteria: str, optimized_criteria: str, suggestions: List[PromptSuggestion], expected_improvement: str, metadata: Dict[str, Any] = dict())
Complete result from prompt optimization.
PromptSuggestion¶
axion.caliber.PromptSuggestion (dataclass) ¶
A single suggestion for prompt improvement.
Renderers¶
CaliberRenderer¶
axion.caliber.CaliberRenderer ¶
Bases: ABC
Abstract renderer interface for CaliberHQ.
Implementations provide UI rendering for different environments:

- NotebookCaliberRenderer: Jupyter notebooks with rich styling
- ConsoleCaliberRenderer: Terminal/CLI output
- JsonCaliberRenderer: JSON-only (for web APIs)
render_record (abstractmethod) ¶
Render a single record for annotation.
Parameters:
- record (UploadedRecord) – The record to display
- annotation (Optional[Annotation], default: None) – Existing annotation if any
render_annotation_progress (abstractmethod) ¶
Render annotation progress.
Parameters:
- state (AnnotationState) – Current annotation state
render_evaluation_result (abstractmethod) ¶
Render evaluation results.
Parameters:
- result (EvaluationResult) – Evaluation result to display
render_misalignment_analysis (abstractmethod) ¶
Render misalignment analysis.
Parameters:
- analysis (MisalignmentAnalysis) – Analysis result to display
NotebookCaliberRenderer¶
axion.caliber.NotebookCaliberRenderer ¶
Bases: CaliberRenderer
Notebook-specific renderer for CaliberHQ.
Provides rich HTML and styled pandas output for Jupyter notebooks.
ConsoleCaliberRenderer¶
axion.caliber.ConsoleCaliberRenderer ¶
Bases: CaliberRenderer
Console/terminal renderer for CaliberHQ.
Provides text-based output suitable for CLI environments.
JsonCaliberRenderer¶
axion.caliber.JsonCaliberRenderer ¶
Bases: CaliberRenderer
JSON-first renderer for CaliberHQ.
Provides no-op rendering for API environments where data is consumed as JSON rather than displayed.