# CaliberHQ
CaliberHQ is a toolkit for calibrating LLM-as-a-judge evaluators against a human-labeled baseline. Instead of writing rubrics in the abstract, you work backward from real outputs so the evaluator learns what actually matters for your use case.
## Quick start
```python
from axion.caliber import CalibrationSession

session = CalibrationSession()

# 1) Upload your data
session.upload_records(
    [
        {"id": "r1", "query": "…", "actual_output": "…"},
        {"id": "r2", "query": "…", "actual_output": "…"},
    ]
)

# 2) Add human annotations
session.annotate("r1", score=1, notes="Good response")
session.annotate("r2", score=0, notes="Factually incorrect")

# 3) Run LLM evaluation + alignment metrics
# (await requires an async context, e.g. a notebook or an async function)
result = await session.evaluate(
    criteria="Score 1 if accurate and helpful, 0 otherwise",
    model_name="gpt-4o",
    llm_provider="openai",
)

print(f"Accuracy: {result.metrics.accuracy:.1%}")
print(f"Cohen's Kappa: {result.metrics.cohen_kappa:.3f}")
```
## The 6-step workflow

1. **Upload**: Load your evaluation data from CSV or programmatically.
2. **Annotate**: Add human judgments (Accept=1, Reject=0) with optional notes.
3. **Evaluate**: Run the LLM judge and compute alignment metrics (accuracy, kappa, F1).
4. **Discover Patterns**: Cluster any text evidence into themes and distill learning artifacts via a full pipeline.
5. **Analyze Misalignments**: Examine false positives and false negatives (judge vs. human).
6. **Optimize**: Produce improved evaluation criteria based on misalignment patterns (see the sketch after the diagram below).
```mermaid
graph LR
    U["Upload"] --> A["Annotate"]
    A --> E["Evaluate"]
    E --> D["Discover Patterns"]
    D --> M["Analyze Misalignments"]
    M --> O["Optimize Prompt"]
    O -->|"Iterate"| E
```
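The quick start above covers steps 1–3. The sketch below shows how steps 4–6 might look with the analysis components listed under Key Components; the method names (`run`, `analyze`, `optimize`) and their arguments are illustrative assumptions, not the documented API:

```python
from axion.caliber import MisalignmentAnalyzer, PatternDiscovery, PromptOptimizer

# 4) Cluster textual evidence into themes (hypothetical `run` signature)
patterns = PatternDiscovery().run(result)

# 5) Split judge/human disagreements into false positives and false negatives
#    (hypothetical `analyze` signature)
misalignments = MisalignmentAnalyzer().analyze(result)

# 6) Draft improved criteria from the misalignment patterns (hypothetical `optimize` signature)
improved_criteria = PromptOptimizer().optimize(
    criteria="Score 1 if accurate and helpful, 0 otherwise",
    misalignments=misalignments,
)
```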
## Usage Patterns
Use `CalibrationSession` for state management and serialization. It is a good fit for scripts, web APIs, and notebooks.

Use the individual components for fine-grained control over each step:
```python
from axion.caliber import AnnotationManager, EvaluationRunner, UploadHandler

upload = UploadHandler().from_csv("data.csv")

manager = AnnotationManager(upload.records)
manager.annotate("r1", score=1, notes="Good")

# Run evaluation with your own config (see `EvaluationConfig`)
runner = EvaluationRunner()
result = await runner.run(upload.records, manager.get_annotations_dict())
```
Use `WebCaliberHQ` for JSON-serializable results with optional progress callbacks — ideal for web or UI integrations.
```python
from axion.align import WebCaliberHQ

# Assumes `dataset` and a metric such as PassFailMetric() are constructed
# as in "Defining a Metric" below
web_eval = WebCaliberHQ(dataset, PassFailMetric())

payload = web_eval.execute(
    as_dict=True,
    on_progress=lambda current, total: print(current, total),
)

# payload["results"] -> list of row dicts
# payload["metrics"] -> summary metrics
# payload["confusion_matrix"] -> confusion matrix dict
```
You can also construct a `Dataset` from uploaded records.
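A minimal sketch, assuming each uploaded record is a dict carrying the `id`, `query`, and `actual_output` keys used in the quick start (with `expected_output` optional):

```python
from axion.dataset import Dataset, DatasetItem

# Map uploaded record dicts onto DatasetItem fields (adjust to your column names)
items = [
    DatasetItem(
        id=record["id"],
        query=record["query"],
        actual_output=record["actual_output"],
        expected_output=record.get("expected_output"),
    )
    for record in upload.records
]
dataset = Dataset(items=items)
```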
## Key Components
| Category | Imports |
|---|---|
| Core session | `CalibrationSession` |
| Step components | `UploadHandler`, `AnnotationManager`, `EvaluationRunner` |
| Analysis tools | `PatternDiscovery`, `EvidencePipeline`, `MisalignmentAnalyzer`, `PromptOptimizer`, `ExampleSelector` |
| Renderers | `ConsoleCaliberRenderer`, `NotebookCaliberRenderer`, `JsonCaliberRenderer` |
## Demo
Run the demo script to see the full workflow in action:
```bash
# Basic demo (no API key needed)
python examples/caliber_demo.py

# Full end-to-end with LLM calls
OPENAI_API_KEY=your-key python examples/caliber_demo.py --full
```
## Workflow Detail
The CaliberHQ workflow progresses through three main phases:
```mermaid
graph TD
    subgraph "Phase 1: Data Preparation"
        U["Upload CSV"] --> V["Validate Columns"]
        V --> S["Load into Store"]
    end
    subgraph "Phase 2: Human Review"
        S --> R["Review Records"]
        R --> J["Accept / Reject"]
        J --> N["Add Notes"]
    end
    subgraph "Phase 3: Evaluation & Alignment"
        N --> E["Run LLM Judge"]
        E --> M["Compute Alignment Metrics"]
        M --> MA["Misalignment Analysis"]
        MA --> PO["Prompt Optimization"]
        PO --> PI["Pattern Insights"]
        PI -->|"Iterate"| E
    end
```
## Alignment Metrics
After evaluation, CaliberHQ computes alignment between the LLM judge and human annotations:
| Metric | What It Measures |
|---|---|
| Accuracy | How often the LLM agrees with humans |
| Precision | How often LLM "Accept" is correct |
| Recall | How many human "Accept" cases the LLM catches |
| F1 Score | Harmonic mean of precision and recall |
| Cohen's Kappa | Agreement beyond chance (>0.6 = substantial) |
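These are the standard classification metrics computed over paired human/judge labels. As a point of reference (not CaliberHQ internals), the same numbers can be computed directly with scikit-learn:

```python
from sklearn.metrics import (
    accuracy_score,
    cohen_kappa_score,
    f1_score,
    precision_score,
    recall_score,
)

human = [1, 0, 1, 1, 0]  # human Accept (1) / Reject (0) labels
judge = [1, 0, 0, 1, 1]  # LLM judge labels for the same records

print(f"Accuracy:  {accuracy_score(human, judge):.1%}")
print(f"Precision: {precision_score(human, judge):.3f}")    # how often judge "Accept" is correct
print(f"Recall:    {recall_score(human, judge):.3f}")       # share of human "Accept" the judge catches
print(f"F1:        {f1_score(human, judge):.3f}")
print(f"Kappa:     {cohen_kappa_score(human, judge):.3f}")  # agreement beyond chance
```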
## Misalignment Analysis
The system identifies two types of disagreement:
- **False positives**: LLM accepts, human rejects. The judge is too lenient — missing quality issues humans catch.
- **False negatives**: LLM rejects, human accepts. The judge is too strict — penalizing acceptable responses.
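A minimal sketch of how the two buckets are defined, given paired human/judge scores keyed by record id (plain Python, independent of the `MisalignmentAnalyzer` API):

```python
human = {"r1": 1, "r2": 0, "r3": 1}
judge = {"r1": 1, "r2": 1, "r3": 0}

# False positive: judge accepts (1) where the human rejected (0) -> judge too lenient
false_positives = [rid for rid in human if judge[rid] == 1 and human[rid] == 0]

# False negative: judge rejects (0) where the human accepted (1) -> judge too strict
false_negatives = [rid for rid in human if judge[rid] == 0 and human[rid] == 1]
```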
## Feedback Loop
```mermaid
graph LR
    O["Optimized Prompt"] --> R["Re-run Evaluation"]
    R --> C["Compare Metrics"]
    C -->|"Aligned"| D["Done"]
    C -->|"Not aligned"| O
```
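In code, the loop amounts to re-running `session.evaluate` from the quick start and checking whether agreement is high enough, optimizing the criteria otherwise. A sketch, where the 0.6 kappa cutoff is the "substantial agreement" rule of thumb from the metrics table and `optimize_criteria` is a hypothetical stand-in for the Optimize step:

```python
criteria = "Score 1 if accurate and helpful, 0 otherwise"

for _ in range(3):  # cap the number of calibration rounds
    result = await session.evaluate(
        criteria=criteria,
        model_name="gpt-4o",
        llm_provider="openai",
    )
    if result.metrics.cohen_kappa >= 0.6:  # substantial agreement: stop iterating
        break
    criteria = optimize_criteria(result)  # hypothetical helper producing improved criteria
```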
## Defining a Metric
Define an LLM-as-a-judge metric with a clear instruction. The metric can be as simple as a binary pass/fail rubric.
```python
from axion.align import CaliberHQ
from axion.dataset import Dataset, DatasetItem

# Assumes a PassFailMetric (a binary pass/fail judge metric) is already imported

items = [
    DatasetItem(
        id="item-1",
        query="What is the capital of France?",
        expected_output="Paris",
        actual_output="Paris.",
    ),
    DatasetItem(
        id="item-2",
        query="What is 2+2?",
        expected_output="4",
        actual_output="5",
    ),
]
dataset = Dataset(items=items)

evaluator = CaliberHQ(dataset, PassFailMetric())
evaluator.annotate()  # optional if judgments are already present
results_df = evaluator.execute()
```
## Intelligent Example Selection
When providing few-shot examples to calibrate your LLM judge, use `ExampleSelector` for smarter selection instead of naive slicing:

| Strategy | Use Case |
|---|---|
| `BALANCED` | Default — 50/50 accept/reject sampling for unbiased baselines |
| `MISALIGNMENT_GUIDED` | Prioritize false positive / false negative cases from a prior evaluation run |
| `PATTERN_AWARE` | Cover discovered failure patterns |
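A sketch of what selection could look like; the constructor argument and `select` call below are assumptions for illustration, not the documented interface:

```python
from axion.caliber import ExampleSelector

# Hypothetical usage: pick few-shot examples that target prior judge/human disagreements
selector = ExampleSelector(strategy="MISALIGNMENT_GUIDED")
examples = selector.select(
    records=upload.records,                         # records from the Upload step
    annotations=manager.get_annotations_dict(),     # human judgments from the Annotate step
    k=8,                                            # hypothetical example-count parameter
)
```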
See Example Selector Deep Dive for detailed usage.
## Renderers
CaliberHQ uses a renderer interface so UIs can plug in without changing core logic.
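For instance, the same evaluation result could be rendered to different targets by choosing a renderer; the `render` method name here is an assumption about that interface:

```python
from axion.caliber import ConsoleCaliberRenderer, JsonCaliberRenderer

# Hypothetical usage: same evaluation result, different output targets
ConsoleCaliberRenderer().render(result)           # human-readable console report
payload = JsonCaliberRenderer().render(result)    # JSON-serializable dict for UIs
```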