CaliberHQ

Align LLM judges with human judgment through calibration, pattern discovery, and iterative prompt optimization.

Why Use Calibration?

LLM-as-judge evaluation is only as reliable as the alignment between machine and human judgment. CaliberHQ gives you a structured workflow to measure that alignment, discover where your judge fails, and optimize its evaluation prompts — all without writing code.

🎯 Alignment Measurement

Cohen's Kappa, F1 score, precision, recall, and specificity — quantify exactly how well your LLM judge matches human labels.

🔍 Pattern Discovery

AI-powered clustering surfaces systematic failure patterns from your annotation notes — LLM, BERTopic, or hybrid methods.

⚙️ Prompt Optimization

Auto-generate improved evaluation criteria based on misalignment analysis. Apply and re-evaluate in a single click.

📊 Confusion Matrix

Visual breakdown of true positives, false positives, true negatives, and false negatives between human and LLM decisions.

Quick Start

Calibrate your LLM judge in three steps:

1. Upload Ground Truth Data

Upload a CSV with id, query, and actual_output columns. If your data already contains score or human_score columns, CaliberHQ will auto-detect them and offer to import the existing labels.

2. Review & Label

Walk through each record and label it Accept or Reject using the annotation card. Add notes to explain your reasoning — these power pattern discovery in step 3. Use keyboard shortcuts (A accept, R reject, arrow keys to navigate) for rapid labeling.

3. Build Eval

Configure your LLM judge (model, prompt, few-shot examples), run the evaluation, and compare the judge's decisions against your human labels. View alignment metrics, confusion matrix, comparison table, and AI-powered misalignment insights.

💡 Tip
If you already have labeled data from a previous annotation session, upload a CSV with a score column (0/1 values). CaliberHQ detects pre-labeled data automatically and lets you skip straight to the Build Eval step.

Page Anatomy

CaliberHQ uses a three-step wizard. Here is the full page layout with each major section labeled:

[Screenshot: CaliberHQ — Build Eval step showing pattern insights, judge configuration, run evaluation bar, and results panel.]
  1. Page Header & Actions — Title, subtitle, and context-sensitive action buttons (Configure, Export, Start Over) that appear once data is loaded.
  2. Step Navigation — Three-step wizard: Upload, Review & Label, Build Eval. Completed steps show a green checkmark, and you can click back to any completed step.
  3. Pattern Insights & Judge Config — Two-column layout: the left panel shows AI-discovered error patterns from your annotation notes, and the right panel configures the LLM judge (model, system prompt, evaluation criteria, few-shot examples).
  4. Run Evaluation — Triggers the LLM judge on all annotated records. Shows the current model and record count, and displays a progress bar while running.
  5. Evaluation Results — Three tabs: Comparison (side-by-side human vs. LLM decisions), Metrics (alignment stats + confusion matrix), and Insights (misalignment analysis + prompt optimization).

Step 1: Upload Ground Truth

The Upload step accepts CSV files with query-response pairs that you will annotate to create ground truth for calibrating your LLM judge.

[Screenshot: Upload step with file drop zone and pre-labeled data auto-detection. CaliberHQ can import existing labels so you skip manual annotation.]

Required CSV columns:

  • id - Unique identifier for each record
  • query - The user query or input
  • actual_output - The AI-generated response to evaluate

Optionally include expected_output for reference comparison. The drop zone accepts .csv, .json, and .jsonl files. When pre-labeled data is detected, CaliberHQ reports the accept/reject counts and lets you choose between using the existing labels or re-annotating from scratch.

CSV Format

The minimum required columns are:

id,query,actual_output
rec-001,"What are the refund policies?","We offer a 30-day refund policy for all purchases..."
rec-002,"How do I reset my password?","You can reset your password by clicking..."

Optionally include pre-existing labels:

id,query,actual_output,score,notes
rec-001,"What are the refund policies?","We offer...",1,"Accurate and complete"
rec-002,"How do I reset my password?","Click the button...",0,"Missing 2FA instructions"
ℹ️ Info
CaliberHQ auto-detects score, label, or human_score columns with binary 0/1 values. It also recognizes notes, reason, or reasoning columns for importing annotation notes.
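The detection rule described in the note above can be sketched in a few lines of Python. This is an illustrative sketch, not CaliberHQ's actual implementation; it scans the header for the recognized column aliases and only accepts a label column whose values are all binary:

```python
import csv
from io import StringIO

# Column aliases CaliberHQ recognizes (per the note above).
LABEL_COLUMNS = ("score", "label", "human_score")
NOTES_COLUMNS = ("notes", "reason", "reasoning")

def detect_prelabeled(csv_text: str):
    """Return (label_col, notes_col) if the CSV carries usable labels."""
    rows = list(csv.DictReader(StringIO(csv_text)))
    label_col = next((c for c in LABEL_COLUMNS if rows and c in rows[0]), None)
    # Only treat the column as labels if every value is binary 0/1.
    if label_col and not all(r[label_col] in ("0", "1") for r in rows):
        label_col = None
    notes_col = next((c for c in NOTES_COLUMNS if rows and c in rows[0]), None)
    return label_col, notes_col

sample = 'id,query,actual_output,score,notes\nrec-001,"Q?","A.",1,"ok"\n'
print(detect_prelabeled(sample))  # → ('score', 'notes')
```

A file without any of the recognized columns yields `(None, None)`, which sends you to manual annotation in Step 2.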

Step 2: Review & Label

The annotation interface presents each record one at a time. Your job: decide whether the AI response passes or fails your quality bar, and optionally explain why.

[Screenshot: Review & Label step showing the annotation card with query, AI response, accept/reject buttons, and the progress sidebar with record navigator.]

  • Annotation Card — Shows the query and AI response. Use the green Accept / red Reject buttons to label each record. The Notes field is optional but powers pattern discovery in Step 3.
  • Progress Sidebar — Sticky sidebar showing overall progress, a visual record navigator (green = accepted, red = rejected, highlighted = current), and counts. Click any square to jump to that record.

Keyboard Shortcuts

CaliberHQ supports rapid annotation with keyboard shortcuts:

| Key | Action |
|-----|--------|
| A | Accept the current record (score = 1) and auto-advance |
| R | Reject the current record (score = 0) and auto-advance |
| ← | Navigate to the previous record |
| → | Navigate to the next record |
| N | Focus the notes text field |
📝 Note
Keyboard shortcuts are disabled when you are typing in the notes field. Press Escape or click outside the field to re-enable them.

Step 3: Build Eval

The Build Eval step is where calibration happens. It has three sections: Pattern Insights + Judge Configuration (side by side), Run Evaluation, and Evaluation Results.

Pattern Insights

If you added notes during annotation, CaliberHQ can cluster them into recurring error patterns using AI. Three clustering methods are available:

| Method | How It Works | Best For |
|--------|--------------|----------|
| LLM | Uses your configured LLM to semantically cluster notes | Small datasets, interpretable categories |
| BERTopic | Embedding-based topic modeling (runs server-side) | Large datasets, discovering hidden structure |
| Hybrid | BERTopic clusters + LLM-generated category names | Best of both — structure + interpretability |

Each discovered pattern shows a category name, example count, and a representative note excerpt. Click "Use as criteria" on any pattern to inject it directly into your evaluation criteria prompt.
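One way to picture the LLM clustering method is a single prompt that packs every annotation note together and asks the model for named categories with representative excerpts. The prompt wording and helper below are illustrative assumptions, not CaliberHQ's actual prompt:

```python
def build_clustering_prompt(notes: list[str]) -> str:
    """Assemble a prompt asking an LLM to group annotation notes
    into named failure-pattern categories (illustrative wording)."""
    numbered = "\n".join(f"{i + 1}. {n}" for i, n in enumerate(notes))
    return (
        "Cluster the following annotation notes into recurring error patterns.\n"
        "For each pattern, return a short category name, the note numbers it\n"
        "covers, and one representative excerpt.\n\n"
        f"Notes:\n{numbered}"
    )

notes = [
    "Response only covers one option but question asks for all",
    "Mentions a feature that doesn't exist in the product",
    "Response is too casual for a professional support query",
]
prompt = build_clustering_prompt(notes)
```

The model's answer maps directly onto the pattern cards: category name, covered-note count, and excerpt.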

💡 Tip
Detailed annotation notes produce better patterns. Instead of "bad response," write "Response only mentions Slack but ignores email, webhook, and API integrations." The more specific, the more useful the discovered patterns.

LLM Judge Configuration

The right panel contains three tabs for configuring how the LLM judge evaluates records:

Model Tab

Select the LLM provider and model. Supported providers include OpenAI, Anthropic, and others configured in your AXIS settings. The provider must have an API key set in Settings for the run button to activate.

Prompt Tab

Edit the system prompt and evaluation criteria that the LLM judge uses. The default prompt includes a template variable {evaluation_criteria} that gets replaced with the criteria text. The prompt instructs the judge to output a binary Score (0 or 1) plus reasoning.

Examples Tab

Add few-shot examples to guide the LLM judge. Each example includes a query, response, expected score, and reasoning. More examples generally improve consistency but increase token usage.
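Taken together, the three tabs define a single judge prompt: the system prompt with {evaluation_criteria} substituted in, followed by the few-shot examples rendered in the same Score/Reasoning format the judge is asked to emit. A minimal sketch of that assembly, with helper and field names that are assumptions rather than CaliberHQ's API:

```python
def build_judge_prompt(system_prompt: str, criteria: str,
                       examples: list[dict]) -> str:
    """Substitute the criteria into the template, then append few-shot
    examples in the Score/Reasoning output format."""
    prompt = system_prompt.replace("{evaluation_criteria}", criteria)
    for ex in examples:  # each: query, response, score, reasoning
        prompt += (
            f"\n\nExample:\nQuery: {ex['query']}\nResponse: {ex['response']}\n"
            f"Score: {ex['score']}\nReasoning: {ex['reasoning']}"
        )
    return prompt

system = ("You are an expert evaluator.\n{evaluation_criteria}\n"
          "Score: [0 or 1]\nReasoning: [explanation]")
criteria = "Evaluate whether the response directly addresses the user's query."
few_shot = [{"query": "How do I reset my password?",
             "response": "Click the button...",
             "score": 0, "reasoning": "Missing 2FA instructions"}]
prompt = build_judge_prompt(system, criteria, few_shot)
```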

Running the Evaluation

The "Run Evaluation" bar shows the current model and record count. Click to start:

  1. The LLM evaluates each annotated record using your configured prompt
  2. A progress bar shows completion percentage in real-time
  3. Results appear in the panel below once the run completes
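The run itself can be pictured as a loop over annotated records, with each judge response parsed back into a binary score. In this sketch, `call_judge` is a placeholder for whatever LLM client performs the request, and the regex assumes the `Score: [0 or 1]` / `Reasoning:` output format the default prompt requests:

```python
import re

def parse_judgment(text: str):
    """Extract the binary score and reasoning from judge output
    shaped like 'Score: 1\\nReasoning: ...'."""
    score = re.search(r"Score:\s*([01])", text)
    reasoning = re.search(r"Reasoning:\s*(.+)", text, re.DOTALL)
    if not score:
        return None  # malformed output; surface as an error
    return int(score.group(1)), reasoning.group(1).strip() if reasoning else ""

def run_evaluation(records, call_judge):
    """Judge every annotated record and report progress."""
    results = []
    for i, rec in enumerate(records, 1):
        verdict = parse_judgment(call_judge(rec))
        results.append({"id": rec["id"], "llm": verdict})
        print(f"progress: {i}/{len(records)}")
    return results
```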
⚠️ Warning
Running the evaluation requires a valid API key for the selected provider. Check Settings → API Keys if the Run button is disabled. Each evaluation run costs API tokens proportional to the number of annotated records.

Comparison Table

The default results tab. Shows every record with both the human label and LLM label side by side, so you can see exactly where they agree and disagree.

[Screenshot: Comparison table with filter pills (All / Aligned / Misaligned), human vs. LLM decisions, and expandable row detail showing the response and LLM reasoning.]
  • Filter pills — quickly isolate aligned or misaligned results
  • Misaligned rows have a subtle red background tint
  • Click the chevron on any row to expand and see the full AI response and LLM reasoning side by side
  • Export JSON exports the filtered results for external analysis

Alignment Metrics

The Metrics tab provides a comprehensive statistical view of how well your LLM judge aligns with human judgment.

[Screenshot: Metrics tab showing primary KPIs (Cohen's Kappa, F1, Alignment), detailed secondary metrics (Precision, Recall, Specificity), and the confusion matrix with interpretation.]

Metric Definitions

| Metric | Formula | What It Tells You |
|--------|---------|-------------------|
| Cohen's Kappa | (P_o - P_e) / (1 - P_e) | Agreement beyond chance. Values: <0.2 slight, 0.2–0.4 fair, 0.4–0.6 moderate, 0.6–0.8 substantial, >0.8 almost perfect. |
| F1 Score | 2 * (P * R) / (P + R) | Harmonic mean of precision and recall. A balanced measure of overall accuracy. |
| Precision | TP / (TP + FP) | Of all LLM accepts, how many were actually good? Low precision = judge is too lenient. |
| Recall | TP / (TP + FN) | Of all human accepts, how many did the LLM find? Low recall = judge is too strict. |
| Specificity | TN / (TN + FP) | Of all human rejects, how many did the LLM correctly reject? |
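These formulas can be verified with a short script. The sketch below computes each metric from raw confusion-matrix counts; the counts in the usage line are illustrative, not from a real run:

```python
def alignment_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute judge-alignment metrics from confusion-matrix counts."""
    n = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    p_o = (tp + tn) / n  # observed agreement (alignment rate)
    # Expected chance agreement from the marginal accept/reject rates.
    p_e = ((tp + fn) / n) * ((tp + fp) / n) + ((fp + tn) / n) * ((fn + tn) / n)
    kappa = (p_o - p_e) / (1 - p_e)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1,
            "alignment": p_o, "kappa": kappa}

m = alignment_metrics(tp=13, fp=2, tn=7, fn=3)  # illustrative counts
```

Note that Kappa needs the marginal rates, not just the diagonal: two judges who both accept almost everything will agree often by chance alone, and Kappa discounts exactly that.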

Confusion Matrix

The 2x2 confusion matrix visualizes the four possible outcomes of each comparison:

  • True Positives (TP) — Both human and LLM accepted (green cells)
  • True Negatives (TN) — Both human and LLM rejected (green cells)
  • False Positives (FP) — LLM accepted but human rejected. The judge is too lenient here.
  • False Negatives (FN) — LLM rejected but human accepted. The judge is too strict here.

The interpretation text below the matrix automatically tells you whether your judge tends to be too lenient (FP > FN) or too strict (FN > FP), guiding your next prompt iteration.
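That lenient-vs-strict rule is simple enough to state directly in code; a small sketch (the function name and phrasing are illustrative):

```python
def judge_tendency(fp: int, fn: int) -> str:
    """Classify the judge's bias from misalignment counts: more false
    positives means too lenient, more false negatives means too strict."""
    if fp > fn:
        return "too lenient: tighten the evaluation criteria"
    if fn > fp:
        return "too strict: relax the evaluation criteria"
    return "balanced: errors are evenly split"
```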

Insights & Prompt Optimization

The Insights tab provides AI-powered analysis of misalignment patterns and automatic prompt optimization to improve judge accuracy.

[Screenshot: Insights tab showing misalignment analysis (summary, FP/FN breakdown, recommendations) and prompt optimization with one-click apply.]

The iterative calibration loop:

  1. Analyze Patterns — get an AI summary of why misalignment occurs
  2. Review Recommendations — actionable suggestions to improve your prompt
  3. Generate Optimized Prompt — AI rewrites your evaluation criteria
  4. Apply & Re-run — one-click applies the new criteria and re-evaluates
  5. Repeat until Kappa reaches your target level
💡 Tip
A Cohen's Kappa above 0.6 (substantial agreement) is generally considered sufficient for production use. Above 0.8 is considered almost perfect. If your Kappa is below 0.4, focus on the false positive vs. false negative balance to decide whether to tighten or relax your criteria.

The Calibration Workflow

CaliberHQ is designed for iterative improvement. Here is the recommended workflow:

1. Establish Baseline

Upload your data, annotate a representative sample (at least 20–30 records), and run the first evaluation with default prompts. Record the baseline Kappa.

2. Discover Patterns

Use Pattern Insights to identify systematic failure modes. Convert the most impactful patterns into evaluation criteria using the "Use as criteria" action.

3. Add Few-Shot Examples

For tricky edge cases, add few-shot examples showing the expected judgment. Prioritize examples from the misaligned set.

4. Iterate & Re-run

Re-run the evaluation after each change. Track how Kappa, F1, and the confusion matrix evolve. Use the Insights tab for AI-driven prompt optimization when you plateau.

5. Export & Deploy

Once satisfied with alignment, export the calibrated judge configuration (system prompt + evaluation criteria + few-shot examples) for use in your production evaluation pipeline.
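A sketch of what such an export could look like as JSON; the schema shown here is an assumption for illustration, not CaliberHQ's exact export format:

```python
import json

# Hypothetical bundle of the three calibrated pieces named above.
judge_config = {
    "model": "gpt-4o",
    "system_prompt": "You are an expert evaluator...\n{evaluation_criteria}",
    "evaluation_criteria": ("1. Directly addresses the user's query\n"
                            "2. Provides accurate and factual information"),
    "few_shot_examples": [
        {"query": "How do I reset my password?",
         "response": "Click the button...",
         "score": 0, "reasoning": "Missing 2FA instructions"},
    ],
}

config_json = json.dumps(judge_config, indent=2)
```

Your production pipeline can then load this file and rebuild the exact judge prompt that achieved the calibrated Kappa.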

Configuration Reference

CaliberHQ can be fine-tuned through the Configure modal (gear icon in the header) and the Build Eval step panels.

Column Configuration

Accessible from the gear icon on the annotation card. Lets you control:

  • ID Column — which column uniquely identifies each record (default: id)
  • Display Columns — which columns appear in the annotation card (default: query, actual_output)
  • LLM Columns — which columns are sent to the LLM judge for evaluation

Judge Config Persistence

CaliberHQ persists the following to localStorage so you don't lose work between sessions:

  • Human annotations (score + notes per record ID)
  • Judge configuration (model, provider, system prompt, evaluation criteria, few-shot examples)
  • Column configuration and file metadata
⚠️ Warning
Raw data (the CSV records themselves) is not persisted to localStorage due to size constraints. If you close the tab and reopen it, you'll see your previous annotations and config but will need to re-upload the data file. Use "Start Over" to clear all persisted state.

Next Steps

AXIS Documentation