Example pattern insights surfaced from annotation notes:

- “Response only covers one option but question asks for all...”
- “Mentions a feature that doesn't exist in the product...”
- “Response is too casual for a professional support query...”
Align LLM judges with human judgment through calibration, pattern discovery, and iterative prompt optimization.
LLM-as-judge evaluation is only as reliable as the alignment between machine and human judgment. CaliberHQ gives you a structured workflow to measure that alignment, discover where your judge fails, and optimize its evaluation prompts — all without writing code.
- Cohen's Kappa, F1 score, precision, recall, and specificity — quantify exactly how well your LLM judge matches human labels.
- AI-powered clustering surfaces systematic failure patterns from your annotation notes — LLM, BERTopic, or hybrid methods.
- Auto-generate improved evaluation criteria based on misalignment analysis. Apply and re-evaluate in a single click.
- Visual breakdown of true positives, false positives, true negatives, and false negatives between human and LLM decisions.
Calibrate your LLM judge in three steps:

1. **Upload.** Upload a CSV with `id`, `query`, and `actual_output` columns. If your data already contains `score` or `human_score` columns, CaliberHQ will auto-detect them and offer to import the existing labels.
2. **Annotate.** Walk through each record and label it Accept or Reject using the annotation card. Add notes to explain your reasoning — these power pattern discovery in step 3. Use keyboard shortcuts (A accept, R reject, arrow keys to navigate) for rapid labeling.
3. **Build Eval.** Configure your LLM judge (model, prompt, few-shot examples), run the evaluation, and compare the judge's decisions against your human labels. View alignment metrics, confusion matrix, comparison table, and AI-powered misalignment insights.
If your CSV includes a score column (0/1 values), CaliberHQ detects the pre-labeled data automatically and lets you skip straight to the Build Eval step.
CaliberHQ uses a three-step wizard: Upload, Annotate, and Build Eval.
The Upload step accepts CSV files with query-response pairs that you will annotate to create ground truth for calibrating your LLM judge.
Required CSV columns:
- `id` - Unique identifier for each record
- `query` - The user query or input
- `actual_output` - The AI-generated response to evaluate
- Optional: `expected_output` for reference comparison
When pre-labeled data is found, the Upload step shows a notice like:

> **Pre-labeled Data Detected**
> Found 18 records with existing labels (12 accepted, 6 rejected) including notes.
The minimum required columns are:
```csv
id,query,actual_output
rec-001,"What are the refund policies?","We offer a 30-day refund policy for all purchases..."
rec-002,"How do I reset my password?","You can reset your password by clicking..."
```
Optionally include pre-existing labels:
```csv
id,query,actual_output,score,notes
rec-001,"What are the refund policies?","We offer...",1,"Accurate and complete"
rec-002,"How do I reset my password?","Click the button...",0,"Missing 2FA instructions"
```
CaliberHQ auto-detects `score`, `label`, or `human_score` columns with binary 0/1 values. It also recognizes `notes`, `reason`, or `reasoning` columns for importing annotation notes.
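The detection behavior described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not CaliberHQ's actual code; `detect_prelabeled` is a hypothetical helper name, while the recognized column names come from the docs above.

```python
import csv
import io

# Column names CaliberHQ recognizes, per the docs above.
LABEL_COLUMNS = ("score", "label", "human_score")
NOTE_COLUMNS = ("notes", "reason", "reasoning")

def detect_prelabeled(csv_text: str) -> dict:
    """Check a CSV for pre-existing binary labels and annotation notes."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    fields = rows[0].keys() if rows else []
    label_col = next((c for c in LABEL_COLUMNS if c in fields), None)
    note_col = next((c for c in NOTE_COLUMNS if c in fields), None)
    if label_col is None:
        return {"prelabeled": False}
    return {
        "prelabeled": True,
        "accepted": sum(1 for r in rows if r[label_col] == "1"),
        "rejected": sum(1 for r in rows if r[label_col] == "0"),
        "has_notes": note_col is not None,
    }

sample = """id,query,actual_output,score,notes
rec-001,"What are the refund policies?","We offer...",1,"Accurate and complete"
rec-002,"How do I reset my password?","Click the button...",0,"Missing 2FA instructions"
"""
print(detect_prelabeled(sample))
```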
The annotation interface presents each record one at a time. Your job: decide whether the AI response passes or fails your quality bar, and optionally explain why.
CaliberHQ supports rapid annotation with keyboard shortcuts:
| Key | Action |
|---|---|
| A | Accept the current record (score = 1) and auto-advance |
| R | Reject the current record (score = 0) and auto-advance |
| ← | Navigate to previous record |
| → | Navigate to next record |
| N | Focus the notes text field |
The Build Eval step is where calibration happens. It has three sections: Pattern Insights + Judge Configuration (side by side), Run Evaluation, and Evaluation Results.
If you added notes during annotation, CaliberHQ can cluster them into recurring error patterns using AI. Three clustering methods are available:
| Method | How It Works | Best For |
|---|---|---|
| LLM | Uses your configured LLM to semantically cluster notes | Small datasets, interpretable categories |
| BERTopic | Embedding-based topic modeling (runs server-side) | Large datasets, discovering hidden structure |
| Hybrid | BERTopic clusters + LLM-generated category names | Best of both — structure + interpretability |
Each discovered pattern shows a category name, example count, and a representative note excerpt. Click "Use as criteria" on any pattern to inject it directly into your evaluation criteria prompt.
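Conceptually, "Use as criteria" amounts to appending the discovered pattern to the criteria text as an explicit check. The sketch below is an assumption about that behavior, not CaliberHQ's implementation; the function name and bullet wording are illustrative.

```python
# Hypothetical sketch: turn a discovered failure pattern into a criteria line.
def add_pattern_as_criteria(criteria: str, pattern: str) -> str:
    bullet = f"- Reject responses that match this failure pattern: {pattern}"
    return f"{criteria.rstrip()}\n{bullet}\n"

criteria = "The response must be accurate and grounded in the product docs.\n"
updated = add_pattern_as_criteria(
    criteria, "Mentions a feature that doesn't exist in the product"
)
print(updated)
```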
The right panel contains three tabs for configuring how the LLM judge evaluates records:
Select the LLM provider and model. Supported providers include OpenAI, Anthropic, and others configured in your AXIS settings. The provider must have an API key set in Settings for the run button to activate.
Edit the system prompt and evaluation criteria that the LLM judge uses. The default prompt includes a template variable {evaluation_criteria} that gets replaced with the criteria text. The prompt instructs the judge to output a binary Score (0 or 1) plus reasoning.
Add few-shot examples to guide the LLM judge. Each example includes a query, response, expected score, and reasoning. More examples generally improve consistency but increase token usage.
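Putting the three tabs together, a judge configuration might conceptually look like the following. The field names and the `render_system_prompt` helper are assumptions for illustration, not CaliberHQ's schema; the `{evaluation_criteria}` template variable is the one described above.

```python
# Illustrative judge configuration combining the Model, Prompt, and
# Few-shot tabs. Field names are assumptions, not CaliberHQ's schema.
judge_config = {
    "model": "gpt-4o",  # any provider/model configured in Settings
    "system_prompt": (
        "You are an evaluation judge. Apply these criteria:\n"
        "{evaluation_criteria}\n"
        "Output a binary Score (0 or 1) plus reasoning."
    ),
    "evaluation_criteria": "Accept only responses that fully answer every part of the query.",
    "few_shot_examples": [
        {
            "query": "What are the refund policies?",
            "response": "We offer a 30-day refund policy...",
            "score": 1,
            "reasoning": "Accurate and complete.",
        },
    ],
}

def render_system_prompt(config: dict) -> str:
    # Substitute the template variable with the criteria text.
    return config["system_prompt"].replace(
        "{evaluation_criteria}", config["evaluation_criteria"]
    )

print(render_system_prompt(judge_config))
```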
The "Run Evaluation" bar shows the current model and record count; click it to start the run.
The default results tab shows every record with both the human label and the LLM label side by side, so you can see exactly where they agree and disagree.
| ID | Human | LLM | Status |
|---|---|---|---|
| rec-001... | ✓ Accept | ✓ Accept | Aligned |
| rec-002... | ✗ Reject | ✓ Accept | Misaligned |
| rec-003... | ✗ Reject | ✗ Reject | Aligned |
| rec-007... | ✗ Reject | ✓ Accept | Misaligned |
| rec-012... | ✓ Accept | ✓ Accept | Aligned |
The Metrics tab provides a comprehensive statistical view of how well your LLM judge aligns with human judgment.
|  | LLM: Accept | LLM: Reject |
|---|---|---|
| Human: Accept | 13 | 3 |
| Human: Reject | 2 | 7 |
The LLM judge tends to be stricter than human reviewers. Consider relaxing the evaluation criteria. Overall accuracy: 80.0%.
| Metric | Formula | What It Tells You |
|---|---|---|
| Cohen's Kappa | `(P_o - P_e) / (1 - P_e)` | Agreement beyond chance. Values: <0.2 slight, 0.2–0.4 fair, 0.4–0.6 moderate, 0.6–0.8 substantial, >0.8 almost perfect. |
| F1 Score | `2 * (P * R) / (P + R)` | Harmonic mean of precision and recall. A balanced measure of overall accuracy. |
| Precision | `TP / (TP + FP)` | Of all LLM accepts, how many were actually good? Low precision = judge is too lenient. |
| Recall | `TP / (TP + FN)` | Of all human accepts, how many did the LLM find? Low recall = judge is too strict. |
| Specificity | `TN / (TN + FP)` | Of all human rejects, how many did the LLM correctly reject? |
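The formulas can be checked against the example confusion matrix shown earlier (TP=13, FN=3, FP=2, TN=7, with "Accept" as the positive class):

```python
# Recompute the metrics from the example confusion matrix above.
TP, FN, FP, TN = 13, 3, 2, 7
n = TP + FN + FP + TN

accuracy = (TP + TN) / n
precision = TP / (TP + FP)
recall = TP / (TP + FN)
specificity = TN / (TN + FP)
f1 = 2 * precision * recall / (precision + recall)

# Cohen's Kappa: observed agreement vs. agreement expected by chance,
# where chance agreement comes from the row/column marginals.
p_o = accuracy
p_e = ((TP + FN) * (TP + FP) + (FP + TN) * (FN + TN)) / n**2
kappa = (p_o - p_e) / (1 - p_e)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"specificity={specificity:.3f} f1={f1:.3f} kappa={kappa:.3f}")

# FN > FP here, so the judge rejects more than humans do: it is too strict.
leaning = "too strict" if FN > FP else "too lenient" if FP > FN else "balanced"
```

The computed accuracy (80.0%) matches the interpretation text on this example, and Kappa ≈ 0.58 lands in the "moderate" band of the table above.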
The 2x2 confusion matrix visualizes the four possible outcomes of each comparison.
The interpretation text below the matrix automatically tells you whether your judge tends to be too lenient (FP > FN) or too strict (FN > FP), guiding your next prompt iteration.
The Insights tab provides AI-powered analysis of misalignment patterns and automatic prompt optimization to improve judge accuracy.
The LLM judge shows a tendency toward false positives, accepting responses that humans reject. This is primarily driven by incomplete coverage checks — the judge fails to verify that all aspects of a question are addressed.
Estimated 10–15% improvement in alignment by addressing completeness verification and adding stricter sub-question coverage requirements.
CaliberHQ is designed for iterative improvement. Here is the recommended calibration loop:
1. **Baseline.** Upload your data, annotate a representative sample (at least 20–30 records), and run the first evaluation with default prompts. Record the baseline Kappa.
2. **Discover patterns.** Use Pattern Insights to identify systematic failure modes. Convert the most impactful patterns into evaluation criteria using the "Use as criteria" action.
3. **Add few-shot examples.** For tricky edge cases, add few-shot examples showing the expected judgment. Prioritize examples from the misaligned set.
4. **Re-evaluate.** Re-run the evaluation after each change. Track how Kappa, F1, and the confusion matrix evolve. Use the Insights tab for AI-driven prompt optimization when you plateau.
5. **Export.** Once satisfied with alignment, export the calibrated judge configuration (system prompt + evaluation criteria + few-shot examples) for use in your production evaluation pipeline.
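A minimal sketch of what an exported judge configuration could look like, assuming a JSON serialization; the structure and field names are illustrative, not CaliberHQ's export format.

```python
import json

# Hypothetical exported configuration: system prompt + criteria + few-shots.
calibrated_judge = {
    "model": "gpt-4o",
    "system_prompt": "You are an evaluation judge...\n{evaluation_criteria}",
    "evaluation_criteria": (
        "Accept only responses that address every sub-question.\n"
        "Reject responses that mention features that don't exist."
    ),
    "few_shot_examples": [
        {"query": "...", "response": "...", "score": 0,
         "reasoning": "Covers only one of the two options asked about."},
    ],
}

# Serialize for storage (write the string to a file in a real pipeline),
# then reload to confirm the round trip is lossless.
serialized = json.dumps(calibrated_judge, indent=2)
reloaded = json.loads(serialized)
print(reloaded["model"])
```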
CaliberHQ can be fine-tuned through the Configure modal (gear icon in the header) and the Build Eval step panels.
Accessible from the gear icon on the annotation card. Lets you control:
- The ID column (`id`)
- The content columns (`query`, `actual_output`)

CaliberHQ persists the following to localStorage so you don't lose work between sessions: