Example pattern insights surfaced from annotation notes:

- “Response only covers one option but question asks for all...”
- “Mentions a feature that doesn't exist in the product...”
- “Response is too casual for a professional support query...”
Align LLM judges with human judgment through calibration, pattern discovery, and iterative prompt optimization.
LLM-as-judge evaluation is only as reliable as the alignment between machine and human judgment. CaliberHQ gives you a structured workflow to measure that alignment, discover where your judge fails, and optimize its evaluation prompts — all without writing code.
- Cohen's Kappa, F1 score, precision, recall, and specificity — quantify exactly how well your LLM judge matches human labels.
- AI-powered clustering surfaces systematic failure patterns from your annotation notes — LLM, BERTopic, or hybrid methods.
- Auto-generate improved evaluation criteria based on misalignment analysis. Apply and re-evaluate in a single click.
- Visual breakdown of true positives, false positives, true negatives, and false negatives between human and LLM decisions.
Calibrate your LLM judge in three steps:

1. **Upload.** Upload a CSV with `id`, `query`, and `actual_output` columns. If your data already contains `score` or `human_score` columns, CaliberHQ will auto-detect them and offer to import the existing labels.
2. **Annotate.** Walk through each record and label it Accept or Reject using the annotation card. Add notes to explain your reasoning — these power pattern discovery in step 3. Use keyboard shortcuts (A accept, R reject, arrow keys to navigate) for rapid labeling.
3. **Build Eval.** Configure your LLM judge (model, prompt, few-shot examples), run the evaluation, and compare the judge's decisions against your human labels. View alignment metrics, confusion matrix, comparison table, and AI-powered misalignment insights.
If your CSV includes a score column (0/1 values), CaliberHQ detects the pre-labeled data automatically and lets you skip straight to the Build Eval step.
CaliberHQ uses a three-step wizard: Upload, Annotate, and Build Eval.
The Upload step accepts CSV files with query-response pairs that you will annotate to create ground truth for calibrating your LLM judge.
Required CSV columns:
- `id` - Unique identifier for each record
- `query` - The user query or input
- `actual_output` - The AI-generated response to evaluate
- Optional: `expected_output` for reference comparison
When pre-labeled data is found, the Upload step shows a notice like:

> **Pre-labeled Data Detected**
> Found 18 records with existing labels (12 accepted, 6 rejected) including notes.
The minimum required columns are:
```csv
id,query,actual_output
rec-001,"What are the refund policies?","We offer a 30-day refund policy for all purchases..."
rec-002,"How do I reset my password?","You can reset your password by clicking..."
```
Optionally include pre-existing labels:
```csv
id,query,actual_output,score,notes
rec-001,"What are the refund policies?","We offer...",1,"Accurate and complete"
rec-002,"How do I reset my password?","Click the button...",0,"Missing 2FA instructions"
```
CaliberHQ auto-detects `score`, `label`, or `human_score` columns with binary 0/1 values. It also recognizes `notes`, `reason`, or `reasoning` columns for importing annotation notes.
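The detection behavior described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not CaliberHQ's actual code; `detect_prelabeled` is a hypothetical helper name, while the recognized column names come from the docs above.

```python
import csv
import io

# Column names CaliberHQ recognizes, per the docs above.
LABEL_COLUMNS = ("score", "label", "human_score")
NOTE_COLUMNS = ("notes", "reason", "reasoning")

def detect_prelabeled(csv_text: str) -> dict:
    """Check a CSV for pre-existing binary labels and annotation notes."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    fields = rows[0].keys() if rows else []
    label_col = next((c for c in LABEL_COLUMNS if c in fields), None)
    note_col = next((c for c in NOTE_COLUMNS if c in fields), None)
    if label_col is None:
        return {"prelabeled": False}
    return {
        "prelabeled": True,
        "accepted": sum(1 for r in rows if r[label_col] == "1"),
        "rejected": sum(1 for r in rows if r[label_col] == "0"),
        "has_notes": note_col is not None,
    }

sample = """id,query,actual_output,score,notes
rec-001,"What are the refund policies?","We offer...",1,"Accurate and complete"
rec-002,"How do I reset my password?","Click the button...",0,"Missing 2FA instructions"
"""
print(detect_prelabeled(sample))
```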
The annotation interface presents each record one at a time. Your job: decide whether the AI response passes or fails your quality bar, and optionally explain why.
CaliberHQ supports rapid annotation with keyboard shortcuts:
| Key | Action |
|---|---|
| A | Accept the current record (score = 1) and auto-advance |
| R | Reject the current record (score = 0) and auto-advance |
| ← | Navigate to previous record |
| → | Navigate to next record |
| N | Focus the notes text field |
The Build Eval step is where calibration happens. It has three sections: Pattern Insights + Judge Configuration (side by side), Run Evaluation, and Evaluation Results.
If you added notes during annotation, CaliberHQ can cluster them into recurring error patterns using AI. Three clustering methods are available:
| Method | How It Works | Best For |
|---|---|---|
| LLM | Uses your configured LLM to semantically cluster notes | Small datasets, interpretable categories |
| BERTopic | Embedding-based topic modeling (runs server-side) | Large datasets, discovering hidden structure |
| Hybrid | BERTopic clusters + LLM-generated category names | Best of both — structure + interpretability |
Each discovered pattern shows a category name, example count, and a representative note excerpt. Click "Use as criteria" on any pattern to inject it directly into your evaluation criteria prompt.
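Conceptually, "Use as criteria" amounts to appending the discovered pattern to the criteria text as an explicit check. The sketch below is an assumption about that behavior, not CaliberHQ's implementation; the function name and bullet wording are illustrative.

```python
# Hypothetical sketch: turn a discovered failure pattern into a criteria line.
def add_pattern_as_criteria(criteria: str, pattern: str) -> str:
    bullet = f"- Reject responses that match this failure pattern: {pattern}"
    return f"{criteria.rstrip()}\n{bullet}\n"

criteria = "The response must be accurate and grounded in the product docs.\n"
updated = add_pattern_as_criteria(
    criteria, "Mentions a feature that doesn't exist in the product"
)
print(updated)
```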
The right panel contains three tabs for configuring how the LLM judge evaluates records:
Select the LLM provider and model. Supported providers include OpenAI, Anthropic, and others configured in your AXIS settings. The provider must have an API key set in Settings for the run button to activate.
Edit the system prompt and evaluation criteria that the LLM judge uses. The default prompt includes a template variable {evaluation_criteria} that gets replaced with the criteria text. The prompt instructs the judge to output a binary Score (0 or 1) plus reasoning.
Add few-shot examples to guide the LLM judge. Each example includes a query, response, expected score, and reasoning. More examples generally improve consistency but increase token usage.
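Putting the three tabs together, a judge configuration might conceptually look like the following. The field names and the `render_system_prompt` helper are assumptions for illustration, not CaliberHQ's schema; the `{evaluation_criteria}` template variable is the one described above.

```python
# Illustrative judge configuration combining the Model, Prompt, and
# Few-shot tabs. Field names are assumptions, not CaliberHQ's schema.
judge_config = {
    "model": "gpt-4o",  # any provider/model configured in Settings
    "system_prompt": (
        "You are an evaluation judge. Apply these criteria:\n"
        "{evaluation_criteria}\n"
        "Output a binary Score (0 or 1) plus reasoning."
    ),
    "evaluation_criteria": "Accept only responses that fully answer every part of the query.",
    "few_shot_examples": [
        {
            "query": "What are the refund policies?",
            "response": "We offer a 30-day refund policy...",
            "score": 1,
            "reasoning": "Accurate and complete.",
        },
    ],
}

def render_system_prompt(config: dict) -> str:
    # Substitute the template variable with the criteria text.
    return config["system_prompt"].replace(
        "{evaluation_criteria}", config["evaluation_criteria"]
    )

print(render_system_prompt(judge_config))
```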
The "Run Evaluation" bar shows the current model and record count; click it to start the run.
The default results tab shows every record with both the human label and the LLM label side by side, so you can see exactly where they agree and disagree.
| ID | Human | LLM | Status |
|---|---|---|---|
| rec-001... | ✓ Accept | ✓ Accept | Aligned |
| rec-002... | ✗ Reject | ✓ Accept | Misaligned |
| rec-003... | ✗ Reject | ✗ Reject | Aligned |
| rec-007... | ✗ Reject | ✓ Accept | Misaligned |
| rec-012... | ✓ Accept | ✓ Accept | Aligned |
The Metrics tab provides a comprehensive statistical view of how well your LLM judge aligns with human judgment.
|  | LLM: Accept | LLM: Reject |
|---|---|---|
| Human: Accept | 13 | 3 |
| Human: Reject | 2 | 7 |
The LLM judge tends to be stricter than human reviewers. Consider relaxing the evaluation criteria. Overall accuracy: 80.0%.
| Metric | Formula | What It Tells You |
|---|---|---|
| Cohen's Kappa | `(P_o - P_e) / (1 - P_e)` | Agreement beyond chance. Values: <0.2 slight, 0.2–0.4 fair, 0.4–0.6 moderate, 0.6–0.8 substantial, >0.8 almost perfect. |
| F1 Score | `2 * (P * R) / (P + R)` | Harmonic mean of precision and recall. A balanced measure of overall accuracy. |
| Precision | `TP / (TP + FP)` | Of all LLM accepts, how many were actually good? Low precision = judge is too lenient. |
| Recall | `TP / (TP + FN)` | Of all human accepts, how many did the LLM find? Low recall = judge is too strict. |
| Specificity | `TN / (TN + FP)` | Of all human rejects, how many did the LLM correctly reject? |
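The formulas can be checked against the example confusion matrix shown earlier (TP=13, FN=3, FP=2, TN=7, with "Accept" as the positive class):

```python
# Recompute the metrics from the example confusion matrix above.
TP, FN, FP, TN = 13, 3, 2, 7
n = TP + FN + FP + TN

accuracy = (TP + TN) / n
precision = TP / (TP + FP)
recall = TP / (TP + FN)
specificity = TN / (TN + FP)
f1 = 2 * precision * recall / (precision + recall)

# Cohen's Kappa: observed agreement vs. agreement expected by chance,
# where chance agreement comes from the row/column marginals.
p_o = accuracy
p_e = ((TP + FN) * (TP + FP) + (FP + TN) * (FN + TN)) / n**2
kappa = (p_o - p_e) / (1 - p_e)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"specificity={specificity:.3f} f1={f1:.3f} kappa={kappa:.3f}")

# FN > FP here, so the judge rejects more than humans do: it is too strict.
leaning = "too strict" if FN > FP else "too lenient" if FP > FN else "balanced"
```

The computed accuracy (80.0%) matches the interpretation text on this example, and Kappa ≈ 0.58 lands in the "moderate" band of the table above.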
The 2x2 confusion matrix visualizes the four possible outcomes of each comparison.
The interpretation text below the matrix automatically tells you whether your judge tends to be too lenient (FP > FN) or too strict (FN > FP), guiding your next prompt iteration.
The Insights tab provides AI-powered analysis of misalignment patterns and automatic prompt optimization to improve judge accuracy.
The LLM judge shows a tendency toward false positives, accepting responses that humans reject. This is primarily driven by incomplete coverage checks — the judge fails to verify that all aspects of a question are addressed.
Estimated 10–15% improvement in alignment by addressing completeness verification and adding stricter sub-question coverage requirements.
CaliberHQ is designed for iterative improvement. Here is the recommended calibration loop:
1. **Baseline.** Upload your data, annotate a representative sample (at least 20–30 records), and run the first evaluation with default prompts. Record the baseline Kappa.
2. **Discover patterns.** Use Pattern Insights to identify systematic failure modes. Convert the most impactful patterns into evaluation criteria using the "Use as criteria" action.
3. **Add few-shot examples.** For tricky edge cases, add few-shot examples showing the expected judgment. Prioritize examples from the misaligned set.
4. **Re-evaluate.** Re-run the evaluation after each change. Track how Kappa, F1, and the confusion matrix evolve. Use the Insights tab for AI-driven prompt optimization when you plateau.
5. **Export.** Once satisfied with alignment, export the calibrated judge configuration (system prompt + evaluation criteria + few-shot examples) for use in your production evaluation pipeline.
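A minimal sketch of what an exported judge configuration could look like, assuming a JSON serialization; the structure and field names are illustrative, not CaliberHQ's export format.

```python
import json

# Hypothetical exported configuration: system prompt + criteria + few-shots.
calibrated_judge = {
    "model": "gpt-4o",
    "system_prompt": "You are an evaluation judge...\n{evaluation_criteria}",
    "evaluation_criteria": (
        "Accept only responses that address every sub-question.\n"
        "Reject responses that mention features that don't exist."
    ),
    "few_shot_examples": [
        {"query": "...", "response": "...", "score": 0,
         "reasoning": "Covers only one of the two options asked about."},
    ],
}

# Serialize for storage (write the string to a file in a real pipeline),
# then reload to confirm the round trip is lossless.
serialized = json.dumps(calibrated_judge, indent=2)
reloaded = json.loads(serialized)
print(reloaded["model"])
```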
CaliberHQ can be fine-tuned through the Configure modal (gear icon in the header) and the Build Eval step panels.
Accessible from the gear icon on the annotation card. Lets you control:
- The ID column (`id`)
- The content columns (`query`, `actual_output`)

CaliberHQ persists the following to localStorage so you don't lose work between sessions: