Issue Extraction & Analysis¶
Extract and analyze low-score signals from evaluation results for debugging, reporting, and LLM-powered issue summarization.
Overview¶
After running evaluations, you often need to understand why certain test cases failed. The IssueExtractor automatically identifies failing signals across all metrics.
Unified Extraction
Works with any metric, built-in or custom.
Grouped Analysis
Similar issues consolidated for pattern detection.
LLM-Ready Prompts
Generate summaries with optional AI analysis.
Extensible Adapters
Register custom signal detection for your metrics.
Quick Start¶
from axion.reporting import IssueExtractor
from axion.runners import evaluation_runner
# Run your evaluation
results = await evaluation_runner(dataset, metrics)
# Extract issues (signals with score <= 0)
extractor = IssueExtractor()
issues = extractor.extract_from_evaluation(results)
# View summary
print(f"Found {issues.issues_found} issues across {issues.total_test_cases} test cases")
# Generate LLM prompt for analysis
prompt = extractor.to_prompt_text(issues)
Basic Usage¶
Extract Issues from Evaluation Results¶
from axion.reporting import IssueExtractor
# Default: extract signals with score <= 0.0
extractor = IssueExtractor()
issues = extractor.extract_from_evaluation(evaluation_result)
# Access extracted issues
for issue in issues.all_issues:
    print(f"Test: {issue.test_case_id}")
    print(f"Metric: {issue.metric_name}")
    print(f"Signal: {issue.signal_name} = {issue.value}")
    print(f"Score: {issue.score}")
    print(f"Reasoning: {issue.reasoning}")
    print("---")
Customize Extraction¶
extractor = IssueExtractor(
    # Extract signals with score <= 0.5 (not just 0.0)
    score_threshold=0.5,
    # Include NaN scores as issues
    include_nan=True,
    # Only extract from specific metrics
    metric_filters=['Faithfulness', 'Answer Relevancy'],
    # Limit total issues extracted
    max_issues=100,
    # Sample 10% of test cases (deterministic by ID)
    sample_rate=0.1,
    # Context fields to include from test cases
    include_context_fields=['query', 'actual_output', 'expected_output', 'retrieved_content']
)
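The docs only say that sampling is deterministic by test-case ID, so reruns select the same subset. One plausible sketch of such a scheme (the MD5-based bucketing here is an assumption for illustration, not Axion's actual implementation):

```python
import hashlib

def keep_test_case(test_case_id: str, sample_rate: float) -> bool:
    """Deterministic ID-based sampling: a given ID is always kept or
    always dropped for a given rate, so reruns see the same subset."""
    digest = hashlib.md5(test_case_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map the ID into [0, 1]
    return bucket < sample_rate

# The same ID always yields the same decision for the same rate
assert keep_test_case('test_case_42', 0.5) == keep_test_case('test_case_42', 0.5)
```

The payoff is reproducibility: unlike random.random() sampling, a hash of the ID gives the same 10% slice on every run.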
Access Issues by Category¶
issues = extractor.extract_from_evaluation(results)
# Issues grouped by metric
for metric_name, metric_issues in issues.issues_by_metric.items():
    print(f"{metric_name}: {len(metric_issues)} issues")
# Issues grouped by type (metric:signal combination)
for issue_type, type_issues in issues.issues_by_type.items():
    print(f"{issue_type}: {len(type_issues)} issues")
Generating LLM Prompts¶
Basic Prompt Generation¶
# Generate a detailed prompt listing all issues
prompt = extractor.to_prompt_text(issues, max_issues=50)
print(prompt)
Output:
## Evaluation Issues Summary
**Evaluation:** RAG Quality Assessment
**Test Cases Analyzed:** 100
**Issues Found:** 23
### Issue Breakdown by Metric
- Faithfulness: 12 issues (8 CONTRADICTORY, 4 NO_EVIDENCE)
- Answer Relevancy: 7 issues (7 False)
- Answer Criteria: 4 issues (4 False)
### Detailed Issues
#### Issue 1: Faithfulness - faithfulness_verdict
- **Test Case:** test_case_42
- **Signal Group:** claim_0
- **Value:** CONTRADICTORY
- **Score:** 0.0
- **Reasoning:** "Context states Python 3.8+ required, contradicting the claim about 3.6 support"
- **Query:** "What Python versions are supported?"
- **Actual Output:** "Our product supports Python 3.6 and above..."
...
## Task
Analyze the quality issues found in this evaluation. Provide:
1. **Critical Failure Patterns:** What are the most common/severe issue types?
2. **Root Cause Analysis:** What systemic problems might be causing these failures?
3. **Recommended Improvements:** Specific actions to improve quality
4. **Priority Ranking:** Which issues should be addressed first?
Grouped Prompts (Token-Efficient)¶
For large evaluations, use grouped prompts to reduce token usage:
# Group similar issues together with representative examples
grouped_prompt = extractor.to_grouped_prompt_text(
    issues,
    max_groups=20,  # Limit number of issue groups
    max_examples_per_group=2  # Examples per group
)
Output:
## Evaluation Issues Summary (Grouped)
**Total Issues Found:** 156
**Issue Groups:** 8
### Issue Groups Overview
| Metric | Signal | Count | Values |
|--------|--------|-------|--------|
| Faithfulness | faithfulness_verdict | 45 | CONTRADICTORY, NO_EVIDENCE |
| Answer Criteria | is_covered | 38 | False |
| Answer Relevancy | is_relevant | 31 | False |
### Detailed Issue Groups
#### Group 1: Faithfulness - faithfulness_verdict
- **Total Issues:** 45
- **Failure Values:** CONTRADICTORY, NO_EVIDENCE
- **Affected Tests:** test_12, test_45, test_67, ... (+42 more)
**Representative Examples:**
*Example 1:*
- Test: test_12
- Value: CONTRADICTORY
- Reasoning: "Claim contradicts source documentation"
- Query: "What are the system requirements?"
*Example 2:*
- Test: test_45
- Value: NO_EVIDENCE
- Reasoning: "No supporting evidence found in context"
- Query: "Does it support Windows?"
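Conceptually, grouping buckets issues by their (metric, signal) pair and keeps only a few representative examples per bucket, which is why the prompt stays small even for hundreds of issues. A minimal standalone sketch of that idea, using plain dicts rather than Axion's issue objects:

```python
from collections import defaultdict

def group_issues(issues, max_examples_per_group=2):
    """Bucket issues by (metric, signal); track count, distinct failure
    values, and a capped list of representative examples per bucket."""
    groups = defaultdict(lambda: {'count': 0, 'values': set(), 'examples': []})
    for issue in issues:
        group = groups[(issue['metric'], issue['signal'])]
        group['count'] += 1
        group['values'].add(issue['value'])
        if len(group['examples']) < max_examples_per_group:
            group['examples'].append(issue)
    return dict(groups)

issues = [
    {'metric': 'Faithfulness', 'signal': 'verdict', 'value': 'CONTRADICTORY'},
    {'metric': 'Faithfulness', 'signal': 'verdict', 'value': 'NO_EVIDENCE'},
    {'metric': 'Faithfulness', 'signal': 'verdict', 'value': 'CONTRADICTORY'},
    {'metric': 'Answer Relevancy', 'signal': 'is_relevant', 'value': False},
]
groups = group_issues(issues)
print(groups[('Faithfulness', 'verdict')]['count'])  # 3
```

Token cost now scales with the number of groups (and examples per group), not with the total issue count.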
LLM-Powered Group Summaries¶
Add AI-generated pattern summaries for each issue group:
from axion.llm_registry import LLMRegistry
# Get an LLM instance
reg = LLMRegistry('anthropic')
llm = reg.get_llm()
# Generate grouped prompt with LLM summaries (async)
grouped_prompt = await extractor.to_grouped_prompt_text_async(
    issues,
    llm=llm,  # LLM generates 1-2 sentence summary per group
    max_groups=15
)
Output includes AI-generated pattern analysis:
#### Group 1: Faithfulness - faithfulness_verdict
- **Total Issues:** 45
- **Pattern Summary:** Claims about version compatibility and system requirements
consistently contradict the official documentation, suggesting outdated training data
or hallucination patterns around technical specifications.
Full LLM Analysis with summarize()¶
For a complete end-to-end analysis (the equivalent of copy-pasting the issues into ChatGPT or Gemini, done for you), use the summarize() method:
from axion.llm_registry import LLMRegistry
from axion.reporting import IssueExtractor
# Extract issues
extractor = IssueExtractor()
issues = extractor.extract_from_evaluation(results)
# Get LLM
reg = LLMRegistry('anthropic')
llm = reg.get_llm('claude-sonnet-4-20250514')
# Generate complete analysis
summary = await extractor.summarize(issues, llm=llm)
print(summary.text)
The default prompt generates:
- Executive Summary - 2-3 sentence overview
- Missing Concepts - Topics the AI consistently missed
- Failure Categories Table - Structured breakdown with counts and examples
- Root Cause Analysis - Systemic issues causing failures
- Recommended Actions - Prioritized improvements
Custom Prompts¶
Override the default prompt with your own template:
custom_prompt = '''
Analyze these evaluation failures:
{overview}
{issue_data}
Provide:
1. Top 3 failure patterns
2. Quick wins to fix them
3. A table with Category, Count, and Example
'''
summary = await extractor.summarize(
    issues,
    llm=llm,
    prompt_template=custom_prompt,
    max_issues=50  # Limit issues in prompt
)
The template must include {overview} and {issue_data} placeholders.
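Substitution presumably follows Python's standard str.format semantics, so it is worth validating a template before an expensive LLM call. A quick standalone check (the filler values below are illustrative only):

```python
custom_prompt = '''
Analyze these evaluation failures:
{overview}
{issue_data}
Provide:
1. Top 3 failure patterns
'''

# Guard: a template missing either placeholder would drop that data entirely
assert '{overview}' in custom_prompt and '{issue_data}' in custom_prompt

# Sketch of how the extractor would fill the template (values are made up)
rendered = custom_prompt.format(
    overview='23 issues across 100 test cases',
    issue_data='- Faithfulness: 12 issues\n- Answer Relevancy: 7 issues',
)
print('23 issues' in rendered)  # True
```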
Sync Version¶
For non-async code:
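One option from ordinary synchronous code is to drive the async summarize() with asyncio.run. A self-contained sketch; the summarize coroutine below is a stand-in stub with the same call shape, not Axion's implementation:

```python
import asyncio

# Stub standing in for extractor.summarize(issues, llm=llm)
async def summarize(issues, llm=None):
    return f"{len(issues)} issues analyzed"

# In real code: summary = asyncio.run(extractor.summarize(issues, llm=llm))
summary = asyncio.run(summarize(['issue_1', 'issue_2']))
print(summary)  # 2 issues analyzed
```

Note that asyncio.run cannot be called from inside an already-running event loop (e.g. a Jupyter cell); there, await the coroutine directly.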
Structured Output for Programmatic Use¶
# Get structured data instead of text
llm_input = extractor.to_llm_input(issues, max_issues=50)
print(llm_input.evaluation_name) # "RAG Quality Assessment"
print(llm_input.total_test_cases) # 100
print(llm_input.issues_found) # 23
print(llm_input.issues_by_metric) # {'Faithfulness': 12, ...}
# Access detailed issue dicts
for issue_dict in llm_input.detailed_issues:
    print(issue_dict['metric'])
    print(issue_dict['signal_name'])
    print(issue_dict['value'])
    print(issue_dict['context']['query'])
Signal Adapter Registry¶
The SignalAdapterRegistry defines how to extract issues from each metric's signals. Axion includes adapters for all built-in metrics, but you can register custom adapters for your own metrics.
How Adapters Work¶
Each adapter specifies:
| Field | Description | Example |
|---|---|---|
| headline_signals | Signals that indicate pass/fail | ['is_relevant', 'verdict'] |
| issue_values | Values that indicate failures | {'is_relevant': [False], 'verdict': ['no']} |
| context_signals | Related signals for context | ['statement', 'reason', 'turn_index'] |
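In essence, value-based detection is a membership check of each headline signal's value against the adapter's issue_values. A simplified standalone sketch, with dict-shaped signals standing in for Axion's signal objects:

```python
def detect_issues(signals, headline_signals, issue_values):
    """Flag each headline signal whose value appears in the adapter's
    configured failure values for that signal."""
    flagged = []
    for name in headline_signals:
        signal = signals.get(name)
        if signal is not None and signal['value'] in issue_values.get(name, []):
            flagged.append(name)
    return flagged

signals = {'is_relevant': {'value': False}, 'reason': {'value': 'off-topic'}}
print(detect_issues(signals, ['is_relevant'], {'is_relevant': [False]}))  # ['is_relevant']
```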
Built-in Adapters¶
from axion.reporting import SignalAdapterRegistry
# List all registered adapters
print(SignalAdapterRegistry.list_adapters())
# ['faithfulness', 'answer_criteria', 'answer_relevancy', 'answer_completeness',
# 'contextual_relevancy', 'contextual_recall', 'contextual_precision',
# 'factual_accuracy', 'pii_leakage', 'tool_correctness', ...]
Register a Custom Adapter¶
Best Practice: Define the adapter in the same file as your custom metric. This keeps the signal schema and adapter in sync.
# my_metrics/quality_checker.py
from axion.metrics import BaseMetric, MetricScore
from axion.reporting import SignalAdapterRegistry, MetricSignalAdapter
# 1. Define your metric
class QualityChecker(BaseMetric):
    name = "Quality Checker"

    async def a_score(self, item):
        # Your scoring logic...
        return MetricScore(
            name=self.name,
            score=score,
            signals={
                'quality_verdict': {'value': verdict, 'score': 1.0 if verdict == 'PASS' else 0.0},
                'issues_found': {'value': issues, 'score': 1.0},
                'reasoning': {'value': reason},
            }
        )
# 2. Register adapter alongside the metric
@SignalAdapterRegistry.register('quality_checker')
def _quality_checker_adapter():
    return MetricSignalAdapter(
        metric_key='quality_checker',
        headline_signals=['quality_verdict'],
        issue_values={'quality_verdict': ['FAIL', 'PARTIAL']},
        context_signals=['issues_found', 'reasoning']
    )
The adapter registers automatically when your metric module is imported.
Alternative: Direct Registration¶
For quick registration without a decorator:
SignalAdapterRegistry.register_adapter(
    'another_metric',
    MetricSignalAdapter(
        metric_key='another_metric',
        headline_signals=['is_valid'],
        issue_values={'is_valid': [False, 'INVALID', 'ERROR']},
        context_signals=['validation_errors', 'field_name']
    )
)
What If No Adapter Exists?¶
Unregistered metrics still work - the extractor falls back to score-based extraction:
# Your custom metric without a registered adapter
# Signals with score <= threshold are automatically detected as issues
| Feature | Without Adapter | With Adapter |
|---|---|---|
| Score-based detection | Yes | Yes |
| Value-based detection | No | Yes |
| Explicit headline signals | No | Yes |
| Context signal extraction | Basic | Full |
Register an adapter when you want richer issue detection beyond just scores.
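The fallback can be pictured as a simple threshold scan over signal scores, honoring the extractor's score_threshold and include_nan settings. A standalone sketch (the dict-shaped signals are illustrative, not Axion's internal types):

```python
import math

def fallback_issues(signals, score_threshold=0.0, include_nan=False):
    """Without an adapter: any signal whose score is <= threshold is an
    issue; NaN scores count only when include_nan is set."""
    issues = []
    for name, signal in signals.items():
        score = signal.get('score')
        if score is None:
            continue  # unscored signals (e.g. pure context) are skipped
        if math.isnan(score):
            if include_nan:
                issues.append(name)
        elif score <= score_threshold:
            issues.append(name)
    return issues

signals = {
    'verdict': {'value': 'FAIL', 'score': 0.0},
    'coverage': {'value': 0.9, 'score': 0.9},
    'parse': {'value': None, 'score': float('nan')},
}
print(fallback_issues(signals))                    # ['verdict']
print(fallback_issues(signals, include_nan=True))  # ['verdict', 'parse']
```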
Complete Example¶
import asyncio
from axion import Dataset
from axion.metrics import Faithfulness, AnswerRelevancy, AnswerCriteria
from axion.runners import evaluation_runner
from axion.reporting import IssueExtractor, SignalAdapterRegistry, MetricSignalAdapter
from axion.llm_registry import LLMRegistry
async def analyze_evaluation():
    # 1. Run evaluation
    dataset = Dataset.from_csv("test_cases.csv")
    results = await evaluation_runner(
        dataset=dataset,
        metrics=[Faithfulness(), AnswerRelevancy(), AnswerCriteria()]
    )

    # 2. Extract issues
    extractor = IssueExtractor(
        score_threshold=0.0,
        include_context_fields=['query', 'actual_output', 'expected_output']
    )
    issues = extractor.extract_from_evaluation(results)

    # 3. Quick summary
    print(f"Evaluation: {issues.evaluation_name}")
    print(f"Test cases: {issues.total_test_cases}")
    print(f"Issues found: {issues.issues_found}")
    print()
    for metric, metric_issues in issues.issues_by_metric.items():
        print(f" {metric}: {len(metric_issues)} issues")

    # 4. Generate LLM analysis prompt
    if issues.issues_found > 0:
        # Use grouped prompt for efficiency
        reg = LLMRegistry('anthropic')
        llm = reg.get_llm('claude-sonnet-4-20250514')
        prompt = await extractor.to_grouped_prompt_text_async(
            issues,
            llm=llm,
            max_groups=10
        )

        # Send to LLM for analysis
        analysis = await llm.acomplete(prompt)
        print("\n=== LLM Analysis ===")
        print(analysis.text)

asyncio.run(analyze_evaluation())
API Reference¶
IssueExtractor¶
class IssueExtractor:
    def __init__(
        self,
        score_threshold: float = 0.0,  # Signals <= this are issues
        include_nan: bool = False,  # Include NaN scores
        include_context_fields: List[str] = ['query', 'actual_output', 'expected_output'],
        metric_filters: List[str] = None,  # Only these metrics
        max_issues: int = None,  # Hard limit
        sample_rate: float = None,  # 0.0-1.0 sampling
    ): ...

    def extract_from_evaluation(self, result: EvaluationResult) -> IssueExtractionResult: ...
    def extract_from_test_result(self, test_result: TestResult, index: int) -> List[ExtractedIssue]: ...
    def extract_from_metric_score(self, metric_score: MetricScore, ...) -> List[ExtractedIssue]: ...
    def to_llm_input(self, result: IssueExtractionResult, max_issues: int = 50) -> LLMSummaryInput: ...
    def to_prompt_text(self, result: IssueExtractionResult, max_issues: int = 50) -> str: ...
    def to_grouped_prompt_text(self, result: IssueExtractionResult, llm=None, max_groups=20, max_examples_per_group=2) -> str: ...
    async def to_grouped_prompt_text_async(self, result: IssueExtractionResult, llm=None, ...) -> str: ...
SignalAdapterRegistry¶
class SignalAdapterRegistry:
    @classmethod
    def register(cls, metric_key: str) -> decorator: ...  # Decorator for registration
    @classmethod
    def register_adapter(cls, metric_key: str, adapter: MetricSignalAdapter) -> None: ...
    @classmethod
    def get(cls, metric_name: str) -> Optional[MetricSignalAdapter]: ...
    @classmethod
    def list_adapters(cls) -> List[str]: ...
Data Classes¶
@dataclass
class ExtractedIssue:
    test_case_id: str
    metric_name: str
    signal_group: str
    signal_name: str
    value: Any
    score: float
    description: Optional[str]
    reasoning: Optional[str]
    item_context: Dict[str, Any]
    source_path: str
    raw_signal: Dict[str, Any]

@dataclass
class IssueExtractionResult:
    run_id: str
    evaluation_name: Optional[str]
    total_test_cases: int
    total_signals_analyzed: int
    issues_found: int
    issues_by_metric: Dict[str, List[ExtractedIssue]]
    issues_by_type: Dict[str, List[ExtractedIssue]]
    all_issues: List[ExtractedIssue]

@dataclass
class MetricSignalAdapter:
    metric_key: str
    headline_signals: List[str]
    issue_values: Dict[str, List[Any]]
    context_signals: List[str]
Next Step: Cross-Metric Insights¶
The IssueExtractor groups issues by metric + signal name. To discover patterns that span multiple metrics — for example, faithfulness and contextual recall failures both pointing to a retrieval problem — use the InsightExtractor:
from axion.reporting import IssueExtractor, InsightExtractor
# Extract issues as usual
extractor = IssueExtractor()
issues = extractor.extract_from_evaluation(results)
# Discover cross-metric patterns
insight_extractor = InsightExtractor(model_name='gpt-4o-mini')
insights = await insight_extractor.analyze(issues)
for pattern in insights.patterns:
    if pattern.is_cross_metric:
        print(f"{pattern.category}: {', '.join(pattern.metrics_involved)}")
Cross-Metric Insight Extraction Guide
Best Practices¶
Start with Defaults
IssueExtractor() works well for most cases out of the box.
Grouped Prompts
Use grouped prompts for large evals — reduces tokens by 50–90%.
Register Adapters
Custom adapters enable value-based detection for your metrics.
Sample Large Datasets
Use sample_rate to manage volume on very large evaluations.