Cross-Metric Insight Extraction

Discover cross-metric patterns in evaluation issues using semantic clustering. The InsightExtractor bridges the IssueExtractor and the EvidencePipeline to surface insights that span multiple metrics.

Overview

The IssueExtractor groups issues mechanically by metric and signal name. Real problems, however, often cut across metrics: faithfulness and contextual recall failures, for example, can both point to the same retrieval problem. The InsightExtractor uses semantic clustering to find these cross-metric patterns automatically.

Cross-Metric Patterns: Detects issues that span multiple metrics pointing to a shared root cause.

Actionable Learnings: Distills clusters into titled insights with recommended actions and confidence scores.

Composable & Opt-In: Purely additive, with no changes to IssueExtractor or EvidencePipeline.

Pluggable Pipeline: Full control via pipeline overrides, sinks, dedupers, and clustering methods.

Quick Start

from axion.reporting import IssueExtractor, InsightExtractor
from axion.runners import evaluation_runner

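# 1. Run your evaluation. `dataset` and `metrics` are assumed to be defined,
#    and the top-level `await` needs an async context such as a notebook.
results = await evaluation_runner(dataset, metrics)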

# 2. Extract issues
issue_extractor = IssueExtractor()
issues = issue_extractor.extract_from_evaluation(results)

# 3. Discover cross-metric patterns
insight_extractor = InsightExtractor(model_name='gpt-4o-mini')
insights = await insight_extractor.analyze(issues)

# 4. Explore results
for pattern in insights.patterns:
    label = "CROSS-METRIC" if pattern.is_cross_metric else "single-metric"
    print(f"[{label}] {pattern.category} ({pattern.count} issues)")
    print(f"  Metrics: {', '.join(pattern.metrics_involved)}")
    print(f"  Test cases: {pattern.distinct_test_cases}")

for learning in insights.learnings:
    print(f"\n{learning.title} (confidence={learning.confidence})")
    for action in learning.recommended_actions:
        print(f"  - {action}")

How It Works

The InsightExtractor performs three steps:

graph LR
    I["IssueExtractionResult"] --> A["Convert to EvidenceItems"]
    A --> P["EvidencePipeline.run()"]
    P --> R["Build InsightPatterns"]

1. Convert: Each ExtractedIssue is mapped to an EvidenceItem with reasoning-first text and structured metadata. Issues with no reasoning and no query are skipped.
2. Cluster & Distill: The EvidencePipeline clusters evidence semantically and distills each cluster into learning artifacts with recommended actions.
3. Enrich: Each cluster is enriched with cross-metric metadata (which metrics are involved, whether it spans multiple metrics, and how many distinct test cases it covers).

Issue-to-Evidence Mapping

Each issue is converted to an EvidenceItem for clustering:

Issue Field	Maps To	Purpose
`reasoning` + `signal cue` + `query`	`EvidenceItem.text`	Clustering input
`test_case_id`	`source_ref`	Recurrence grouping
`metric_name`, `signal_name`, `value`, `score`	`metadata`	Cross-metric analysis
SHA-256 of key fields	`id`	Stable, deterministic IDs

The text is structured reasoning-first for better clustering:

The claim contradicts the provided context about system requirements.
[faithfulness / faithfulness_verdict: CONTRADICTORY]
Query: What Python versions are supported?

Issues with no reasoning and no query produce no meaningful text and are automatically filtered out.
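The mapping above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the `Issue` and `Evidence` dataclasses and the `to_evidence` helper are hypothetical stand-ins, and the choice of fields fed into the SHA-256 hash is an assumption, since the source only says "key fields".

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Issue:
    """Hypothetical stand-in for an extracted issue (not the library's class)."""
    test_case_id: str
    metric_name: str
    signal_name: str
    value: str
    score: float
    reasoning: str = ""
    query: str = ""


@dataclass
class Evidence:
    """Hypothetical stand-in for an evidence item."""
    id: str
    text: str
    source_ref: str
    metadata: dict = field(default_factory=dict)


def to_evidence(issue: Issue) -> Optional[Evidence]:
    reasoning = issue.reasoning.strip()
    query = issue.query.strip()

    # Issues with neither reasoning nor a query carry no meaningful text.
    if not reasoning and not query:
        return None

    # Reasoning first, then the signal cue, then the query.
    lines = []
    if reasoning:
        lines.append(reasoning)
    lines.append(f"[{issue.metric_name} / {issue.signal_name}: {issue.value}]")
    if query:
        lines.append(f"Query: {query}")

    # Assumed key fields for the hash; the real selection may differ.
    key = "|".join(
        [issue.test_case_id, issue.metric_name, issue.signal_name, str(issue.value)]
    )
    return Evidence(
        id=hashlib.sha256(key.encode("utf-8")).hexdigest(),
        text="\n".join(lines),
        source_ref=issue.test_case_id,
        metadata={
            "metric_name": issue.metric_name,
            "signal_name": issue.signal_name,
            "value": issue.value,
            "score": issue.score,
        },
    )
```

Because the ID depends only on the hashed fields, converting the same issue twice yields the same ID, which is what makes the IDs stable across runs.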

Configuration

Basic Options

from axion.caliber.pattern_discovery.models import ClusteringMethod
from axion.reporting import InsightExtractor

insight_extractor = InsightExtractor(
    model_name='gpt-4o-mini',       # LLM for clustering + distillation
    method=ClusteringMethod.LLM,     # LLM, BERTOPIC, or HYBRID
    recurrence_threshold=2,          # min unique test cases per learning
    max_items=50,                    # max evidence items to process
    min_category_size=2,             # min items per cluster
)
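To make the two size thresholds concrete, here is a small sketch of how such gates could be applied. It is an assumption-laden illustration: `Item` and `filter_clusters` are hypothetical names, and for simplicity both gates are applied per cluster, whereas the library applies `recurrence_threshold` per learning.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Item:
    """Hypothetical evidence item: text plus the test case it came from."""
    text: str
    test_case_id: str


def filter_clusters(
    clusters: Dict[str, List[Item]],
    min_category_size: int = 2,
    recurrence_threshold: int = 2,
) -> Dict[str, List[Item]]:
    kept: Dict[str, List[Item]] = {}
    for label, items in clusters.items():
        # Gate 1: the cluster must contain enough items.
        if len(items) < min_category_size:
            continue
        # Gate 2: the items must come from enough distinct test cases.
        distinct_cases = {item.test_case_id for item in items}
        if len(distinct_cases) < recurrence_threshold:
            continue
        kept[label] = items
    return kept
```

Under these assumptions, a cluster with many items that all come from a single test case is dropped by the second gate, which keeps one noisy test case from producing a learning on its own.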

Clustering Methods

Method	Best for	LLM calls
`ClusteringMethod.LLM`	Small-medium issue sets, best label quality	Yes
`ClusteringMethod.BERTOPIC`	Large issue sets, cost-sensitive	No (clustering only)
`ClusteringMethod.HYBRID`	Large sets where you want readable labels	Partial

from axion.caliber.pattern_discovery.models import ClusteringMethod
from axion.reporting import InsightExtractor

# Cost-effective for large evaluations
extractor = InsightExtractor(
    model_name='gpt-4o-mini',
    method=ClusteringMethod.BERTOPIC,
)
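The guidance in the table can be turned into a simple selection rule. The sketch below is only one possible heuristic: `Method` is a hypothetical stand-in for `ClusteringMethod`, `choose_method` is not part of the library, and the cutoff of 200 issues is an arbitrary placeholder that you would tune to your own cost and quality needs.

```python
from enum import Enum


class Method(Enum):
    """Hypothetical stand-in mirroring the three documented methods."""
    LLM = "llm"
    BERTOPIC = "bertopic"
    HYBRID = "hybrid"


def choose_method(
    n_issues: int,
    want_readable_labels: bool,
    large_set_cutoff: int = 200,  # arbitrary placeholder, not a library default
) -> Method:
    # Small to medium sets: LLM clustering gives the best label quality.
    if n_issues < large_set_cutoff:
        return Method.LLM
    # Large sets: hybrid when readable labels matter, otherwise the cheapest option.
    return Method.HYBRID if want_readable_labels else Method.BERTOPIC
```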

Custom Pipeline

For full control, pass a pre-configured EvidencePipeline:

from axion.caliber.pattern_discovery import EvidencePipeline, InMemorySink
from axion.reporting import InsightExtractor

sink = InMemorySink()
pipeline = EvidencePipeline(
    model_name='gpt-4o-mini',
    recurrence_threshold=2,
    min_category_size=2,
    domain_context='RAG system evaluation',
    sink=sink,
)

extractor = InsightExtractor(pipeline=pipeline)
insights = await extractor.analyze(issues)

# Access persisted artifacts
for sid, entry in sink.artifacts.items():
    print(entry['artifact'].title)

No mixing

Passing both pipeline= and additional keyword arguments raises a ValueError. Use one or the other.

Forwarding Pipeline Options

When you do not pass pipeline=, any extra keyword arguments are forwarded to EvidencePipeline:

extractor = InsightExtractor(
    model_name='gpt-4o-mini',
    seed=42,                          # reproducible sampling
    max_concurrent_distillations=3,   # limit parallel LLM calls
)
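The two configuration styles are mutually exclusive: either you hand over a ready pipeline, or you pass options that get forwarded. The sketch below shows how a constructor could enforce that rule. `FakePipeline` and `FakeExtractor` are hypothetical stand-ins written for illustration; the real classes may validate their arguments differently.

```python
from typing import Any, Optional


class FakePipeline:
    """Hypothetical stand-in for a configured pipeline."""

    def __init__(self, **options: Any) -> None:
        self.options = options


class FakeExtractor:
    """Hypothetical stand-in showing the either/or constructor rule."""

    def __init__(self, pipeline: Optional[FakePipeline] = None, **kwargs: Any) -> None:
        # Mixing a pre-built pipeline with extra options is ambiguous, so reject it.
        if pipeline is not None and kwargs:
            raise ValueError(
                "Pass either pipeline= or keyword options, not both: "
                + ", ".join(sorted(kwargs))
            )
        # Without a pipeline, forward every keyword option to a new one.
        self.pipeline = pipeline if pipeline is not None else FakePipeline(**kwargs)
```

Rejecting the mixed case up front avoids silently ignoring options that the caller expected to take effect.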

Working with Results

InsightResult

insights = await extractor.analyze(issues)

insights.patterns              # List[InsightPattern] — discovered clusters
insights.learnings             # List[LearningArtifact] — distilled insights
insights.total_issues_analyzed # how many issues had meaningful text
insights.clustering_method     # the ClusteringMethod enum used
insights.pipeline_result       # full PipelineResult for advanced use

InsightPattern

Each pattern represents a discovered cluster:

for pattern in insights.patterns:
    print(pattern.category)          # "Retrieval Coverage Gaps"
    print(pattern.description)       # "Issues where context doesn't cover..."
    print(pattern.count)             # 8
    print(pattern.metrics_involved)  # ['faithfulness', 'contextual_recall']
    print(pattern.is_cross_metric)   # True
    print(pattern.distinct_test_cases)  # 5
    print(pattern.examples)          # excerpted evidence text
    print(pattern.confidence)        # 0.85

The is_cross_metric flag is True when a cluster contains issues from 2 or more different metrics — the key signal that a shared root cause exists.
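The cross-metric fields can be derived from the items in a cluster with a few set operations. The following sketch assumes each clustered item carries its metric name and test case ID; `ClusteredItem`, `PatternSummary`, and `summarize` are hypothetical names used for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusteredItem:
    """Hypothetical item in a cluster."""
    metric_name: str
    test_case_id: str


@dataclass
class PatternSummary:
    """Hypothetical summary mirroring the documented pattern fields."""
    count: int
    metrics_involved: List[str]
    is_cross_metric: bool
    distinct_test_cases: int


def summarize(items: List[ClusteredItem]) -> PatternSummary:
    # Unique metric names, sorted for stable output.
    metrics = sorted({item.metric_name for item in items})
    return PatternSummary(
        count=len(items),
        metrics_involved=metrics,
        # Cross-metric means at least two different metrics in one cluster.
        is_cross_metric=len(metrics) >= 2,
        distinct_test_cases=len({item.test_case_id for item in items}),
    )
```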

LearningArtifact

Each learning is a structured, actionable insight:

for learning in insights.learnings:
    print(learning.title)              # "Retrieval Coverage Gaps"
    print(learning.content)            # detailed explanation
    print(learning.confidence)         # 0.9
    print(learning.tags)               # ['retrieval', 'context_quality']
    print(learning.recommended_actions)  # ['Improve chunk overlap...']
    print(learning.supporting_item_ids)  # evidence IDs backing this
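Once you have a list of learnings, a common next step is to rank them and keep only the confident ones. The sketch below shows one way to do that. `Learning` is a hypothetical stand-in carrying three of the documented fields, and `render_report` with its 0.7 cutoff is an illustrative helper, not part of the library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Learning:
    """Hypothetical stand-in with a subset of the documented fields."""
    title: str
    confidence: float
    recommended_actions: List[str] = field(default_factory=list)


def render_report(learnings: List[Learning], min_confidence: float = 0.7) -> str:
    # Keep confident learnings and list the most confident first.
    kept = sorted(
        (item for item in learnings if item.confidence >= min_confidence),
        key=lambda item: item.confidence,
        reverse=True,
    )
    lines: List[str] = []
    for item in kept:
        lines.append(f"{item.title} (confidence={item.confidence:.2f})")
        lines.extend(f"  - {action}" for action in item.recommended_actions)
    return "\n".join(lines)
```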

Sync Usage

For non-async code:

insights = extractor.analyze_sync(issues)
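The source does not show how `analyze_sync` is implemented. A common pattern for such wrappers is to drive the async method with `asyncio.run`, as in the sketch below. `Analyzer` is a hypothetical class, and its return value is a placeholder rather than the library's result type.

```python
import asyncio
from typing import Any, Dict, List


class Analyzer:
    """Hypothetical class pairing an async method with a sync wrapper."""

    async def analyze(self, issues: List[Any]) -> Dict[str, int]:
        # Placeholder for real async work such as LLM calls.
        await asyncio.sleep(0)
        return {"total_issues_analyzed": len(issues)}

    def analyze_sync(self, issues: List[Any]) -> Dict[str, int]:
        # asyncio.run starts a fresh event loop and raises RuntimeError
        # if it is called while another loop is already running.
        return asyncio.run(self.analyze(issues))
```

If a wrapper is built this way, it cannot be called from code that is already inside an event loop, such as a notebook cell or an async function. In those contexts, awaiting the async method directly is the safer choice.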

Complete Example

import asyncio
from axion import Dataset
from axion.metrics import Faithfulness, AnswerRelevancy, ContextualRecall
from axion.runners import evaluation_runner
from axion.reporting import IssueExtractor, InsightExtractor

async def discover_patterns():
    # 1. Run evaluation
    dataset = Dataset.from_csv('test_cases.csv')
    results = await evaluation_runner(
        dataset=dataset,
        metrics=[Faithfulness(), AnswerRelevancy(), ContextualRecall()],
    )

    # 2. Extract issues
    issue_extractor = IssueExtractor(score_threshold=0.0)
    issues = issue_extractor.extract_from_evaluation(results)
    print(f'Found {issues.issues_found} issues across {issues.total_test_cases} test cases')

    if issues.issues_found == 0:
        print('No issues found.')
        return

    # 3. Discover cross-metric patterns
    insight_extractor = InsightExtractor(
        model_name='gpt-4o-mini',
        recurrence_threshold=2,
    )
    insights = await insight_extractor.analyze(issues)

    # 4. Report cross-metric patterns
    cross_metric = [p for p in insights.patterns if p.is_cross_metric]
    print(f'\n--- {len(cross_metric)} cross-metric patterns found ---')
    for pattern in cross_metric:
        print(f'\n{pattern.category} ({pattern.count} issues)')
        print(f'  Metrics: {", ".join(pattern.metrics_involved)}')
        print(f'  Distinct test cases: {pattern.distinct_test_cases}')

    # 5. Show actionable learnings
    print(f'\n--- {len(insights.learnings)} learnings ---')
    for learning in insights.learnings:
        print(f'\n{learning.title} (confidence={learning.confidence})')
        print(f'  {learning.content[:200]}')
        for action in learning.recommended_actions:
            print(f'  -> {action}')

asyncio.run(discover_patterns())

Relationship to Other Tools

Tool	What it does	When to use
`IssueExtractor`	Extracts failing signals by metric + signal	Always — it produces the input for `InsightExtractor`
`InsightExtractor`	Clusters issues semantically across metrics	When you want cross-metric root cause analysis
`EvidencePipeline`	General-purpose evidence clustering + distillation	When you have arbitrary text evidence (not evaluation issues)

The InsightExtractor is the bridge: it converts IssueExtractor output into EvidenceItem format and runs it through the EvidencePipeline.

API Reference

See the full Insight Extractor API Reference for class signatures and data class details.