Skip to content

Pattern Discovery

Pattern Discovery clusters free-text evidence into recurring themes and distills them into actionable learning artifacts. It works with any text source — annotation notes, bug reports, user feedback, support tickets — and provides a full pipeline from raw text to validated insights.

What You'll Learn

1

Evidence Clustering

Cluster any text into semantic themes using LLM, BERTopic, or hybrid methods.

2

Evidence Pipeline

Full pipeline: sanitize → cluster → distill → validate → dedupe → sink.

3

Learning Artifacts

Distill clusters into titled insights with recommended actions and confidence scores.

4

Quality & Safety

Validation gates, PII filtering, recurrence checks, and deduplication strategies.

Quick Start

Clustering Only

Cluster a list of EvidenceItem objects into semantic groups:

from axion.caliber import PatternDiscovery, EvidenceItem, ClusteringMethod

evidence = [
    EvidenceItem(id='bug_1', text='Checkout button unresponsive on mobile Safari'),
    EvidenceItem(id='bug_2', text='Payment form does not submit on iPhone'),
    EvidenceItem(id='bug_3', text='Cart total shows wrong currency symbol'),
    EvidenceItem(id='bug_4', text='Currency formatting broken for EUR prices'),
]

discovery = PatternDiscovery(model_name='gpt-4o-mini', min_category_size=2)
result = await discovery.discover_from_evidence(evidence)

for pattern in result.patterns:
    print(f'[{pattern.category}] ({pattern.count} items): {pattern.record_ids}')

Full Pipeline

Run the complete pipeline — clustering, distillation into learning artifacts, validation, and persistence:

from axion.caliber import EvidencePipeline, InMemorySink, MetadataConfig

sink = InMemorySink()

pipeline = EvidencePipeline(
    model_name='gpt-4o-mini',
    recurrence_threshold=2,
    min_category_size=2,
    domain_context='E-commerce platform bug triage',
    sink=sink,
)

result = await pipeline.run(evidence)

for learning in result.learnings:
    print(f'{learning.title} (confidence={learning.confidence})')
    for action in learning.recommended_actions:
        print(f'  - {action}')

Core Concepts

EvidenceItem

The universal input type. Any text source with an ID, the text itself, optional metadata, and an optional source reference for provenance tracking:

EvidenceItem(
    id='ticket_42',
    text='Search returns no results for valid product names',
    metadata={'severity': 'high', 'platform': 'web'},
    source_ref='jira_PROJ-42',  # for recurrence checking
)

Clustering Methods

Method How it works Best for
ClusteringMethod.LLM LLM reads all evidence and outputs semantic clusters Small-medium datasets, best label quality
ClusteringMethod.BERTOPIC BERTopic statistical topic modeling, no LLM calls Large datasets, cost-sensitive, 5+ documents
ClusteringMethod.HYBRID BERTopic clustering + LLM label refinement Large datasets where you want readable labels
# LLM-only (default)
result = await discovery.discover_from_evidence(evidence, method=ClusteringMethod.LLM)

# BERTopic (no LLM cost)
result = await discovery.discover_from_evidence(evidence, method=ClusteringMethod.BERTOPIC)

# Hybrid: BERTopic clusters + LLM-refined labels
result = await discovery.discover_from_evidence(evidence, method=ClusteringMethod.HYBRID)

Learning Artifacts

The pipeline distills each cluster into a LearningArtifact — a structured insight ready for consumption:

LearningArtifact(
    title='Mobile Checkout Failures on iOS',
    content='Multiple reports of checkout and payment flows...',
    tags=['ios', 'checkout', 'mobile'],
    confidence=0.9,
    supporting_item_ids=['bug_1', 'bug_2', 'feedback_1'],
    recommended_actions=[
        'Investigate touch event handling in mobile Safari',
        'Add fallback submit mechanism for iOS WebKit',
    ],
    counterexamples=['bug_3'],       # IDs that look similar but differ
    scope='iOS mobile checkout flow',
    when_not_to_apply='Desktop browsers or Android devices',
)

Evidence Pipeline

The EvidencePipeline orchestrates the full workflow:

graph LR
    S["Sanitize"] --> C["Cluster"]
    C --> F["Pre-filter"]
    F --> D["Distill"]
    D --> V["Validate"]
    V --> R["Recurrence Check"]
    R --> T["Tag Normalize"]
    T --> DD["Dedupe"]
    DD --> SK["Sink"]

Pipeline Stages

Stage What it does Configurable via
Sanitize Optional async text cleaning sanitizer= (implements Sanitizer protocol)
Cluster Groups evidence into themes method=, clusterer= (implements EvidenceClusterer)
Pre-filter Drops clusters below recurrence_threshold recurrence_threshold=
Distill LLM converts each cluster into a learning artifact domain_context=, max_concurrent_distillations=
Validate Repairs hallucinated IDs, enforces quality gates Automatic
Recurrence Ensures learnings span multiple unique sources recurrence_threshold=
Tag normalize Lowercase, deduplicate, cap at 10 tags Automatic
Dedupe Removes near-duplicate learnings deduper= (implements Deduper)
Sink Persists artifacts with provenance metadata sink= (implements ArtifactSink)

Configuration

pipeline = EvidencePipeline(
    # LLM settings
    model_name='gpt-4o-mini',
    llm_provider='openai',

    # Clustering
    method=ClusteringMethod.LLM,
    min_category_size=2,
    max_notes=200,

    # Distillation
    domain_context='Customer support ticket analysis',
    max_concurrent_distillations=5,

    # Quality gates
    recurrence_threshold=2,  # need 2+ unique sources per learning

    # Metadata
    metadata_config=MetadataConfig(
        include_in_clustering=True,
        include_in_distillation=True,
        allowed_keys={'severity', 'platform', 'category'},
    ),

    # Plugins
    sink=InMemorySink(),
    deduper=InMemoryDeduper(),
)

Pipeline Result

result = await pipeline.run(evidence)

result.clustering_result   # PatternDiscoveryResult — the raw clusters
result.learnings           # list[LearningArtifact] — distilled insights
result.filtered_count      # clusters dropped by pre-filter
result.deduplicated_count  # learnings removed by deduper
result.validation_repairs  # count of hallucinated-ID repairs
result.sink_ids            # IDs returned by the sink after persistence

Metadata Handling

MetadataConfig

Control how metadata flows through the pipeline:

from axion.caliber import MetadataConfig

config = MetadataConfig(
    include_in_clustering=True,      # attach metadata to clustering prompts
    include_in_distillation=True,    # attach metadata to distillation prompts
    allowed_keys={'severity', 'platform'},  # only these keys pass through
    denied_keys=set(),               # explicit denylist (PII defaults apply)
)

PII Safety

By default, metadata keys matching common PII patterns are excluded:

email, phone, name, address, ssn, password, token, credit_card

These are filtered before any data reaches an LLM. You can customize the denylist via MetadataConfig.denied_keys.

Metadata in Clusters

When include_in_clustering=True, each evidence note sent to the LLM includes a compact header:

[meta: severity=high, platform=ios] Checkout button unresponsive on mobile Safari

During distillation, cluster-level metadata is aggregated into frequency distributions so the LLM sees patterns like "severity: high (3), medium (1)".

Plugins

Sinks (Output Storage)

Dict-based storage, ideal for testing and notebooks:

from axion.caliber import InMemorySink

sink = InMemorySink()
result = await pipeline.run(evidence)

for sid, entry in sink.artifacts.items():
    print(entry['artifact'].title, entry['provenance'].timestamp)

Appends each artifact as a JSON line to a file:

from axion.caliber import JsonlSink

sink = JsonlSink(path='learnings.jsonl')
result = await pipeline.run(evidence)
# Each learning is appended as one JSON line

Implement the ArtifactSink protocol:

from axion.caliber import ArtifactSink, LearningArtifact, Provenance

class DatabaseSink:
    async def write(
        self,
        artifact: LearningArtifact,
        provenance: Provenance,
    ) -> str:
        # persist to your database
        return record_id

Dedupers

Case-insensitive title matching — lightweight, no external dependencies:

from axion.caliber import InMemoryDeduper

deduper = InMemoryDeduper()

Cosine similarity on OpenAI embeddings (threshold defaults to 0.85):

from axion.caliber import EmbeddingDeduper

deduper = EmbeddingDeduper(threshold=0.85)  # requires axion[embeddings]

Validation & Quality Gates

The pipeline automatically validates every learning artifact:

Hallucinated ID Repair

If the LLM references item IDs that don't exist in the cluster, they are silently removed. If all IDs are invalid, the learning is dropped entirely.

Confidence Demotion

Learnings with confidence >= 0.7 but no recommended actions are demoted to confidence = 0.69 — the reasoning being that a high-confidence insight should be actionable.

Recurrence Checking

A learning must be backed by evidence from multiple unique sources (controlled by recurrence_threshold). Two chunks from the same conversation count as one source — the check uses source_ref to deduplicate.

# Standalone usage
from axion.caliber.pattern_discovery._utils import validate_learning

repaired, repair_count = validate_learning(raw_learning, cluster_item_ids)

Legacy API

The original discover() method still works for annotation-based workflows:

from axion.caliber import PatternDiscovery, AnnotatedItem

annotations = {
    'rec_1': AnnotatedItem(record_id='rec_1', score=0, notes='Missing context'),
    'rec_2': AnnotatedItem(record_id='rec_2', score=0, notes='Too verbose'),
}

discovery = PatternDiscovery(model_name='gpt-4o-mini')
result = await discovery.discover(annotations, method=ClusteringMethod.LLM)

This is equivalent to converting annotations to EvidenceItem objects and calling discover_from_evidence(). The conversion happens automatically via the compatibility layer.

Using with Evaluation Issues

The InsightExtractor bridges evaluation issues with this pipeline. It converts ExtractedIssue objects (from the Issue Extractor) into EvidenceItem objects and runs them through EvidencePipeline to discover cross-metric patterns:

from axion.reporting import IssueExtractor, InsightExtractor

issues = IssueExtractor().extract_from_evaluation(results)
insights = await InsightExtractor(model_name='gpt-4o-mini').analyze(issues)

for pattern in insights.patterns:
    if pattern.is_cross_metric:
        print(f"{pattern.category}: {', '.join(pattern.metrics_involved)}")

See the Cross-Metric Insight Extraction Guide for full details.

API Summary

API Input Output Use when
PatternDiscovery.discover() Dict[str, AnnotatedItem] PatternDiscoveryResult You have annotation notes from human evaluators
PatternDiscovery.discover_from_evidence() Sequence[EvidenceItem] or Dict PatternDiscoveryResult Clustering only, from any text source
EvidencePipeline.run() Sequence[EvidenceItem] or Dict PipelineResult Full pipeline: cluster + distill + validate + dedupe + sink
InsightExtractor.analyze() IssueExtractionResult InsightResult Cross-metric pattern discovery from evaluation issues