Axion¶
White-box evaluation—Axion empowers builders with actionable signals, automated pipelines, and fully transparent metrics. See exactly why your agent succeeds or fails.
Agents are sports cars.
Evals are the seatbelt.
It won't make you faster. It won't win you anything. But it will keep a bad release from turning into a public incident. If you skip evals because you're a "good driver," you're not a serious person.
Evals are adult supervision
Not rocket science. Simple checks that prevent simple disasters. Start here, stay here, and expand only when the basics are solid.
Define "good" in one sentence
If you can't articulate what success looks like, no framework will save you. Clarity first, tooling second.
Start with pass/fail
Add nuance only after you've earned it. A binary gate catches more failures than a sophisticated rubric you never run.
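A binary gate can be as simple as a substring check per test case. A minimal sketch (the `passes` helper and the case tuples are illustrative, not Axion's API):

```python
# Hypothetical minimal pass/fail gate: each case is
# (question, required substring, agent's answer).
def passes(answer: str, must_contain: str) -> bool:
    return must_contain.lower() in answer.lower()

cases = [
    ("What is 2 + 2?", "4", "2 + 2 equals 4."),
    ("Capital of France?", "paris", "The capital of France is Paris."),
]

results = [passes(answer, expected) for _, expected, answer in cases]
pass_rate = sum(results) / len(results)
```

Gate the release on `pass_rate` staying above a threshold; that single number is enough to stop a bad build.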
Calibrate your judges
LLM judges are useful. They're also liars with confidence. Calibrate them against humans or don't pretend you measured anything.
Evaluation Flywheel
Why Ground Truth Matters
Component Arsenal¶
Core Primitives
Structured Handlers & Tool Abstractions
Build composable toolchains with pre-defined base classes for structured LLMs, tools, and knowledge retrieval. Eliminate boilerplate, enforce consistency, and focus on your logic.
API Integrations
Extensible Backend Access
Base API classes with built-in tracing and authentication support. Build your own API integrations with ease or extend the provided abstractions.
Evaluation Engine & Metric Suite
Built-in & Open-Source Friendly
Define experiments, run batch evaluations, calibrate judges, and score using our native metrics, or integrate with open libraries for broader experimentation coverage.
RAG Toolbox
Everything Retrieval: Chunking, Grounding, Response Assembly
End-to-end support for grounding pipelines with modular components you can reuse across use cases.
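The chunking step of a grounding pipeline can be pictured as a fixed-size window slid across the text with overlap, so context isn't lost at chunk boundaries. A self-contained sketch (the `chunk` function is illustrative, not Axion's chunker):

```python
# Hypothetical fixed-size chunker with overlap between consecutive chunks.
def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Slide a window of `size` characters, stepping by `size - overlap`."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk("a" * 500, size=200, overlap=50)
```

Each consecutive pair of chunks shares 50 characters, so a sentence split by one boundary is intact in the neighboring chunk.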
Observability at Its Core
Trace, Log, Debug with ConfidenceNative support for Logfire, structured logging, and run tracking gives you production-grade visibility across every step of your AI pipeline.
Designed for Scale
Async-Native, Pydantic-Validated, Error Resilient
Async support everywhere. Predictable, structured I/O with Pydantic validation. Robust error handling out of the box.
Hierarchical Scoring¶
What sets Axion apart
Our scoring framework is hierarchical by design—moving from a single overall score down into layered sub-scores. This delivers a diagnostic map of quality, not just a number.
┌─────────────────┐
│ Overall Score │
│ 0.82 │
└────────┬────────┘
│
┌──────────────────┼──────────────────┐
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Relevance │ │ Accuracy │ │ Tone │
│ 0.91 │ │ 0.78 │ │ 0.85 │
└───────────┘ └───────────┘ └───────────┘
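The roll-up in the diagram is a weighted sum of the sub-scores. A minimal sketch, with hypothetical weights chosen so the numbers reproduce the diagram (they are not defaults):

```python
# Sub-scores from the diagram above.
sub_scores = {"Relevance": 0.91, "Accuracy": 0.78, "Tone": 0.85}

# Illustrative weights; picked here so the weighted sum matches 0.82.
weights = {"Relevance": 0.2, "Accuracy": 0.6, "Tone": 0.2}

overall = sum(sub_scores[k] * weights[k] for k in sub_scores)
```

Because the overall number is just this weighted combination, a drop in it can always be traced back to the specific sub-score that moved.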
Instant Root Cause Diagnosis
Drill down to pinpoint whether issues stem from relevance, accuracy, tone, or other dimensions—no more guessing from flat scores.
Strategic Prioritization
Forces clarity on what really matters for your business by breaking quality into weighted layers.
Actionable Feedback Loop
Each layer translates directly into actions—retraining, prompt adjustments, or alignment tuning.
Customizable to Business Goals
Weight and expand dimensions to match your unique KPIs. Define what "good AI" means for you.
from axion.runners import evaluation_runner
from axion.metrics import AnswerRelevancy
from axion.dataset import DatasetItem

# Define hierarchical scoring configuration
config = {
    'metric': {
        'Relevance': AnswerRelevancy(metric_name='Relevancy'),
    },
    'model': {
        'ANSWER_QUALITY': {'Relevance': 1.0},
    },
    'weights': {
        'ANSWER_QUALITY': 1.0,
    },
}

# data_item: a DatasetItem holding the query and the agent's answer
results = evaluation_runner(
    evaluation_inputs=[data_item],
    scoring_config=config,  # Or pass a path to config.yaml
)

# Generate a scorecard with the hierarchical breakdown
results.to_scorecard()
Learn more about Hierarchical Scoring →
Why "Axion"?
Agent X-Ray Inspection & Optimization Network
The name draws inspiration from the axion, a hypothetical particle proposed to solve the "strong CP problem" in quantum chromodynamics. The particle was proposed independently by physicists Frank Wilczek and Steven Weinberg; Wilczek named it after a laundry detergent, hoping it would "clean up" the theoretical mess.
- ✶ **Incredibly small, immensely powerful.** Axions may account for the universe's dark matter through sheer numbers. This toolkit offers small, focused tools that combine to tackle AI evaluation at scale.
- ✧ **Designed to clean things up.** Named after a detergent to "clean up" a theoretical mess, and built to bring clarity and structure to the messy problem of agent evaluation.
- ⬢ **Modular by nature.** Lightweight components that work together to solve complex problems. Composable building blocks, not a monolithic framework.