Axion¶
White-box evaluation—Axion empowers builders with actionable signals, automated pipelines, and fully transparent metrics. See exactly why your agent succeeds or fails.
Agents are sports cars.
Evals are the seatbelt.
It won't make you faster. It won't win you anything. But it will keep a bad release from turning into a public incident. If you skip evals because you're a "good driver," you're not a serious person.
Evals are adult supervision
Not rocket science. Simple checks that prevent simple disasters. Start here, stay here, and expand only when the basics are solid.
Define "good" in one sentence
If you can't articulate what success looks like, no framework will save you. Clarity first, tooling second.
Start with pass/fail
Add nuance only after you've earned it. A binary gate catches more failures than a sophisticated rubric you never run.
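A binary gate can be as simple as a substring check per test case. A minimal sketch (the `passes` helper and the case tuples are illustrative, not Axion's API):

```python
# Hypothetical minimal pass/fail gate: each case is
# (question, required substring, agent's answer).
def passes(answer: str, must_contain: str) -> bool:
    return must_contain.lower() in answer.lower()

cases = [
    ("What is 2 + 2?", "4", "2 + 2 equals 4."),
    ("Capital of France?", "paris", "The capital of France is Paris."),
]

results = [passes(answer, expected) for _, expected, answer in cases]
pass_rate = sum(results) / len(results)
```

Gate the release on `pass_rate` staying above a threshold; that single number is enough to stop a bad build.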
Calibrate your judges
LLM judges are useful. They're also liars with confidence. Calibrate them against humans or don't pretend you measured anything.
Evaluation Flywheel
Why Ground Truth Matters
Component Arsenal¶
Core Primitives
Structured Handlers & Tool Abstractions
Build composable toolchains with pre-defined base classes for structured LLMs, tools, and knowledge retrieval. Eliminate boilerplate, enforce consistency, and focus on your logic.
API Integrations
Extensible Backend Access
Base API classes with built-in tracing and authentication support. Build your own API integrations with ease or extend the provided abstractions.
Evaluation Engine & Metric Suite
Built-in & Open-Source Friendly
Define experiments, run batch evaluations, calibrate judges, and score using our native metrics, or integrate with open libraries for broader experimentation coverage.
RAG Toolbox
Everything Retrieval: Chunking, Grounding, Response Assembly
End-to-end support for grounding pipelines with modular components you can reuse across use cases.
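The chunking step of a grounding pipeline can be pictured as a fixed-size window slid across the text with overlap, so context isn't lost at chunk boundaries. A self-contained sketch (the `chunk` function is illustrative, not Axion's chunker):

```python
# Hypothetical fixed-size chunker with overlap between consecutive chunks.
def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Slide a window of `size` characters, stepping by `size - overlap`."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk("a" * 500, size=200, overlap=50)
```

Each consecutive pair of chunks shares 50 characters, so a sentence split by one boundary is intact in the neighboring chunk.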
Observability at Its Core
Trace, Log, Debug with ConfidenceNative support for Logfire, structured logging, and run tracking gives you production-grade visibility across every step of your AI pipeline.
Designed for Scale
Async-Native, Pydantic-Validated, Error Resilient
Async support everywhere. Predictable, structured I/O with Pydantic validation. Robust error handling out of the box.
Hierarchical Scoring¶
What sets Axion apart
Our scoring framework is hierarchical by design—moving from a single overall score down into layered sub-scores. This delivers a diagnostic map of quality, not just a number.
┌─────────────────┐
│ Overall Score │
│ 0.82 │
└────────┬────────┘
│
┌──────────────────┼──────────────────┐
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Relevance │ │ Accuracy │ │ Tone │
│ 0.91 │ │ 0.78 │ │ 0.85 │
└───────────┘ └───────────┘ └───────────┘
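The roll-up in the diagram is a weighted sum of the sub-scores. A minimal sketch, with hypothetical weights chosen so the numbers reproduce the diagram (they are not defaults):

```python
# Sub-scores from the diagram above.
sub_scores = {"Relevance": 0.91, "Accuracy": 0.78, "Tone": 0.85}

# Illustrative weights; picked here so the weighted sum matches 0.82.
weights = {"Relevance": 0.2, "Accuracy": 0.6, "Tone": 0.2}

overall = sum(sub_scores[k] * weights[k] for k in sub_scores)
```

Because the overall number is just this weighted combination, a drop in it can always be traced back to the specific sub-score that moved.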
Instant Root Cause Diagnosis
Drill down to pinpoint whether issues stem from relevance, accuracy, tone, or other dimensions—no more guessing from flat scores.
Strategic Prioritization
Forces clarity on what really matters for your business by breaking quality into weighted layers.
Actionable Feedback Loop
Each layer translates directly into actions—retraining, prompt adjustments, or alignment tuning.
Customizable to Business Goals
Weight and expand dimensions to match your unique KPIs. Define what "good AI" means for you.
from axion.runners import evaluation_runner
from axion.metrics import AnswerRelevancy
from axion.dataset import DatasetItem

# Define hierarchical scoring configuration
config = {
    'metric': {
        'Relevance': AnswerRelevancy(metric_name='Relevancy'),
    },
    'model': {
        'ANSWER_QUALITY': {'Relevance': 1.0},
    },
    'weights': {
        'ANSWER_QUALITY': 1.0,
    },
}

# data_item: a DatasetItem holding the query and the agent's answer
results = evaluation_runner(
    evaluation_inputs=[data_item],
    scoring_config=config,  # Or pass a path to config.yaml
)

# Generate a scorecard with the hierarchical breakdown
results.to_scorecard()
Learn more about Hierarchical Scoring →
Why "Axion"?
Agent X-Ray Inspection & Optimization Network
The name draws inspiration from the axion, a hypothetical particle proposed to solve the "strong CP problem" in quantum chromodynamics. The particle was proposed independently by physicists Frank Wilczek and Steven Weinberg; Wilczek named it after a laundry detergent, hoping it would "clean up" the theoretical mess.
- ✶ **Incredibly small, immensely powerful.** Axions may account for the universe's dark matter through sheer numbers. This toolkit offers small, focused tools that combine to tackle AI evaluation at scale.
- ✧ **Designed to clean things up.** Named after a detergent to "clean up" a theoretical mess, and built to bring clarity and structure to the messy problem of agent evaluation.
- ⬢ **Modular by nature.** Lightweight components that work together to solve complex problems. Composable building blocks, not a monolithic framework.