Faithfulness¶
LLM-Powered · Knowledge · Single Turn
At a Glance¶
- **Score Range:** 0.0 – 1.0 (clamped from a weighted average)
- **Default Threshold:** 0.5 (pass/fail cutoff)
- **Required Inputs:** `query`, `actual_output`, `retrieved_content`
What It Measures
Faithfulness evaluates whether every claim in the AI's response can be directly inferred from the provided source material. It acts as your primary defense against hallucinations—ensuring the AI summarizes existing knowledge rather than inventing facts.
| Score | Interpretation |
|---|---|
| 1.0 | Every claim is fully supported by context |
| 0.7+ | Most claims supported, minor gaps |
| 0.5 | Threshold—mixture of supported and unsupported |
| < 0.5 | Significant hallucinations or contradictions |
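In code, the pass/fail decision is a simple comparison against the default threshold. A minimal sketch (it assumes scores exactly at the threshold count as passing, which the table above does not state explicitly):

```python
DEFAULT_THRESHOLD = 0.5

def passes(score: float, threshold: float = DEFAULT_THRESHOLD) -> bool:
    """A faithfulness score at or above the threshold counts as a pass."""
    return score >= threshold
```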
**Use it for:**

- RAG systems & document Q&A
- Knowledge base assistants
- Summarization tasks
- Any system with retrieved context

**Avoid it for:**

- Creative writing / brainstorming
- Opinion or preference questions
- Systems with no retrieved context
- Open-ended generation tasks
See Also: Answer Relevancy
Faithfulness checks if claims are grounded in the source context (factual accuracy). Answer Relevancy checks if statements address the user's query (topical alignment).
Use both together for comprehensive RAG evaluation.
How It Works¶
The metric uses an Evaluator LLM to decompose the response into atomic claims, then verify each against the retrieved context.
Step-by-Step Process¶
```mermaid
flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[Retrieved Context]
        C[AI Response]
    end
    subgraph EXTRACT["🔍 Step 1: Claim Extraction"]
        D[StatementExtractor LLM]
        E["Atomic Claims<br/><small>Self-contained, verifiable</small>"]
    end
    subgraph VERIFY["⚖️ Step 2: Verification"]
        F[FaithfulnessJudge LLM]
        G["Verdict per Claim"]
    end
    subgraph SCORE["📊 Step 3: Scoring"]
        H["Sum Weighted Verdicts"]
        I["Clamp to [0, 1]"]
        J["Final Score"]
    end
    A & B & C --> D
    D --> E
    E --> F
    B --> F
    F --> G
    G --> H
    H --> I
    I --> J
    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EXTRACT stroke:#3b82f6,stroke-width:2px
    style VERIFY stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style J fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff
```
Each extracted claim receives a verdict with a corresponding weight. The final score is the weighted average, clamped to [0, 1].
- **FULLY_SUPPORTED** (+1.0): The claim is explicitly stated in the context; direct evidence exists.
- **PARTIALLY_SUPPORTED**: The core subject is correct, but the claim exaggerates certainty or has minor inaccuracies.
- **NO_EVIDENCE** (0.0 by default, -1.0 in strict mode): The context contains no information to verify the claim; treated as a hallucination.
- **CONTRADICTORY** (-1.0): The evidence directly contradicts the claim; a critical factual error.
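The weighted-average-and-clamp step can be sketched in a few lines. The weights below are assumptions inferred from the worked scenarios later on this page (+1.0 fully supported, 0.0 no evidence by default, -1.0 contradictory); the +0.5 partial-support weight is a guess, not documented behavior:

```python
# Sketch of the scoring step: weighted average of per-claim verdicts,
# clamped to [0, 1]. Weights are assumptions inferred from the worked
# scenarios; the +0.5 partial-support weight is a guess.
DEFAULT_WEIGHTS = {
    "fully_supported": 1.0,
    "partially_supported": 0.5,
    "no_evidence": 0.0,       # scored as -1.0 when strict_mode=True
    "contradictory": -1.0,
}

def faithfulness_score(verdicts: list[str], strict: bool = False) -> float:
    weights = dict(DEFAULT_WEIGHTS)
    if strict:
        weights["no_evidence"] = -1.0
    raw = sum(weights[v] for v in verdicts) / len(verdicts)
    return max(0.0, min(1.0, raw))  # clamp to [0, 1]
```

Under these assumed weights, a response with one supported and one unverifiable claim scores 0.5 by default and 0.0 in strict mode.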
Configuration¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `strict_mode` | `bool` | `False` | When `True`, `NO_EVIDENCE` verdicts receive -1.0 (same as contradictions), heavily penalizing hallucinations |
| `verdict_scores` | `Dict[str, float]` | `None` | Custom override for verdict weights; takes precedence over `strict_mode` |
| `mode` | `EvaluationMode` | `GRANULAR` | Evaluation detail level (`GRANULAR` or `HOLISTIC`) |
Strict Mode
Enable strict_mode=True for high-stakes domains (legal, medical, financial) where any uncited claim is unacceptable—even if not directly contradicted.
Override default verdict weights for domain-specific calibration:
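A sketch of such an override is shown below. The dictionary keys are assumed to match the verdict names reported in `verdict_counts`, and the specific values are illustrative only, not recommendations:

```python
from axion.metrics import Faithfulness

# Illustrative calibration (values are assumptions, not recommendations):
# penalize unverifiable claims mildly instead of scoring them as neutral.
metric = Faithfulness(
    verdict_scores={
        "fully_supported": 1.0,
        "partially_supported": 0.5,
        "no_evidence": -0.25,
        "contradictory": -1.0,
    }
)
```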
Code Examples¶
```python
from axion.metrics import Faithfulness
from axion.dataset import DatasetItem

# Initialize with defaults
metric = Faithfulness()

item = DatasetItem(
    query="What is the infield fly rule in baseball?",
    actual_output="The infield fly rule prevents the defense from intentionally dropping a fly ball to turn a double play.",
    retrieved_content=[
        "The infield fly rule prevents unfair advantage.",
        "Applies with runners on first and second.",
    ],
)

result = await metric.execute(item)
print(result.pretty())
```
```python
from axion.metrics import Faithfulness
from axion.runners import MetricRunner

# Initialize with strict mode
faithfulness = Faithfulness(strict_mode=True)

runner = MetricRunner(metrics=[faithfulness])
results = await runner.run(dataset)

# Access the detailed breakdown
for item_result in results:
    print(f"Score: {item_result.score}")
    print(f"Claims analyzed: {item_result.data.total_claims}")
    for claim in item_result.data.judged_claims:
        print(f"  - {claim.verdict}: {claim.text}")
```
Metric Diagnostics¶
Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given—no black boxes.
```python
result = await metric.execute(item)
print(result.pretty())  # Human-readable summary
result.signals          # Full diagnostic breakdown
```
📊 FaithfulnessResult Structure
```python
FaithfulnessResult(
    {
        "overall_score": 0.5,
        "total_claims": 2,
        "verdict_counts": {
            "fully_supported": 1,
            "partially_supported": 0,
            "no_evidence": 1,
            "contradictory": 0
        },
        "judged_claims": [
            {
                "claim_text": "The infield fly rule prevents the defense from intentionally dropping a fly ball.",
                "faithfulness_verdict": "Fully Supported",
                "reason": "The evidence states that the infield fly rule prevents the defense from intentionally dropping a catchable fly ball."
            },
            {
                "claim_text": "The infield fly rule is designed to prevent an easy double play when runners are on base.",
                "faithfulness_verdict": "No Evidence",
                "reason": "The evidence does not mention anything about preventing an easy double play when runners are on base."
            }
        ]
    }
)
```
Signal Fields¶
| Field | Type | Description |
|---|---|---|
| `overall_score` | `float` | The 0–1 faithfulness score |
| `total_claims` | `int` | Total claims extracted from the response |
| `verdict_counts` | `Dict` | Breakdown by verdict type (`fully_supported`, `partially_supported`, `no_evidence`, `contradictory`) |
| `judged_claims` | `List` | Per-claim verdict details |
Judged Claim Fields¶
| Field | Type | Description |
|---|---|---|
| `claim_text` | `str` | The extracted claim text |
| `faithfulness_verdict` | `str` | Verdict: Fully Supported, Partially Supported, No Evidence, or Contradictory |
| `reason` | `str` | Human-readable explanation for the verdict |
Example Scenarios¶
✅ Scenario 1: Perfect Faithfulness (Score: 1.0)
FULLY_SUPPORTED
Context:
"The Apollo 11 mission launched on July 16, 1969. Neil Armstrong was the mission commander. The lunar module was named Eagle."
AI Response:
"Apollo 11 launched in July 1969 with Neil Armstrong as commander. The lunar module was called Eagle."
Analysis:
| Claim | Verdict | Weight |
|---|---|---|
| Apollo 11 launched in July 1969 | FULLY_SUPPORTED | +1.0 |
| Neil Armstrong was commander | FULLY_SUPPORTED | +1.0 |
| Lunar module was called Eagle | FULLY_SUPPORTED | +1.0 |
Final Score: 3.0 / 3 = 1.0
⚠️ Scenario 2: Partial Support (Score: 0.5)
Mixed Verdicts
Context:
"Our refund policy allows returns within 30 days. Items must be unused and in original packaging."
AI Response:
"You can return items within 30 days if unused. Refunds are processed within 24 hours."
Analysis:
| Claim | Verdict | Weight |
|---|---|---|
| Returns within 30 days if unused | FULLY_SUPPORTED | +1.0 |
| Refunds processed within 24 hours | NO_EVIDENCE | 0.0 |
Final Score: 1.0 / 2 = 0.5
In strict mode: (1.0 + -1.0) / 2 = 0.0
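Spelling out the arithmetic for this scenario, one fully supported claim carries +1.0, while the unverifiable claim carries 0.0 by default and -1.0 under strict mode:

```python
# Scenario 2 arithmetic: one fully supported claim (+1.0) and one
# unverifiable claim (0.0 by default, -1.0 under strict_mode).
default_weights = [1.0, 0.0]
strict_weights = [1.0, -1.0]

default_score = max(0.0, sum(default_weights) / len(default_weights))
strict_score = max(0.0, sum(strict_weights) / len(strict_weights))
print(default_score, strict_score)  # 0.5 0.0
```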
❌ Scenario 3: Contradiction (Score: 0.0)
CONTRADICTORY
Context:
"The maximum dosage is 500mg per day. Do not exceed this limit."
AI Response:
"You can safely take up to 1000mg daily."
Analysis:
| Claim | Verdict | Weight |
|---|---|---|
| Safe to take up to 1000mg daily | CONTRADICTORY | -1.0 |
Final Score: max(0, -1.0 / 1) = 0.0
Critical: This response could cause patient harm.
Why It Matters¶
Primary guardrail against hallucinations. Protects your brand from legal and reputational liability caused by invented facts.
Essential for high-stakes domains (legal, financial, medical) where users must trust the AI is summarizing, not creating.
Distinguishes retrieval errors (wrong docs found) from generation errors (right docs ignored).
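One way to act on that distinction is to pair the faithfulness score with a retrieval-quality signal such as context precision. The helper below is purely illustrative: the function name and the 0.5 cutoffs are assumptions, not part of the library:

```python
def diagnose(faithfulness: float, context_precision: float) -> str:
    """Illustrative triage: separate retrieval failures from generation failures."""
    if context_precision < 0.5:
        return "retrieval error: the wrong documents were found"
    if faithfulness < 0.5:
        return "generation error: the right documents were ignored"
    return "grounded: the response is supported by the retrieved context"
```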
Quick Reference¶
TL;DR
Faithfulness = Does the AI's response stick to the facts in the retrieved documents?
- Use it when: You need to ensure AI responses don't contain hallucinations
- Score interpretation: Higher = more grounded in source material
- Key config: Enable `strict_mode` for zero-tolerance on uncited claims
- API Reference
- Related Metrics: Answer Relevancy · Context Precision · Factual Accuracy