Faithfulness

Measure factual consistency between AI responses and source documents
LLM-Powered · Knowledge · Single Turn

At a Glance

🎯
Score Range
0.0 ──────── 1.0
Clamped from weighted average
⚡
Default Threshold
0.5
Pass/fail cutoff
📋
Required Inputs
query · actual_output · retrieved_content
Three fields needed

What It Measures

Faithfulness evaluates whether every claim in the AI's response can be directly inferred from the provided source material. It acts as your primary defense against hallucinations—ensuring the AI summarizes existing knowledge rather than inventing facts.

Score Interpretation
1.0 Every claim is fully supported by context
0.7+ Most claims supported, minor gaps
0.5 Threshold—mixture of supported and unsupported
< 0.5 Significant hallucinations or contradictions
✅ Use When
  • RAG systems & document Q&A
  • Knowledge base assistants
  • Summarization tasks
  • Any system with retrieved context
❌ Don't Use When
  • Creative writing / brainstorming
  • Opinion or preference questions
  • No retrieved context available
  • Open-ended generation tasks

See Also: Answer Relevancy

Faithfulness checks if claims are grounded in the source context (factual accuracy). Answer Relevancy checks if statements address the user's query (topical alignment).

Use both together for comprehensive RAG evaluation.


How It Works

The metric uses an Evaluator LLM to decompose the response into atomic claims, then verify each against the retrieved context.

Step-by-Step Process

flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[Retrieved Context]
        C[AI Response]
    end

    subgraph EXTRACT["🔍 Step 1: Claim Extraction"]
        D[StatementExtractor LLM]
        E["Atomic Claims<br/><small>Self-contained, verifiable</small>"]
    end

    subgraph VERIFY["⚖️ Step 2: Verification"]
        F[FaithfulnessJudge LLM]
        G["Verdict per Claim"]
    end

    subgraph SCORE["📊 Step 3: Scoring"]
        H["Sum Weighted Verdicts"]
        I["Clamp to [0, 1]"]
        J["Final Score"]
    end

    A & B & C --> D
    D --> E
    E --> F
    B --> F
    F --> G
    G --> H
    H --> I
    I --> J

    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EXTRACT stroke:#3b82f6,stroke-width:2px
    style VERIFY stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style J fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff

Each extracted claim receives a verdict with a corresponding weight. The final score is the weighted average, clamped to [0, 1].

✅ FULLY_SUPPORTED
+1.0

The claim is explicitly stated in the context. Direct evidence exists.

⚠️ PARTIALLY_SUPPORTED
+0.5

The core subject is correct, but the claim exaggerates certainty or contains minor inaccuracies.

❓ NO_EVIDENCE
0.0

The context contains no information to verify the claim; this is a hallucination.

❌ CONTRADICTORY
-1.0

Evidence directly contradicts the claim. Critical factual error.

Score Formula

score = max(0.0, min(1.0, sum(verdict_weights) / total_claims))
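As a quick sanity check, the formula can be reproduced with the documented default weights. The helper below is an illustrative sketch, not part of the axion API:

```python
# Default verdict weights as documented above
DEFAULT_WEIGHTS = {
    "FULLY_SUPPORTED": 1.0,
    "PARTIALLY_SUPPORTED": 0.5,
    "NO_EVIDENCE": 0.0,
    "CONTRADICTORY": -1.0,
}

def faithfulness_score(verdicts, weights=DEFAULT_WEIGHTS):
    """Weighted average of verdict weights, clamped to [0, 1]."""
    if not verdicts:
        return 0.0
    total = sum(weights[v] for v in verdicts)
    return max(0.0, min(1.0, total / len(verdicts)))

print(faithfulness_score(["FULLY_SUPPORTED", "NO_EVIDENCE"]))  # 0.5
print(faithfulness_score(["CONTRADICTORY"]))                   # 0.0
```

Note that the clamp means a single CONTRADICTORY verdict can zero out an otherwise short response, which is the intended behavior for critical factual errors.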

Configuration

Parameter Type Default Description
strict_mode bool False When True, NO_EVIDENCE verdicts receive -1.0 (same as contradictions), heavily penalizing hallucinations
verdict_scores Dict[str, float] None Custom override for verdict weights. Takes precedence over strict_mode
mode EvaluationMode GRANULAR Evaluation detail level (GRANULAR or HOLISTIC)

Strict Mode

Enable strict_mode=True for high-stakes domains (legal, medical, financial) where any uncited claim is unacceptable—even if not directly contradicted.

Custom Verdict Weights

Override the default verdict weights for domain-specific calibration:

from axion.metrics import Faithfulness

# Extra penalty for contradictions, higher partial credit
metric = Faithfulness(
    verdict_scores={
        'FULLY_SUPPORTED': 1.0,
        'PARTIALLY_SUPPORTED': 0.75,  # More generous
        'NO_EVIDENCE': -0.5,          # Moderate penalty
        'CONTRADICTORY': -2.0,        # Severe penalty
    }
)

Code Examples

from axion.metrics import Faithfulness
from axion.dataset import DatasetItem

# Initialize with defaults
metric = Faithfulness()

item = DatasetItem(
    query="What is the infield fly rule in baseball?",
    actual_output="The infield fly rule prevents the defense from intentionally dropping a fly ball to turn a double play.",
    retrieved_content=[
        "The infield fly rule prevents unfair advantage.",
        "Applies with runners on first and second.",
    ],
)

result = await metric.execute(item)
print(result.pretty())


from axion.metrics import Faithfulness

# Zero tolerance for hallucinations
metric = Faithfulness(strict_mode=True)

# Any NO_EVIDENCE claim now scores -1.0 instead of 0.0
# This dramatically lowers scores for responses with uncited claims


from axion.metrics import Faithfulness
from axion.runners import MetricRunner

# Initialize with strict mode
faithfulness = Faithfulness(strict_mode=True)

runner = MetricRunner(metrics=[faithfulness])
results = await runner.run(dataset)

# Access detailed breakdown
for item_result in results:
    print(f"Score: {item_result.score}")
    print(f"Claims analyzed: {item_result.data.total_claims}")
    for claim in item_result.data.judged_claims:
        print(f"  - {claim.verdict}: {claim.text}")

Metric Diagnostics

Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given—no black boxes.

result = await metric.execute(item)
print(result.pretty())      # Human-readable summary
result.signals              # Full diagnostic breakdown
📊 FaithfulnessResult Structure
FaithfulnessResult(
{
    "overall_score": 0.5,
    "total_claims": 2,
    "verdict_counts": {
        "fully_supported": 1,
        "partially_supported": 0,
        "no_evidence": 1,
        "contradictory": 0
    },
    "judged_claims": [
        {
            "claim_text": "The infield fly rule prevents the defense from intentionally dropping a fly ball.",
            "faithfulness_verdict": "Fully Supported",
            "reason": "The evidence states that the infield fly rule prevents the defense from intentionally dropping a catchable fly ball."
        },
        {
            "claim_text": "The infield fly rule is designed to prevent an easy double play when runners are on base.",
            "faithfulness_verdict": "No Evidence",
            "reason": "The evidence does not mention anything about preventing an easy double play when runners are on base."
        }
    ]
}
)

Signal Fields

Field Type Description
overall_score float The 0-1 faithfulness score
total_claims int Total claims extracted from the response
verdict_counts Dict Breakdown by verdict type (fully_supported, partially_supported, no_evidence, contradictory)
judged_claims List Per-claim verdict details

Judged Claim Fields

Field Type Description
claim_text str The extracted claim text
faithfulness_verdict str Verdict: Fully Supported, Partially Supported, No Evidence, or Contradictory
reason str Human-readable explanation for the verdict
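These fields make it straightforward to surface failing claims in your own tooling. The sketch below operates on a plain dict shaped like the FaithfulnessResult example above; the field names follow the tables, while the filtering logic itself is illustrative:

```python
# A result dict mirroring the documented FaithfulnessResult structure
result = {
    "overall_score": 0.5,
    "total_claims": 2,
    "judged_claims": [
        {"claim_text": "The rule prevents intentionally dropping a fly ball.",
         "faithfulness_verdict": "Fully Supported",
         "reason": "Stated directly in the evidence."},
        {"claim_text": "The rule prevents an easy double play.",
         "faithfulness_verdict": "No Evidence",
         "reason": "The evidence does not mention double plays."},
    ],
}

# Surface only the claims that dragged the score down
FAILING = {"No Evidence", "Contradictory"}
flagged = [c for c in result["judged_claims"]
           if c["faithfulness_verdict"] in FAILING]

for claim in flagged:
    print(f'[{claim["faithfulness_verdict"]}] {claim["claim_text"]}: {claim["reason"]}')
```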

Example Scenarios

✅ Scenario 1: Perfect Faithfulness (Score: 1.0)

FULLY_SUPPORTED

Context:

"The Apollo 11 mission launched on July 16, 1969. Neil Armstrong was the mission commander. The lunar module was named Eagle."

AI Response:

"Apollo 11 launched in July 1969 with Neil Armstrong as commander. The lunar module was called Eagle."

Analysis:

Claim Verdict Weight
Apollo 11 launched in July 1969 FULLY_SUPPORTED +1.0
Neil Armstrong was commander FULLY_SUPPORTED +1.0
Lunar module was called Eagle FULLY_SUPPORTED +1.0

Final Score: 3.0 / 3 = 1.0

⚠️ Scenario 2: Partial Support (Score: 0.5)

Mixed Verdicts

Context:

"Our refund policy allows returns within 30 days. Items must be unused and in original packaging."

AI Response:

"You can return items within 30 days if unused. Refunds are processed within 24 hours."

Analysis:

Claim Verdict Weight
Returns within 30 days if unused FULLY_SUPPORTED +1.0
Refunds processed within 24 hours NO_EVIDENCE 0.0

Final Score: 1.0 / 2 = 0.5

In strict mode: (1.0 + (-1.0)) / 2 = 0.0
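The default-versus-strict arithmetic for this scenario can be verified directly. The weights are the documented ones; the helper function is illustrative:

```python
def score(verdicts, weights):
    # Weighted average clamped to [0, 1], per the score formula
    return max(0.0, min(1.0, sum(weights[v] for v in verdicts) / len(verdicts)))

default = {"FULLY_SUPPORTED": 1.0, "PARTIALLY_SUPPORTED": 0.5,
           "NO_EVIDENCE": 0.0, "CONTRADICTORY": -1.0}
strict = {**default, "NO_EVIDENCE": -1.0}  # strict_mode penalizes uncited claims

claims = ["FULLY_SUPPORTED", "NO_EVIDENCE"]  # Scenario 2's two claims
print(score(claims, default))  # 0.5
print(score(claims, strict))   # 0.0
```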

❌ Scenario 3: Contradiction (Score: 0.0)

CONTRADICTORY

Context:

"The maximum dosage is 500mg per day. Do not exceed this limit."

AI Response:

"You can safely take up to 1000mg daily."

Analysis:

Claim Verdict Weight
Safe to take up to 1000mg daily CONTRADICTORY -1.0

Final Score: max(0, -1.0 / 1) = 0.0

Critical: This response could cause patient harm.


Why It Matters

🛡️ Risk Mitigation

Primary guardrail against hallucinations. Protects your brand from legal and reputational liability caused by invented facts.

✅ User Trust

Essential for high-stakes domains (legal, financial, medical) where users must trust the AI is summarizing, not creating.

🔍 Debug Isolation

Distinguishes retrieval errors (wrong docs found) from generation errors (right docs ignored).


Quick Reference

TL;DR

Faithfulness = Does the AI's response stick to the facts in the retrieved documents?

  • Use it when: You need to ensure AI responses don't contain hallucinations
  • Score interpretation: Higher = more grounded in source material
  • Key config: Enable strict_mode for zero-tolerance on uncited claims