Faithfulness

Measure factual consistency between AI responses and source documents
LLM-Powered · Knowledge · Single Turn

At a Glance

🎯
Score Range
0.0 ──────── 1.0
Clamped from weighted average
⚡
Default Threshold
0.5
Pass/fail cutoff
📋
Required Inputs
query · actual_output · retrieved_content
Three fields needed

What It Measures

Faithfulness evaluates whether every claim in the AI's response can be directly inferred from the provided source material. It acts as your primary defense against hallucinations—ensuring the AI summarizes existing knowledge rather than inventing facts.

Score Interpretation
1.0 Every claim is fully supported by context
0.7+ Most claims supported, minor gaps
0.5 Threshold—mixture of supported and unsupported
< 0.5 Significant hallucinations or contradictions
✅ Use When
  • RAG systems & document Q&A
  • Knowledge base assistants
  • Summarization tasks
  • Any system with retrieved context
❌ Don't Use When
  • Creative writing / brainstorming
  • Opinion or preference questions
  • No retrieved context available
  • Open-ended generation tasks

See Also: Answer Relevancy

Faithfulness checks if claims are grounded in the source context (factual accuracy). Answer Relevancy checks if statements address the user's query (topical alignment).

Use both together for comprehensive RAG evaluation.


How It Works

The metric uses an Evaluator LLM to decompose the response into atomic claims, then verify each against the retrieved context.

Step-by-Step Process

flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[Retrieved Context]
        C[AI Response]
    end

    subgraph EXTRACT["🔍 Step 1: Claim Extraction"]
        D[StatementExtractor LLM]
        E["Atomic Claims<br/><small>Self-contained, verifiable</small>"]
    end

    subgraph VERIFY["⚖️ Step 2: Verification"]
        F[FaithfulnessJudge LLM]
        G["Verdict per Claim"]
    end

    subgraph SCORE["📊 Step 3: Scoring"]
        H["Sum Weighted Verdicts"]
        I["Clamp to [0, 1]"]
        J["Final Score"]
    end

    A & B & C --> D
    D --> E
    E --> F
    B --> F
    F --> G
    G --> H
    H --> I
    I --> J

    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EXTRACT stroke:#3b82f6,stroke-width:2px
    style VERIFY stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style J fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff

Each extracted claim receives a verdict with a corresponding weight. The final score is the weighted average, clamped to [0, 1].

✅ FULLY_SUPPORTED
+1.0

The claim is explicitly stated in the context. Direct evidence exists.

⚠️ PARTIALLY_SUPPORTED
+0.5

The core subject is correct, but the claim exaggerates certainty or contains minor inaccuracies.

❓ NO_EVIDENCE
0.0

The context contains no information to verify the claim; this is a hallucination.

❌ CONTRADICTORY
-1.0

Evidence directly contradicts the claim. Critical factual error.

Score Formula

score = max(0.0, min(1.0, sum(verdict_weights) / total_claims))
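As a quick sanity check, the formula can be reproduced with the documented default weights. The helper below is an illustrative sketch, not part of the axion API:

```python
# Default verdict weights as documented above
DEFAULT_WEIGHTS = {
    "FULLY_SUPPORTED": 1.0,
    "PARTIALLY_SUPPORTED": 0.5,
    "NO_EVIDENCE": 0.0,
    "CONTRADICTORY": -1.0,
}

def faithfulness_score(verdicts, weights=DEFAULT_WEIGHTS):
    """Weighted average of verdict weights, clamped to [0, 1]."""
    if not verdicts:
        return 0.0
    total = sum(weights[v] for v in verdicts)
    return max(0.0, min(1.0, total / len(verdicts)))

print(faithfulness_score(["FULLY_SUPPORTED", "NO_EVIDENCE"]))  # 0.5
print(faithfulness_score(["CONTRADICTORY"]))                   # 0.0
```

Note that the clamp means a single CONTRADICTORY verdict can zero out an otherwise short response, which is the intended behavior for critical factual errors.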

Configuration

Parameter Type Default Description
strict_mode bool False When True, NO_EVIDENCE verdicts receive -1.0 (same as contradictions), heavily penalizing hallucinations
verdict_scores Dict[str, float] None Custom override for verdict weights. Takes precedence over strict_mode
mode EvaluationMode GRANULAR Evaluation detail level (GRANULAR or HOLISTIC)

Strict Mode

Enable strict_mode=True for high-stakes domains (legal, medical, financial) where any uncited claim is unacceptable—even if not directly contradicted.

Custom Verdict Weights

Override the default verdict weights for domain-specific calibration:

from axion.metrics import Faithfulness

# Extra penalty for contradictions, higher partial credit
metric = Faithfulness(
    verdict_scores={
        'FULLY_SUPPORTED': 1.0,
        'PARTIALLY_SUPPORTED': 0.75,  # More generous
        'NO_EVIDENCE': -0.5,          # Moderate penalty
        'CONTRADICTORY': -2.0,        # Severe penalty
    }
)

Code Examples

from axion.metrics import Faithfulness
from axion.dataset import DatasetItem

# Initialize with defaults
metric = Faithfulness()

item = DatasetItem(
    query="What is the infield fly rule in baseball?",
    actual_output="The infield fly rule prevents the defense from intentionally dropping a fly ball to turn a double play.",
    retrieved_content=[
        "The infield fly rule prevents unfair advantage.",
        "Applies with runners on first and second.",
    ],
)

result = await metric.execute(item)
print(result.pretty())


from axion.metrics import Faithfulness

# Zero tolerance for hallucinations
metric = Faithfulness(strict_mode=True)

# Any NO_EVIDENCE claim now scores -1.0 instead of 0.0
# This dramatically lowers scores for responses with uncited claims


from axion.metrics import Faithfulness
from axion.runners import MetricRunner

# Initialize with strict mode
faithfulness = Faithfulness(strict_mode=True)

runner = MetricRunner(metrics=[faithfulness])
results = await runner.run(dataset)

# Access detailed breakdown
for item_result in results:
    print(f"Score: {item_result.score}")
    print(f"Claims analyzed: {item_result.data.total_claims}")
    for claim in item_result.data.judged_claims:
        print(f"  - {claim.verdict}: {claim.text}")

Metric Diagnostics

Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given—no black boxes.

result = await metric.execute(item)
print(result.pretty())      # Human-readable summary
result.signals              # Full diagnostic breakdown
📊 FaithfulnessResult Structure
FaithfulnessResult(
{
    "overall_score": 0.5,
    "total_claims": 2,
    "verdict_counts": {
        "fully_supported": 1,
        "partially_supported": 0,
        "no_evidence": 1,
        "contradictory": 0
    },
    "judged_claims": [
        {
            "claim_text": "The infield fly rule prevents the defense from intentionally dropping a fly ball.",
            "faithfulness_verdict": "Fully Supported",
            "reason": "The evidence states that the infield fly rule prevents the defense from intentionally dropping a catchable fly ball."
        },
        {
            "claim_text": "The infield fly rule is designed to prevent an easy double play when runners are on base.",
            "faithfulness_verdict": "No Evidence",
            "reason": "The evidence does not mention anything about preventing an easy double play when runners are on base."
        }
    ]
}
)

Signal Fields

Field Type Description
overall_score float The 0-1 faithfulness score
total_claims int Total claims extracted from the response
verdict_counts Dict Breakdown by verdict type (fully_supported, partially_supported, no_evidence, contradictory)
judged_claims List Per-claim verdict details

Judged Claim Fields

Field Type Description
claim_text str The extracted claim text
faithfulness_verdict str Verdict: Fully Supported, Partially Supported, No Evidence, or Contradictory
reason str Human-readable explanation for the verdict
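These fields make it straightforward to surface failing claims in your own tooling. The sketch below operates on a plain dict shaped like the FaithfulnessResult example above; the field names follow the tables, while the filtering logic itself is illustrative:

```python
# A result dict mirroring the documented FaithfulnessResult structure
result = {
    "overall_score": 0.5,
    "total_claims": 2,
    "judged_claims": [
        {"claim_text": "The rule prevents intentionally dropping a fly ball.",
         "faithfulness_verdict": "Fully Supported",
         "reason": "Stated directly in the evidence."},
        {"claim_text": "The rule prevents an easy double play.",
         "faithfulness_verdict": "No Evidence",
         "reason": "The evidence does not mention double plays."},
    ],
}

# Surface only the claims that dragged the score down
FAILING = {"No Evidence", "Contradictory"}
flagged = [c for c in result["judged_claims"]
           if c["faithfulness_verdict"] in FAILING]

for claim in flagged:
    print(f'[{claim["faithfulness_verdict"]}] {claim["claim_text"]}: {claim["reason"]}')
```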

Example Scenarios

✅ Scenario 1: Perfect Faithfulness (Score: 1.0)

FULLY_SUPPORTED

Context:

"The Apollo 11 mission launched on July 16, 1969. Neil Armstrong was the mission commander. The lunar module was named Eagle."

AI Response:

"Apollo 11 launched in July 1969 with Neil Armstrong as commander. The lunar module was called Eagle."

Analysis:

Claim Verdict Weight
Apollo 11 launched in July 1969 FULLY_SUPPORTED +1.0
Neil Armstrong was commander FULLY_SUPPORTED +1.0
Lunar module was called Eagle FULLY_SUPPORTED +1.0

Final Score: 3.0 / 3 = 1.0

⚠️ Scenario 2: Partial Support (Score: 0.5)

Mixed Verdicts

Context:

"Our refund policy allows returns within 30 days. Items must be unused and in original packaging."

AI Response:

"You can return items within 30 days if unused. Refunds are processed within 24 hours."

Analysis:

Claim Verdict Weight
Returns within 30 days if unused FULLY_SUPPORTED +1.0
Refunds processed within 24 hours NO_EVIDENCE 0.0

Final Score: 1.0 / 2 = 0.5

In strict mode: (1.0 + (-1.0)) / 2 = 0.0
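The default-versus-strict arithmetic for this scenario can be verified directly. The weights are the documented ones; the helper function is illustrative:

```python
def score(verdicts, weights):
    # Weighted average clamped to [0, 1], per the score formula
    return max(0.0, min(1.0, sum(weights[v] for v in verdicts) / len(verdicts)))

default = {"FULLY_SUPPORTED": 1.0, "PARTIALLY_SUPPORTED": 0.5,
           "NO_EVIDENCE": 0.0, "CONTRADICTORY": -1.0}
strict = {**default, "NO_EVIDENCE": -1.0}  # strict_mode penalizes uncited claims

claims = ["FULLY_SUPPORTED", "NO_EVIDENCE"]  # Scenario 2's two claims
print(score(claims, default))  # 0.5
print(score(claims, strict))   # 0.0
```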

❌ Scenario 3: Contradiction (Score: 0.0)

CONTRADICTORY

Context:

"The maximum dosage is 500mg per day. Do not exceed this limit."

AI Response:

"You can safely take up to 1000mg daily."

Analysis:

Claim Verdict Weight
Safe to take up to 1000mg daily CONTRADICTORY -1.0

Final Score: max(0, -1.0 / 1) = 0.0

Critical: This response could cause patient harm.


Why It Matters

🛡️ Risk Mitigation

Primary guardrail against hallucinations. Protects your brand from legal and reputational liability caused by invented facts.

✅ User Trust

Essential for high-stakes domains (legal, financial, medical) where users must trust the AI is summarizing, not creating.

🔍 Debug Isolation

Distinguishes retrieval errors (wrong docs found) from generation errors (right docs ignored).


Quick Reference

TL;DR

Faithfulness = Does the AI's response stick to the facts in the retrieved documents?

  • Use it when: You need to ensure AI responses don't contain hallucinations
  • Score interpretation: Higher = more grounded in source material
  • Key config: Enable strict_mode for zero-tolerance on uncited claims