
Factual Accuracy

Verify AI responses against ground truth statements
LLM-Powered · Knowledge · Single Turn

At a Glance

🎯 Score Range: 0.0 to 1.0 (ratio of supported statements)
Default Threshold: 0.8 (higher bar for accuracy)
📋 Required Inputs: query, actual_output, expected_output (ground truth required)

What It Measures

Factual Accuracy calculates the percentage of statements in the AI's response that are factually supported by the ground truth (expected_output). Unlike Faithfulness (which checks against retrieved context), this metric verifies against a known-correct answer.

Score Interpretation
  • 1.0: every statement matches ground truth
  • 0.8+: most statements accurate, minor gaps
  • 0.5: half the statements are unsupported
  • < 0.5: significant factual errors
✅ Use When
  • You have ground truth answers
  • Testing against known-correct responses
  • Evaluating factual Q&A systems
  • Regression testing AI outputs
❌ Don't Use When
  • No expected_output available
  • Multiple valid answers exist
  • Testing creative/generative tasks
  • Ground truth may be incomplete

See Also: Faithfulness

Factual Accuracy verifies against ground truth (expected_output). Faithfulness verifies against retrieved context.

Use Factual Accuracy when you have known-correct answers; use Faithfulness for RAG systems.


How It Works

The metric extracts statements from the AI response and checks each against the ground truth.

Step-by-Step Process

flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[AI Response]
        C[Expected Output]
    end

    subgraph EXTRACT["🔍 Step 1: Statement Extraction"]
        D[Extract Statements from Response]
        E["Atomic Statements"]
    end

    subgraph VERIFY["⚖️ Step 2: Ground Truth Check"]
        F[Compare to Expected Output]
        G["Supported / Not Supported"]
    end

    subgraph SCORE["📊 Step 3: Scoring"]
        H["Count Supported"]
        I["Calculate Ratio"]
        J["Final Score"]
    end

    A & B & C --> D
    D --> E
    E --> F
    C --> F
    F --> G
    G --> H
    H --> I
    I --> J

    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EXTRACT stroke:#3b82f6,stroke-width:2px
    style VERIFY stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style J fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff

Each statement receives a binary verdict—either supported or not supported by the ground truth.

✅ SUPPORTED (1): the statement is factually consistent with the ground truth.

❌ NOT SUPPORTED (0): the statement is not found in, or contradicts, the ground truth.

Score Formula

score = supported_statements / total_statements
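The scoring step reduces to this ratio over the binary verdicts. A minimal pure-Python sketch of the arithmetic (the helper name and the empty-verdict behavior are illustrative assumptions, not part of the library):

```python
def factual_accuracy_score(verdicts: list[int]) -> float:
    """Ratio of supported statements.

    Each verdict is 1 (supported) or 0 (not supported).
    Returning 0.0 for an empty verdict list is an assumption here;
    the library may handle that edge case differently.
    """
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# Scenario 2 in this document: 1 supported statement out of 3
print(factual_accuracy_score([1, 0, 0]))  # prints 0.3333333333333333
```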

Configuration

Parameter | Type           | Default  | Description
mode      | EvaluationMode | GRANULAR | Evaluation detail level

Simple Configuration

Factual Accuracy has minimal configuration—it focuses on binary correctness against ground truth.
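If you do want to set the mode explicitly, construction might look like the following sketch. The `mode` parameter and the `GRANULAR` default come from the table above; the import path for `EvaluationMode` is an assumption, since this page does not document it:

```python
from axion.metrics import FactualAccuracy
# Assumption: EvaluationMode's import path is not shown on this page;
# adjust to wherever your installation exposes it.
from axion.metrics import EvaluationMode

# GRANULAR is already the default; it is passed explicitly here for illustration.
metric = FactualAccuracy(mode=EvaluationMode.GRANULAR)
```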


Code Examples

Single Item

from axion.metrics import FactualAccuracy
from axion.dataset import DatasetItem

metric = FactualAccuracy()

item = DatasetItem(
    query="What is the capital of France?",
    actual_output="Paris is the capital of France. It has a population of about 2 million.",
    expected_output="Paris is the capital of France. The city has approximately 2.1 million inhabitants.",
)

result = await metric.execute(item)
print(result.pretty())

Batch Evaluation

from axion.metrics import FactualAccuracy
from axion.runners import MetricRunner

metric = FactualAccuracy()
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)

for item_result in results:
    print(f"Score: {item_result.score}")
    for verdict in item_result.signals.verdicts:
        status = "✅" if verdict.is_supported else "❌"
        print(f"  {status} {verdict.statement}")

Metric Diagnostics

Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given—no black boxes.

result = await metric.execute(item)
print(result.pretty())      # Human-readable summary
result.signals              # Full diagnostic breakdown
📊 FactualityReport Structure
FactualityReport(
{
    "verdicts": [
        {
            "statement": "Paris is the capital of France.",
            "is_supported": 1,
            "reason": "The ground truth confirms Paris is the capital of France."
        },
        {
            "statement": "It has a population of about 2 million.",
            "is_supported": 1,
            "reason": "The ground truth states approximately 2.1 million, which aligns with 'about 2 million'."
        }
    ]
}
)

Signal Fields

Field    | Type                   | Description
verdicts | List[StatementVerdict] | Per-statement verdicts

Statement Verdict Fields

Field        | Type | Description
statement    | str  | The extracted statement
is_supported | int  | 1 = supported, 0 = not supported
reason       | str  | Explanation for the verdict
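To make the signal schema concrete, here is a pure-Python sketch mirroring the field tables above, including how a score could be recomputed from the verdicts. The dataclasses are illustrative stand-ins, not the library's actual classes:

```python
from dataclasses import dataclass, field

@dataclass
class StatementVerdict:
    statement: str     # the extracted statement
    is_supported: int  # 1 = supported, 0 = not supported
    reason: str        # explanation for the verdict

@dataclass
class FactualityReport:
    verdicts: list[StatementVerdict] = field(default_factory=list)

    def score(self) -> float:
        # Recompute the metric score as the ratio of supported verdicts.
        if not self.verdicts:
            return 0.0
        return sum(v.is_supported for v in self.verdicts) / len(self.verdicts)

report = FactualityReport(verdicts=[
    StatementVerdict("Paris is the capital of France.", 1,
                     "Confirmed by the ground truth."),
    StatementVerdict("Paris was founded in 1800.", 0,
                     "Not found in the ground truth."),
])
print(report.score())  # prints 0.5
```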

Example Scenarios

✅ Scenario 1: Perfect Accuracy (Score: 1.0)

All Statements Supported

Query:

"What year did World War II end?"

Expected Output:

"World War II ended in 1945. Germany surrendered in May, and Japan in September."

AI Response:

"World War II ended in 1945. Germany surrendered in May, Japan in September."

Analysis:

Statement                      | Verdict   | Score
World War II ended in 1945     | SUPPORTED | 1
Germany surrendered in May     | SUPPORTED | 1
Japan surrendered in September | SUPPORTED | 1

Final Score: 3 / 3 = 1.0

⚠️ Scenario 2: Partial Accuracy (Score: 0.33)

Mixed Verdicts

Query:

"What is the speed of light?"

Expected Output:

"The speed of light is approximately 299,792 km/s in a vacuum."

AI Response:

"The speed of light is about 300,000 km/s. It travels slower through water. Light is the fastest thing in the universe."

Analysis:

Statement                                  | Verdict       | Score
Speed of light is about 300,000 km/s       | SUPPORTED     | 1
It travels slower through water            | NOT SUPPORTED | 0
Light is the fastest thing in the universe | NOT SUPPORTED | 0

Final Score: 1 / 3 = 0.33

The ground truth mentions only the vacuum speed, so the other claims count as unsupported even though they are true in reality.

❌ Scenario 3: Poor Accuracy (Score: 0.0)

No Statements Supported

Query:

"Who wrote Romeo and Juliet?"

Expected Output:

"Romeo and Juliet was written by William Shakespeare in the 1590s."

AI Response:

"Romeo and Juliet was written by Christopher Marlowe in 1610."

Analysis:

Statement                      | Verdict       | Score
Written by Christopher Marlowe | NOT SUPPORTED | 0
Written in 1610                | NOT SUPPORTED | 0

Final Score: 0 / 2 = 0.0

Both claims contradict the ground truth.


Why It Matters

🎯 Ground Truth Validation

When you have known-correct answers, this metric tells you exactly how well your AI matches reality.

🧪 Regression Testing

Track factual accuracy over time as you update models or prompts. Catch regressions before deployment.

📊 Benchmark Evaluation

Compare different models or configurations using the same ground truth dataset.


Quick Reference

TL;DR

Factual Accuracy = Does the AI's response match the known-correct answer?

  • Use it when: You have ground truth (expected_output) to compare against
  • Score interpretation: Higher = more statements verified against ground truth
  • Key difference: Compares to expected_output, not retrieved_content