
Factual Accuracy

Verify AI responses against ground truth statements
LLM-Powered · Knowledge · Single Turn

At a Glance

🎯 Score Range: 0.0 to 1.0 (ratio of supported statements)
Default Threshold: 0.8 (higher bar for accuracy)
📋 Required Inputs: query, actual_output, expected_output (ground truth required)

What It Measures

Factual Accuracy calculates the percentage of statements in the AI's response that are factually supported by the ground truth (expected_output). Unlike Faithfulness (which checks against retrieved context), this metric verifies against a known-correct answer.

Score Interpretation
  • 1.0: every statement matches ground truth
  • 0.8+: most statements accurate, minor gaps
  • 0.5: half the statements are unsupported
  • < 0.5: significant factual errors
✅ Use When
  • You have ground truth answers
  • Testing against known-correct responses
  • Evaluating factual Q&A systems
  • Regression testing AI outputs
❌ Don't Use When
  • No expected_output available
  • Multiple valid answers exist
  • Testing creative/generative tasks
  • Ground truth may be incomplete

See Also: Faithfulness

Factual Accuracy verifies against ground truth (expected_output). Faithfulness verifies against retrieved context.

Use Factual Accuracy when you have known-correct answers; use Faithfulness for RAG systems.


How It Works

The metric extracts statements from the AI response and checks each against the ground truth.

Step-by-Step Process

flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[AI Response]
        C[Expected Output]
    end

    subgraph EXTRACT["🔍 Step 1: Statement Extraction"]
        D[Extract Statements from Response]
        E["Atomic Statements"]
    end

    subgraph VERIFY["⚖️ Step 2: Ground Truth Check"]
        F[Compare to Expected Output]
        G["Supported / Not Supported"]
    end

    subgraph SCORE["📊 Step 3: Scoring"]
        H["Count Supported"]
        I["Calculate Ratio"]
        J["Final Score"]
    end

    A & B & C --> D
    D --> E
    E --> F
    C --> F
    F --> G
    G --> H
    H --> I
    I --> J

    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EXTRACT stroke:#3b82f6,stroke-width:2px
    style VERIFY stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style J fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff

Each statement receives a binary verdict—either supported or not supported by the ground truth.

✅ SUPPORTED (1): the statement is factually consistent with the ground truth.

❌ NOT SUPPORTED (0): the statement is not found in, or contradicts, the ground truth.

Score Formula

score = supported_statements / total_statements
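The scoring step reduces to this ratio over the binary verdicts. A minimal pure-Python sketch of the arithmetic (the helper name and the empty-verdict behavior are illustrative assumptions, not part of the library):

```python
def factual_accuracy_score(verdicts: list[int]) -> float:
    """Ratio of supported statements.

    Each verdict is 1 (supported) or 0 (not supported).
    Returning 0.0 for an empty verdict list is an assumption here;
    the library may handle that edge case differently.
    """
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# Scenario 2 in this document: 1 supported statement out of 3
print(factual_accuracy_score([1, 0, 0]))  # prints 0.3333333333333333
```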

Configuration

Parameter | Type           | Default  | Description
mode      | EvaluationMode | GRANULAR | Evaluation detail level

Simple Configuration

Factual Accuracy has minimal configuration—it focuses on binary correctness against ground truth.
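If you do want to set the mode explicitly, construction might look like the following sketch. The `mode` parameter and the `GRANULAR` default come from the table above; the import path for `EvaluationMode` is an assumption, since this page does not document it:

```python
from axion.metrics import FactualAccuracy
# Assumption: EvaluationMode's import path is not shown on this page;
# adjust to wherever your installation exposes it.
from axion.metrics import EvaluationMode

# GRANULAR is already the default; it is passed explicitly here for illustration.
metric = FactualAccuracy(mode=EvaluationMode.GRANULAR)
```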


Code Examples

Single Item

from axion.metrics import FactualAccuracy
from axion.dataset import DatasetItem

metric = FactualAccuracy()

item = DatasetItem(
    query="What is the capital of France?",
    actual_output="Paris is the capital of France. It has a population of about 2 million.",
    expected_output="Paris is the capital of France. The city has approximately 2.1 million inhabitants.",
)

result = await metric.execute(item)
print(result.pretty())

Batch Evaluation

from axion.metrics import FactualAccuracy
from axion.runners import MetricRunner

metric = FactualAccuracy()
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)

for item_result in results:
    print(f"Score: {item_result.score}")
    for verdict in item_result.signals.verdicts:
        status = "✅" if verdict.is_supported else "❌"
        print(f"  {status} {verdict.statement}")

Metric Diagnostics

Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given—no black boxes.

result = await metric.execute(item)
print(result.pretty())      # Human-readable summary
result.signals              # Full diagnostic breakdown
📊 FactualityReport Structure
FactualityReport(
{
    "verdicts": [
        {
            "statement": "Paris is the capital of France.",
            "is_supported": 1,
            "reason": "The ground truth confirms Paris is the capital of France."
        },
        {
            "statement": "It has a population of about 2 million.",
            "is_supported": 1,
            "reason": "The ground truth states approximately 2.1 million, which aligns with 'about 2 million'."
        }
    ]
}
)

Signal Fields

Field    | Type                   | Description
verdicts | List[StatementVerdict] | Per-statement verdicts

Statement Verdict Fields

Field        | Type | Description
statement    | str  | The extracted statement
is_supported | int  | 1 = supported, 0 = not supported
reason       | str  | Explanation for the verdict
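To make the signal schema concrete, here is a pure-Python sketch mirroring the field tables above, including how a score could be recomputed from the verdicts. The dataclasses are illustrative stand-ins, not the library's actual classes:

```python
from dataclasses import dataclass, field

@dataclass
class StatementVerdict:
    statement: str     # the extracted statement
    is_supported: int  # 1 = supported, 0 = not supported
    reason: str        # explanation for the verdict

@dataclass
class FactualityReport:
    verdicts: list[StatementVerdict] = field(default_factory=list)

    def score(self) -> float:
        # Recompute the metric score as the ratio of supported verdicts.
        if not self.verdicts:
            return 0.0
        return sum(v.is_supported for v in self.verdicts) / len(self.verdicts)

report = FactualityReport(verdicts=[
    StatementVerdict("Paris is the capital of France.", 1,
                     "Confirmed by the ground truth."),
    StatementVerdict("Paris was founded in 1800.", 0,
                     "Not found in the ground truth."),
])
print(report.score())  # prints 0.5
```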

Example Scenarios

✅ Scenario 1: Perfect Accuracy (Score: 1.0)

All Statements Supported

Query:

"What year did World War II end?"

Expected Output:

"World War II ended in 1945. Germany surrendered in May, and Japan in September."

AI Response:

"World War II ended in 1945. Germany surrendered in May, Japan in September."

Analysis:

Statement                      | Verdict   | Score
World War II ended in 1945     | SUPPORTED | 1
Germany surrendered in May     | SUPPORTED | 1
Japan surrendered in September | SUPPORTED | 1

Final Score: 3 / 3 = 1.0

⚠️ Scenario 2: Partial Accuracy (Score: 0.33)

Mixed Verdicts

Query:

"What is the speed of light?"

Expected Output:

"The speed of light is approximately 299,792 km/s in a vacuum."

AI Response:

"The speed of light is about 300,000 km/s. It travels slower through water. Light is the fastest thing in the universe."

Analysis:

Statement                                  | Verdict       | Score
Speed of light is about 300,000 km/s       | SUPPORTED     | 1
It travels slower through water            | NOT SUPPORTED | 0
Light is the fastest thing in the universe | NOT SUPPORTED | 0

Final Score: 1 / 3 = 0.33

The ground truth mentions only the vacuum speed, so the other claims count as unsupported even though they are true in reality.

❌ Scenario 3: Poor Accuracy (Score: 0.0)

No Statements Supported

Query:

"Who wrote Romeo and Juliet?"

Expected Output:

"Romeo and Juliet was written by William Shakespeare in the 1590s."

AI Response:

"Romeo and Juliet was written by Christopher Marlowe in 1610."

Analysis:

Statement                      | Verdict       | Score
Written by Christopher Marlowe | NOT SUPPORTED | 0
Written in 1610                | NOT SUPPORTED | 0

Final Score: 0 / 2 = 0.0

Both claims contradict the ground truth.


Why It Matters

🎯 Ground Truth Validation

When you have known-correct answers, this metric tells you exactly how well your AI matches reality.

🧪 Regression Testing

Track factual accuracy over time as you update models or prompts. Catch regressions before deployment.

📊 Benchmark Evaluation

Compare different models or configurations using the same ground truth dataset.


Quick Reference

TL;DR

Factual Accuracy = Does the AI's response match the known-correct answer?

  • Use it when: You have ground truth (expected_output) to compare against
  • Score interpretation: Higher = more statements verified against ground truth
  • Key difference: Compares to expected_output, not retrieved_content