Factual Accuracy¶
LLM-Powered · Knowledge · Single Turn
At a Glance¶
- **Score Range:** 0.0 ──────── 1.0 (ratio of supported statements)
- **Default Threshold:** 0.8 (higher bar for accuracy)
- **Required Inputs:** `query`, `actual_output`, `expected_output` (ground truth required)
What It Measures¶
Factual Accuracy calculates the percentage of statements in the AI's response that are factually supported by the ground truth (expected_output). Unlike Faithfulness (which checks against retrieved context), this metric verifies against a known-correct answer.
| Score | Interpretation |
|---|---|
| 1.0 | Every statement matches ground truth |
| 0.8+ | Most statements accurate, minor gaps |
| 0.5 | Half the statements are unsupported |
| < 0.5 | Significant factual errors |
**Use when:**

- You have ground truth answers
- Testing against known-correct responses
- Evaluating factual Q&A systems
- Regression testing AI outputs

**Avoid when:**

- No expected_output is available
- Multiple valid answers exist
- Testing creative/generative tasks
- Ground truth may be incomplete
See Also: Faithfulness
Factual Accuracy verifies against ground truth (expected_output). Faithfulness verifies against retrieved context.
Use Factual Accuracy when you have known-correct answers; use Faithfulness for RAG systems.
How It Works¶
The metric extracts statements from the AI response and checks each against the ground truth.
Step-by-Step Process¶
```mermaid
flowchart TD
    subgraph INPUT["📥 Inputs"]
        A[Query]
        B[AI Response]
        C[Expected Output]
    end
    subgraph EXTRACT["🔍 Step 1: Statement Extraction"]
        D[Extract Statements from Response]
        E["Atomic Statements"]
    end
    subgraph VERIFY["⚖️ Step 2: Ground Truth Check"]
        F[Compare to Expected Output]
        G["Supported / Not Supported"]
    end
    subgraph SCORE["📊 Step 3: Scoring"]
        H["Count Supported"]
        I["Calculate Ratio"]
        J["Final Score"]
    end
    A & B & C --> D
    D --> E
    E --> F
    C --> F
    F --> G
    G --> H
    H --> I
    I --> J
    style INPUT stroke:#1E3A5F,stroke-width:2px
    style EXTRACT stroke:#3b82f6,stroke-width:2px
    style VERIFY stroke:#f59e0b,stroke-width:2px
    style SCORE stroke:#10b981,stroke-width:2px
    style J fill:#1E3A5F,stroke:#0F2440,stroke-width:3px,color:#fff
```
Each statement receives a binary verdict: supported or not supported by the ground truth.

- **Supported:** the statement is factually consistent with the ground truth.
- **Not Supported:** the statement is not found in, or contradicts, the ground truth.
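The binary verdicts roll up into the final score as a simple ratio of supported statements to total statements. A minimal pure-Python sketch of that aggregation (the `StatementVerdict` shape here mirrors the signal fields documented below, but this is a standalone illustration, not the library's implementation):

```python
from dataclasses import dataclass


@dataclass
class StatementVerdict:
    statement: str
    is_supported: int  # 1 = supported, 0 = not supported
    reason: str


def factual_accuracy_score(verdicts: list[StatementVerdict]) -> float:
    """Ratio of supported statements to total statements."""
    if not verdicts:
        return 0.0
    return sum(v.is_supported for v in verdicts) / len(verdicts)


verdicts = [
    StatementVerdict("Paris is the capital of France.", 1, "Confirmed by ground truth."),
    StatementVerdict("It has a population of about 2 million.", 1, "Aligns with ~2.1 million."),
]
print(factual_accuracy_score(verdicts))  # 1.0
```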
Configuration¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `mode` | `EvaluationMode` | `GRANULAR` | Evaluation detail level |
Simple Configuration
Factual Accuracy has minimal configuration—it focuses on binary correctness against ground truth.
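If you want to set the evaluation mode explicitly rather than rely on the default, it can be passed at construction. A sketch, assuming `EvaluationMode` is importable from `axion.metrics` (the exact import path may differ in your installation):

```python
from axion.metrics import FactualAccuracy, EvaluationMode  # import path assumed

# GRANULAR is already the default; shown here for explicitness
metric = FactualAccuracy(mode=EvaluationMode.GRANULAR)
```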
Code Examples¶
```python
from axion.metrics import FactualAccuracy
from axion.dataset import DatasetItem

metric = FactualAccuracy()

item = DatasetItem(
    query="What is the capital of France?",
    actual_output="Paris is the capital of France. It has a population of about 2 million.",
    expected_output="Paris is the capital of France. The city has approximately 2.1 million inhabitants.",
)

result = await metric.execute(item)  # run inside an async context
print(result.pretty())
```
```python
from axion.metrics import FactualAccuracy
from axion.runners import MetricRunner

metric = FactualAccuracy()
runner = MetricRunner(metrics=[metric])
results = await runner.run(dataset)  # run inside an async context

for item_result in results:
    print(f"Score: {item_result.score}")
    for verdict in item_result.signals.verdicts:
        status = "✅" if verdict.is_supported else "❌"
        print(f"  {status} {verdict.statement}")
```
Metric Diagnostics¶
Every evaluation is fully interpretable. Access detailed diagnostic results via result.signals to understand exactly why a score was given—no black boxes.
```python
result = await metric.execute(item)
print(result.pretty())  # Human-readable summary
result.signals          # Full diagnostic breakdown
```
📊 FactualityReport Structure
```python
FactualityReport(
    {
        "verdicts": [
            {
                "statement": "Paris is the capital of France.",
                "is_supported": 1,
                "reason": "The ground truth confirms Paris is the capital of France."
            },
            {
                "statement": "It has a population of about 2 million.",
                "is_supported": 1,
                "reason": "The ground truth states approximately 2.1 million, which aligns with 'about 2 million'."
            }
        ]
    }
)
```
Signal Fields¶
| Field | Type | Description |
|---|---|---|
| `verdicts` | `List[StatementVerdict]` | Per-statement verdicts |
Statement Verdict Fields¶
| Field | Type | Description |
|---|---|---|
| `statement` | `str` | The extracted statement |
| `is_supported` | `int` | 1 = supported, 0 = not supported |
| `reason` | `str` | Explanation for the verdict |
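When a score dips, the quickest diagnostic is to filter the report for failing verdicts. A sketch against plain dicts shaped like the `FactualityReport` structure above (not the library's own types):

```python
def unsupported_statements(verdicts: list[dict]) -> list[str]:
    """Return the statements that failed the ground-truth check, with reasons."""
    return [
        f"{v['statement']} ({v['reason']})"
        for v in verdicts
        if not v["is_supported"]
    ]


report = {
    "verdicts": [
        {"statement": "Written by Christopher Marlowe", "is_supported": 0,
         "reason": "Ground truth attributes the play to Shakespeare."},
        {"statement": "Written in the 1590s", "is_supported": 1,
         "reason": "Matches the ground truth."},
    ]
}

for line in unsupported_statements(report["verdicts"]):
    print(line)
```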
Example Scenarios¶
✅ Scenario 1: Perfect Accuracy (Score: 1.0)
All Statements Supported
Query:
"What year did World War II end?"
Expected Output:
"World War II ended in 1945. Germany surrendered in May, and Japan in September."
AI Response:
"World War II ended in 1945. Germany surrendered in May, Japan in September."
Analysis:
| Statement | Verdict | Score |
|---|---|---|
| World War II ended in 1945 | SUPPORTED | 1 |
| Germany surrendered in May | SUPPORTED | 1 |
| Japan surrendered in September | SUPPORTED | 1 |
Final Score: 3 / 3 = 1.0
⚠️ Scenario 2: Partial Accuracy (Score: 0.33)
Mixed Verdicts
Query:
"What is the speed of light?"
Expected Output:
"The speed of light is approximately 299,792 km/s in a vacuum."
AI Response:
"The speed of light is about 300,000 km/s. It travels slower through water. Light is the fastest thing in the universe."
Analysis:
| Statement | Verdict | Score |
|---|---|---|
| Speed of light is about 300,000 km/s | SUPPORTED | 1 |
| It travels slower through water | NOT SUPPORTED | 0 |
| Light is the fastest thing in the universe | NOT SUPPORTED | 0 |
Final Score: 1 / 3 = 0.33
The ground truth only mentions vacuum speed—other claims are unsupported.
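The scenario's arithmetic, spelled out:

```python
# Supported flags for the three extracted statements in Scenario 2
verdicts = [1, 0, 0]

score = sum(verdicts) / len(verdicts)
print(round(score, 2))  # 0.33
```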
❌ Scenario 3: Poor Accuracy (Score: 0.0)
No Statements Supported
Query:
"Who wrote Romeo and Juliet?"
Expected Output:
"Romeo and Juliet was written by William Shakespeare in the 1590s."
AI Response:
"Romeo and Juliet was written by Christopher Marlowe in 1610."
Analysis:
| Statement | Verdict | Score |
|---|---|---|
| Written by Christopher Marlowe | NOT SUPPORTED | 0 |
| Written in 1610 | NOT SUPPORTED | 0 |
Final Score: 0 / 2 = 0.0
Both claims contradict the ground truth.
Why It Matters¶
When you have known-correct answers, this metric tells you exactly how well your AI matches reality.
Track factual accuracy over time as you update models or prompts. Catch regressions before deployment.
Compare different models or configurations using the same ground truth dataset.
Quick Reference¶
TL;DR
Factual Accuracy = Does the AI's response match the known-correct answer?
- Use it when: You have ground truth (expected_output) to compare against
- Score interpretation: Higher = more statements verified against ground truth
- Key difference: Compares to expected_output, not retrieved_content
- API Reference
- Related Metrics: Faithfulness · Answer Completeness · Answer Relevancy