# Composite Metrics
LLM-powered evaluation metrics for comprehensive AI response analysis
14 Metrics LLM-Powered
Composite metrics use language models to perform nuanced reasoning and analysis. These metrics evaluate complex aspects of AI responses including factual accuracy, relevance, grounding, and style—things that require understanding context, semantics, and intent.
## RAG & Retrieval Metrics
Evaluate the quality of retrieval-augmented generation systems.
## Answer Quality Metrics
Evaluate the quality and correctness of AI-generated answers.
## Style & Safety Metrics
Evaluate tone, citations, and privacy compliance.
## Quick Reference
| Metric | Score Range | Threshold | Key Question |
|---|---|---|---|
| Faithfulness | 0.0 – 1.0 | 0.5 | Are claims grounded in context? |
| Answer Relevancy | 0.0 – 1.0 | 0.5 | Does response address the query? |
| Factual Accuracy | 0.0 – 1.0 | 0.8 | Does it match ground truth? |
| Answer Completeness | 0.0 – 1.0 | 0.5 | Are all expected aspects covered? |
| Answer Criteria | 0.0 – 1.0 | 0.5 | Does it meet custom criteria? |
| Tone & Style | 0.0 – 1.0 | 0.8 | Does it match expected voice? |
| Citation Relevancy | 0.0 – 1.0 | 0.8 | Are citations relevant? |
| PII Leakage | 0.0 – 1.0 | 0.5 | Is output privacy-safe? (1.0 = safe) |
| Contextual Relevancy | 0.0 – 1.0 | 0.5 | Are chunks relevant to query? |
| Contextual Recall | 0.0 – 1.0 | 0.5 | Is expected answer in context? |
| Contextual Precision | 0.0 – 1.0 | 0.5 | Are useful chunks ranked first? |
| Contextual Ranking | 0.0 – 1.0 | 0.5 | Are relevant chunks ranked first? |
| Contextual Sufficiency | 0.0 or 1.0 | 0.5 | Is context sufficient? (binary) |
| Contextual Utilization | 0.0 – 1.0 | 0.5 | Was relevant context used? |
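The thresholds in the table can be applied mechanically to turn raw scores into pass/fail flags. A minimal sketch in plain Python — the threshold values are copied from the table above, and the `evaluate` helper and its example `scores` payload are illustrative, not part of the axion API:

```python
# Default thresholds from the quick-reference table above (subset shown).
THRESHOLDS = {
    "faithfulness": 0.5,
    "answer_relevancy": 0.5,
    "factual_accuracy": 0.8,
    "tone_and_style": 0.8,
    "pii_leakage": 0.5,  # 1.0 means privacy-safe, so higher is better
}

def evaluate(scores: dict[str, float]) -> dict[str, bool]:
    """Mark each metric as passing when its score meets the threshold."""
    return {name: scores[name] >= THRESHOLDS[name] for name in scores}

flags = evaluate({"faithfulness": 0.92, "factual_accuracy": 0.71})
# faithfulness passes (0.92 >= 0.5); factual_accuracy fails (0.71 < 0.8)
```

Note that all metrics here are oriented so that higher is better, which is why a single `>=` comparison suffices — including PII Leakage, where 1.0 means no leakage.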
## Usage Example
```python
from axion.metrics import (
    Faithfulness,
    AnswerRelevancy,
    ContextualPrecision,
)
from axion.runners import MetricRunner
from axion.dataset import Dataset

# Initialize metrics
metrics = [
    Faithfulness(strict_mode=True),
    AnswerRelevancy(),
    ContextualPrecision(),
]

# Run evaluation (inside an async context; `dataset` is a Dataset instance)
runner = MetricRunner(metrics=metrics)
results = await runner.run(dataset)

# Analyze results
for item in results:
    print(f"Faithfulness: {item.scores['faithfulness']:.2f}")
    print(f"Relevancy: {item.scores['answer_relevancy']:.2f}")
    print(f"Precision: {item.scores['contextual_precision']:.2f}")
```
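Beyond the per-item scores printed above, it is often useful to summarize a whole run. A sketch in plain Python — it assumes only that each result exposes a metric-name-to-score mapping, here modeled as plain dicts rather than axion result objects:

```python
from statistics import mean

def average_scores(results: list[dict[str, float]]) -> dict[str, float]:
    """Average each metric's score across all result items."""
    totals: dict[str, list[float]] = {}
    for scores in results:
        for name, value in scores.items():
            totals.setdefault(name, []).append(value)
    return {name: mean(values) for name, values in totals.items()}

# Example with two hypothetical result payloads:
summary = average_scores([
    {"faithfulness": 0.5, "answer_relevancy": 0.25},
    {"faithfulness": 1.0, "answer_relevancy": 0.75},
])
# summary == {"faithfulness": 0.75, "answer_relevancy": 0.5}
```

Dataset-level averages like these are what you would track across model or prompt versions to catch regressions.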
## Choosing the Right Metrics
**Evaluation Strategy**
For RAG Systems:
- Start with Faithfulness (hallucination detection)
- Add Contextual Relevancy (retrieval quality)
- Use Contextual Precision/Ranking (ranking quality)
For Q&A Systems:
- Use Answer Relevancy (topical alignment)
- Add Factual Accuracy if you have ground truth
- Add Answer Completeness for comprehensive responses
For Customer Service:
- Use Tone & Style Consistency (brand voice)
- Add Answer Criteria (policy compliance)
- Include PII Leakage (privacy protection)
For Research Assistants:
- Use Citation Relevancy (source quality)
- Add Faithfulness (grounding)
- Include Answer Completeness (thoroughness)