
Composite Metrics

LLM-powered evaluation metrics for comprehensive AI response analysis
14 metrics · LLM-powered

Composite metrics use language models to perform nuanced reasoning and analysis. These metrics evaluate complex aspects of AI responses including factual accuracy, relevance, grounding, and style—things that require understanding context, semantics, and intent.


RAG & Retrieval Metrics

Evaluate the quality of retrieval-augmented generation systems.

Faithfulness

Verify claims against retrieved context

Requires: query, actual_output, retrieved_content

Contextual Relevancy

Check if retrieved chunks are relevant

Requires: query, retrieved_content

Contextual Recall

Check if context supports expected answer

Requires: expected_output, retrieved_content

Contextual Precision

Measure useful chunk ranking (MAP)

Requires: query, expected_output, retrieved_content

Contextual Ranking

Check if relevant chunks rank higher

Requires: query, retrieved_content

Contextual Sufficiency

Binary check for enough context

Requires: query, retrieved_content

Contextual Utilization

Measure context usage efficiency

Requires: query, actual_output, retrieved_content
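Contextual Precision above is described as mean average precision (MAP) over the ranked retrieved chunks. As an illustrative sketch only (not the library's implementation), average precision for a single ranked retrieval can be computed like this:

```python
def average_precision(relevance: list[bool]) -> float:
    """Average precision over a ranked list of chunk relevance judgments.

    Rewards rankings that place useful chunks first: each relevant chunk
    contributes the precision at its rank, averaged over all relevant chunks.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0

# A retrieval where the two useful chunks landed at ranks 1 and 3:
score = average_precision([True, False, True, False])  # (1.0 + 2/3) / 2
```

Averaging this quantity across a dataset yields MAP; a perfect ranking (all relevant chunks first) scores 1.0.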

Answer Quality Metrics

Evaluate the quality and correctness of AI-generated answers.

Answer Relevancy

Check if response addresses the query

Requires: query, actual_output

Factual Accuracy

Verify against ground truth

Requires: query, actual_output, expected_output

Answer Completeness

Check coverage of expected content

Requires: query, actual_output, expected_output

Answer Criteria

Evaluate against custom criteria

Requires: query, actual_output, plus acceptance_criteria

Style & Safety Metrics

Evaluate tone, citations, and privacy compliance.

Tone & Style Consistency

Match expected voice and formatting

Requires: actual_output, expected_output

Citation Relevancy

Validate citation quality

Requires: query, actual_output

PII Leakage

Detect privacy violations

Requires: query, actual_output
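The PII Leakage metric uses an LLM judge; as a rough intuition for the kind of leak it flags, here is a rule-based sketch (pattern names and coverage are illustrative only, and real PII detection needs far more than two regexes):

```python
import re

# Two common PII patterns: email addresses and US-style phone numbers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def pii_leakage_score(text: str) -> float:
    """Score output privacy: 1.0 = safe (no PII found), 0.0 = leak detected.

    Mirrors the metric's convention that higher is safer.
    """
    return 0.0 if (EMAIL.search(text) or PHONE.search(text)) else 1.0
```

An LLM judge additionally catches context-dependent leaks (names, addresses, medical details) that no fixed pattern list can enumerate.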

Quick Reference

| Metric | Score Range | Threshold | Key Question |
| --- | --- | --- | --- |
| Faithfulness | 0.0 – 1.0 | 0.5 | Are claims grounded in context? |
| Answer Relevancy | 0.0 – 1.0 | 0.5 | Does response address the query? |
| Factual Accuracy | 0.0 – 1.0 | 0.8 | Does it match ground truth? |
| Answer Completeness | 0.0 – 1.0 | 0.5 | Are all expected aspects covered? |
| Answer Criteria | 0.0 – 1.0 | 0.5 | Does it meet custom criteria? |
| Tone & Style | 0.0 – 1.0 | 0.8 | Does it match expected voice? |
| Citation Relevancy | 0.0 – 1.0 | 0.8 | Are citations relevant? |
| PII Leakage | 0.0 – 1.0 | 0.5 | Is output privacy-safe? (1.0 = safe) |
| Contextual Relevancy | 0.0 – 1.0 | 0.5 | Are chunks relevant to query? |
| Contextual Recall | 0.0 – 1.0 | 0.5 | Is expected answer in context? |
| Contextual Precision | 0.0 – 1.0 | 0.5 | Are useful chunks ranked first? |
| Contextual Ranking | 0.0 – 1.0 | 0.5 | Are relevant chunks ranked first? |
| Contextual Sufficiency | 0.0 or 1.0 | 0.5 | Is context sufficient? (binary) |
| Contextual Utilization | 0.0 – 1.0 | 0.5 | Was relevant context used? |
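A score passes when it meets or exceeds the metric's threshold. As a hypothetical helper (the function and constant names are illustrative, not part of the axion API), applying the non-default thresholds from the table looks like this:

```python
# Metrics whose default threshold differs from 0.5, per the table above.
STRICT_THRESHOLDS = {
    "factual_accuracy": 0.8,
    "tone_style": 0.8,
    "citation_relevancy": 0.8,
}

def passes(scores: dict[str, float],
           thresholds: dict[str, float] = STRICT_THRESHOLDS) -> dict[str, bool]:
    """Mark each metric pass/fail; metrics not listed use the 0.5 default."""
    return {name: score >= thresholds.get(name, 0.5)
            for name, score in scores.items()}

result = passes({"faithfulness": 0.62, "factual_accuracy": 0.75})
# faithfulness passes (0.62 >= 0.5); factual_accuracy fails (0.75 < 0.8)
```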

Usage Example

import asyncio

from axion.metrics import (
    Faithfulness,
    AnswerRelevancy,
    ContextualPrecision,
)
from axion.runners import MetricRunner
from axion.dataset import Dataset

# Initialize metrics
metrics = [
    Faithfulness(strict_mode=True),
    AnswerRelevancy(),
    ContextualPrecision(),
]

async def evaluate(dataset: Dataset) -> None:
    # Run evaluation (runner.run is a coroutine, so await it in async code)
    runner = MetricRunner(metrics=metrics)
    results = await runner.run(dataset)

    # Analyze results
    for item in results:
        print(f"Faithfulness: {item.scores['faithfulness']:.2f}")
        print(f"Relevancy: {item.scores['answer_relevancy']:.2f}")
        print(f"Precision: {item.scores['contextual_precision']:.2f}")

# Build your Dataset of test cases, then run the evaluation:
# asyncio.run(evaluate(dataset))

Choosing the Right Metrics

Evaluation Strategy

For RAG Systems:

  • Start with Faithfulness (hallucination detection)
  • Add Contextual Relevancy (retrieval quality)
  • Use Contextual Precision/Ranking (ranking quality)

For Q&A Systems:

  • Use Answer Relevancy (topical alignment)
  • Add Factual Accuracy if you have ground truth
  • Add Answer Completeness for comprehensive responses

For Customer Service:

  • Use Tone & Style Consistency (brand voice)
  • Add Answer Criteria (policy compliance)
  • Include PII Leakage (privacy protection)

For Research Assistants:

  • Use Citation Relevancy (source quality)
  • Add Faithfulness (grounding)
  • Include Answer Completeness (thoroughness)
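The recommendations above can be kept as starting-point suites in code (the metric names below mirror the bullets; treat them as illustrative and adjust to your system):

```python
# Starting-point metric suites per scenario, from the guidance above.
SUITES: dict[str, list[str]] = {
    "rag": ["Faithfulness", "ContextualRelevancy",
            "ContextualPrecision", "ContextualRanking"],
    "qa": ["AnswerRelevancy", "FactualAccuracy", "AnswerCompleteness"],
    "customer_service": ["ToneStyleConsistency", "AnswerCriteria", "PIILeakage"],
    "research": ["CitationRelevancy", "Faithfulness", "AnswerCompleteness"],
}

def suite_for(scenario: str) -> list[str]:
    """Look up a suggested suite, defaulting to the Q&A basics."""
    return SUITES.get(scenario, SUITES["qa"])
```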