
Metrics API Reference

Evaluation metrics for AI agents — composable, LLM-powered and heuristic scoring.

from axion import metric_registry
from axion.metrics.base import BaseMetric, MetricEvaluationResult, metric
from axion.metrics import (
    Faithfulness, AnswerRelevancy, FactualAccuracy,
    AnswerCompleteness, AnswerCriteria,
    ContextualRelevancy, ContextualPrecision, ContextualRecall,
    ExactStringMatch, CitationPresence, Latency,
    HitRateAtK, MeanReciprocalRank,
    GoalCompletion, ConversationFlow,
)

BaseMetric

Base class for all metrics. Provides LLM integration, field validation, structured I/O, and the execute() contract.


MetricRegistry

Global registry for storing, retrieving, and discovering metric classes by key, tag, or compatible fields.


@metric

Decorator that attaches config (name, fields, threshold, tags) and auto-registers the class.

30+ Built-in Metrics

Composite (LLM-judged), heuristic, retrieval, and conversational metrics ready to use out of the box.


BaseMetric

axion.metrics.base.BaseMetric

BaseMetric(model_name: Optional[str] = None, llm: Optional[LLMRunnable] = None, embed_model_name: Optional[str] = None, embed_model: Optional[EmbeddingRunnable] = None, threshold: float = None, llm_provider: Optional[str] = None, required_fields: Optional[List[str]] = None, optional_fields: Optional[List[str]] = None, metric_name: Optional[str] = None, metric_description: Optional[str] = None, name: Optional[str] = None, field_mapping: Optional[Dict[str, str]] = None, metric_category: Optional[MetricCategory] = None, **kwargs: Any)

Bases: LLMHandler, Generic[InputModel, OutputModel]

Base class for all metric evaluation classes, inheriting from LLMHandler.

Initialize the metric with optional LLM and embedding model.

Parameters:

  • model_name (Optional[str], default: None ) –

    Name of the LLM model to use

  • llm (Optional[LLMRunnable], default: None ) –

    A pre-configured LLM model. If not provided, a default is loaded from the registry.

  • embed_model_name (Optional[str], default: None ) –

    Name of the embedding model to use

  • embed_model (Optional[EmbeddingRunnable], default: None ) –

    A pre-configured embedding model handler (if needed).

  • threshold (float, default: None ) –

    The threshold to consider a score as 'passing'. Will overwrite default.

  • llm_provider (Optional[str], default: None ) –

    The LLM provider to use

  • required_fields (Optional[List[str]], default: None ) –

    List of required field names for evaluation

  • optional_fields (Optional[List[str]], default: None ) –

    List of optional field names for evaluation

  • metric_name (Optional[str], default: None ) –

    Optional name for the metric instance (alias: name)

  • metric_description (Optional[str], default: None ) –

    Optional description for the metric instance

  • name (Optional[str], default: None ) –

    Alias for metric_name (for convenience)

  • field_mapping (Optional[Dict[str, str]], default: None ) –

    Optional mapping from canonical field names to source paths. e.g., {'actual_output': 'additional_output.summary'} will resolve 'actual_output' from item.additional_output['summary'].

  • metric_category (Optional[MetricCategory], default: None ) –

    The category of metric output (SCORE, ANALYSIS, CLASSIFICATION). If not provided, falls back to class config or defaults to SCORE.

  • **kwargs (Any, default: {} ) –

    Additional keyword arguments passed to the parent LLMHandler (e.g., logger config).

name property

name: str

Return the name of the metric from instance, config, or fallback to class name.

description property

description: str

Return the description of the metric from instance, config, or fallback to class name.

threshold property

threshold

Metric passing threshold.

input_item property

input_item

Access the final DatasetItem passed to the metric.

required_fields property writable

required_fields: list

Returns the required fields for evaluation.

Falls back to configuration if instance-level fields are not explicitly set.

optional_fields property writable

optional_fields: list

Returns the optional fields for evaluation.

Falls back to configuration if instance-level fields are not explicitly set.

metric_category property writable

metric_category: MetricCategory

Returns the metric category for this metric.

Falls back to configuration if instance-level value is not explicitly set. Defaults to MetricCategory.SCORE if not defined anywhere.

execute async

execute(item: Union[DatasetItem, dict], callbacks: Callbacks = None, **kwargs) -> MetricEvaluationResult

Execute the metric evaluation for a single dataset item.

Parameters:

  • item (Union[DatasetItem, dict]) –

    Input dataset item containing necessary fields for evaluation.

  • callbacks (Callbacks, default: None ) –

    Optional callback handler for events/logging.

Returns:

  • MetricEvaluationResult

    An evaluation result conforming to the output model.

get_field

get_field(item: DatasetItem, field_name: str, default: Any = None) -> Any

Resolve a field value from DatasetItem, respecting field_mapping overrides.

If a mapping is defined for the given field_name, this method resolves the value from the mapped source path. Otherwise, it returns the attribute directly from the item.

Parameters:

  • item (DatasetItem) –

    The DatasetItem to extract from

  • field_name (str) –

    Canonical field name (e.g., 'actual_output')

  • default (Any, default: None ) –

    Value to return if field is not found

Returns:

  • Any

    The resolved field value

Example

With field_mapping={'actual_output': 'additional_output.summary'}

value = self.get_field(item, 'actual_output') # Gets item.additional_output['summary']
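The dotted-path lookup behind field_mapping can be sketched in plain Python. This is an illustration of the mechanism only, not the library's implementation; `resolve_path` and `Item` are hypothetical names:

```python
def resolve_path(obj, path, default=None):
    """Walk a dotted path, trying attribute access then dict lookup at each step."""
    current = obj
    for part in path.split('.'):
        if isinstance(current, dict):
            current = current.get(part, default)
        else:
            current = getattr(current, part, default)
        if current is default:
            return default
    return current

class Item:
    # Stand-in for a DatasetItem with an additional_output dict.
    additional_output = {'summary': 'Paris is the capital of France.'}

value = resolve_path(Item(), 'additional_output.summary')
```

With field_mapping={'actual_output': 'additional_output.summary'}, get_field('actual_output') would follow the same shape of traversal.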

get_mapped_fields

get_mapped_fields(item: DatasetItem) -> Dict[str, Any]

Return all required and optional fields with resolved values.

This convenience method resolves all configured fields (both required and optional) from the DatasetItem, applying any field mappings.

Parameters:

  • item (DatasetItem) –

    The DatasetItem to extract fields from

Returns:

  • Dict[str, Any]

    Dictionary mapping field names to their resolved values

get_evaluation_fields

get_evaluation_fields(item: Union[DatasetItem, dict]) -> Union[DatasetItem, InputModel]

Extracts the appropriate evaluation fields from the dataset item.

Priority is given to explicitly set required and optional fields on the instance. If not defined, configuration-based fields are used. If none are available, the item's default evaluation fields are returned.

Parameters:

  • item (Union[DatasetItem, dict]) –

    The dataset item to extract evaluation fields from.

Returns:

  • Union[DatasetItem, InputModel]

    A dataset item containing only the relevant fields for evaluation.

set_instruction

set_instruction(instruction: str)

Set a new instruction string for the metric.

Parameters:

  • instruction (str) –

    The updated task instruction that guides the metric’s behavior or LLM prompt.

set_examples

set_examples(examples: List[Tuple[DatasetItem, MetricEvaluationResult]])

Replace all current examples with a new set.

Parameters:

  • examples (List[Tuple[DatasetItem, MetricEvaluationResult]]) –

    A list of example input-output pairs used for few-shot prompting or metric calibration.

Example

[
    (
        DatasetItem(
            expected_output='...',
            actual_output='...',
        ),
        MetricEvaluationResult(
            score=...,
            explanation='...',
        ),
    ),
]

add_examples

add_examples(examples: List[Tuple[DatasetItem, MetricEvaluationResult]])

Add new example input-output pairs to the existing list of examples.

Parameters:

  • examples (List[Tuple[DatasetItem, MetricEvaluationResult]]) –

    One or more examples to add to the current list, extending few-shot prompting context.

Example

[
    (
        DatasetItem(
            expected_output='...',
            actual_output='...',
        ),
        MetricEvaluationResult(
            score=...,
            explanation='...',
        ),
    ),
]

compute_cost_estimate

compute_cost_estimate(sub_models: List[BaseMetric])

Computes the total estimated cost from sub-models for this execution.

Parameters:

  • sub_models (List[BaseMetric]) –

    List of sub-models that may have a cost_estimate.

display_prompt

display_prompt(item: Union[dict, InputModel] = None, **kwargs)

Displays the fully constructed prompt that will be sent to the LLM.

Parameters:

  • item (Union[dict, InputModel], default: None ) –

    The input data to be included in the prompt. If None, a placeholder is used. Defaults to None.

get_sub_metrics

get_sub_metrics(result: MetricEvaluationResult) -> List[SubMetricResult]

Override to define how results explode into sub-metrics.

This method is called when is_multi_metric=True to extract individual sub-metric scores from a single evaluation result. The default implementation returns an empty list, meaning no explosion occurs.

Parameters:

  • result (MetricEvaluationResult) –

    The evaluation result from execute() containing signals and metadata.

Returns:

  • List[SubMetricResult]

    List of SubMetricResult objects representing individual sub-metrics. Returns an empty list by default (no explosion).

Example

def get_sub_metrics(self, result: MetricEvaluationResult) -> List[SubMetricResult]:
    signals = result.signals
    if not signals:
        return []
    return [
        SubMetricResult(
            name='engagement',
            score=signals.engagement_score,
            group='behavioral',
        ),
        SubMetricResult(
            name='sentiment',
            score=signals.sentiment_score,
            group='sentiment',
            threshold=0.5,
        ),
    ]

MetricRegistry

axion.metrics.MetricRegistry

Registry for storing and retrieving metric classes.

register

register(metric_class: Type[BaseMetric]) -> None

Register a metric class into the registry.

Parameters:

  • metric_class (Type[BaseMetric]) –

    A class inheriting from BaseMetric with a valid config.

get

get(key: str, error: bool = True) -> Optional[Type[BaseMetric]]

Retrieve a registered metric class by key.

Parameters:

  • key (str) –

    The unique key of the metric.

  • error (bool, default: True ) –

    If True, raise an error if the key is not found. If False, return None instead.

Returns:

  • Optional[Type[BaseMetric]]

    The registered metric class, or None if not found and error=False.

find

find(query: str) -> List[Type[BaseMetric]]

Search for metrics whose name, description, or tags match a query.

Parameters:

  • query (str) –

    Case-insensitive search string.

Returns:

  • List[Type[BaseMetric]]

    A list of matching metric classes.

get_compatible_metrics

get_compatible_metrics(item: DatasetItem) -> List[Type[BaseMetric]]

Return all metrics compatible with a given DatasetItem.

Parameters:

  • item (DatasetItem) –

    The dataset item to test against.

Returns:

  • List[Type[BaseMetric]]

    A list of compatible metric classes.

get_metric_descriptions

get_metric_descriptions() -> Dict[str, str]

Return {metric_name: description} from the registry.

display

display(show_examples: bool = False) -> None

Display a summary of all registered metrics.

Parameters:

  • show_examples (bool, default: False ) –

    Show custom LLM examples

display_table

display_table() -> None

Display a formatted table of all registered metrics.


@metric decorator

axion.metrics.base.metric

metric(name: str, description: str, required_fields: List[str], optional_fields: Optional[List[str]] = None, key: Optional[str] = None, metric_category: MetricCategory = SCORE, default_threshold: Optional[float] = 0.5, score_range: Optional[tuple[Union[int, float], Union[int, float]]] = (0, 1), tags: Optional[List[str]] = None) -> Callable[[Type[BaseMetric]], Type[BaseMetric]]

Decorator to define and register a metric class with declarative configuration.

Parameters:

  • name (str) –

    Human-readable name of the metric.

  • description (str) –

    Description of what the metric measures.

  • required_fields (List[str]) –

    Fields that must be present in the DatasetItem to evaluate this metric.

  • optional_fields (Optional[List[str]], default: None ) –

    Optional fields the metric may use if available.

  • key (Optional[str], default: None ) –

    Optional. A unique programmatic identifier for the metric. If not provided, it's generated from the name.

  • metric_category (MetricCategory, default: SCORE ) –

    The category of metric output: SCORE (numeric), ANALYSIS (structured insights), or CLASSIFICATION (labels). Defaults to SCORE.

  • default_threshold (Optional[float], default: 0.5 ) –

    The default threshold to consider a score as 'passing'. Optional for ANALYSIS metrics.

  • score_range (Optional[tuple[Union[int, float], Union[int, float]]], default: (0, 1) ) –

    Tuple representing the valid score range for this metric. Optional for ANALYSIS metrics.

  • tags (Optional[List[str]], default: None ) –

    Searchable tags to group or filter metrics.

Returns:

  • Callable[[Type[BaseMetric]], Type[BaseMetric]]

    A class decorator that attaches config and registers the metric in the MetricRegistry.

Raises:

  • TypeError

    If the decorated class is not a subclass of BaseMetric.


Composite Metrics

Faithfulness

axion.metrics.Faithfulness

Faithfulness(mode: EvaluationMode = GRANULAR, strict_mode: bool = False, verdict_scores: Optional[Dict[str, float]] = None, **kwargs)

Bases: BaseMetric

Measures how faithful the generated answer is to the retrieved context. Default Scoring: Fully Supported (+1.0), Partially Supported (+0.5), No Evidence (0.0), Contradictory (-1.0)

Initializes the Faithfulness metric.

Parameters:

  • mode (EvaluationMode, default: GRANULAR ) –

    The evaluation mode for the underlying RAG analyzer.

  • strict_mode (bool, default: False ) –

    If True, sets 'NO_EVIDENCE' to -1.0, penalizing uncited claims (hallucinations) as heavily as contradictions. Overridden by verdict_scores if provided.

  • verdict_scores (Optional[Dict[str, float]], default: None ) –

    A dictionary overriding the default scoring weights (e.g., {"CONTRADICTORY": -2.0, "PARTIALLY_SUPPORTED": 0.75}). If provided, this takes precedence over strict_mode.

  • **kwargs

    Additional keyword arguments passed to the RAGAnalyzer.
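Given the default weights above, a plausible aggregation averages per-claim verdict scores. The sketch below is an assumption about that arithmetic; the verdict labels and `faithfulness_score` helper are illustrative, not the library's API:

```python
# Default weights as documented: Fully Supported (+1.0), Partially
# Supported (+0.5), No Evidence (0.0), Contradictory (-1.0).
DEFAULT_VERDICT_SCORES = {
    'FULLY_SUPPORTED': 1.0,
    'PARTIALLY_SUPPORTED': 0.5,
    'NO_EVIDENCE': 0.0,
    'CONTRADICTORY': -1.0,
}

def faithfulness_score(verdicts, verdict_scores=None, strict_mode=False):
    """Average per-claim verdict weights over all claims."""
    scores = dict(DEFAULT_VERDICT_SCORES)
    if strict_mode:
        # Treat uncited claims as harshly as contradictions.
        scores['NO_EVIDENCE'] = -1.0
    if verdict_scores:
        # Explicit overrides take precedence over strict_mode.
        scores.update(verdict_scores)
    if not verdicts:
        return 0.0
    return sum(scores[v] for v in verdicts) / len(verdicts)
```

For example, one fully supported claim plus one with no evidence would average to 0.5 under the defaults, but to 0.0 under strict_mode.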

AnswerRelevancy

axion.metrics.AnswerRelevancy

AnswerRelevancy(relevancy_mode: Literal['strict', 'task'] = 'task', penalize_ambiguity: bool = False, mode: EvaluationMode = GRANULAR, multi_turn_strategy: Literal['last_turn', 'all_turns'] = 'last_turn', **kwargs)

Bases: BaseMetric

Computes answer relevancy scores by analyzing how well the response addresses the input query. Supports both single-turn and multi-turn (all turns) evaluation.

Initialize the Answer Relevancy metric.

Parameters:

  • relevancy_mode (Literal['strict', 'task'], default: 'task' ) –

    The mode for judging relevancy. 'strict': Only directly answering statements are relevant. 'task': Closely related, helpful statements are also relevant (default).

  • penalize_ambiguity (bool, default: False ) –

    If True, 'idk' verdicts are scored as 0.0 (irrelevant). If False (default), 'idk' verdicts are scored as 1.0 (relevant).

  • mode (EvaluationMode, default: GRANULAR ) –

    The evaluation mode for the internal RAGAnalyzer.

  • multi_turn_strategy (Literal, default: 'last_turn' ) –

    How to handle multi-turn conversations. 'last_turn' (default): Evaluates only the last turn. 'all_turns': Evaluates all Human->AI turns in the conversation.

  • **kwargs

    Additional arguments passed to parent class.

execute async

execute(item: DatasetItem, cache: Optional[AnalysisCache] = None) -> MetricEvaluationResult

Compute the score based on criteria. Automatically handles single-turn or multi-turn evaluation based on self.multi_turn_strategy and item.conversation.

FactualAccuracy

axion.metrics.FactualAccuracy

FactualAccuracy(**kwargs)

Bases: BaseMetric

Factual Accuracy Metric.

Process:

  1. Decompose 'actual_output' into atomic statements.
  2. Perform a binary check (1/0) for each statement against 'expected_output'.
  3. Score = (sum of 1s) / (total statements).
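The scoring step follows directly from the formula; the LLM produces the per-statement 1/0 verdicts, and the arithmetic reduces to a ratio (`factual_accuracy` is a hypothetical helper name):

```python
def factual_accuracy(checks):
    """checks: list of 1/0 verdicts, one per atomic statement."""
    return sum(checks) / len(checks) if checks else 0.0
```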

get_signals staticmethod

get_signals(report: FactualityReport) -> List[SignalDescriptor]

Display the binary checklist in the UI.

AnswerCompleteness

axion.metrics.AnswerCompleteness

AnswerCompleteness(use_expected_output: bool = True, **kwargs)

Bases: BaseMetric

Evaluates the completeness of product-related responses using one of two approaches:

  1. Aspect-based evaluation (when expected_aspects are provided)
  2. Sub-question based evaluation (when expected_aspects are not provided)

Initialize the answer completeness metric with required prompts for both approaches.

Parameters:

  • use_expected_output (bool, default: True ) –

    If True, use the expected answer if available; otherwise decompose the query.

execute async

execute(item: DatasetItem, **kwargs) -> MetricEvaluationResult

Compute the completeness score, returning a structured result object in the signals field.

get_signals staticmethod

get_signals(result: AnswerCompletenessResult) -> List[SignalDescriptor[AnswerCompletenessResult]]

Generates a list of detailed signals from the evaluation result.

AnswerCriteria

axion.metrics.AnswerCriteria

AnswerCriteria(criteria_key: str = 'Complete', scoring_strategy: Literal['concept', 'aspect', 'weighted'] = 'concept', check_for_contradictions: bool = False, weighted_concept_score_weight: float = 0.7, multi_turn_strategy: Literal['last_turn', 'all_turns'] = 'last_turn', multi_turn_aggregation: Literal['cumulative', 'average'] = 'cumulative', **kwargs)

Bases: BaseMetric

Evaluates responses based on specified criteria. This metric extracts key aspects from the criteria and checks whether each aspect is adequately and accurately addressed in the response.

It supports two modes of operation:

  1. Single-Turn / Last-Turn (default): Evaluates item.query vs. item.actual_output. If item.conversation is present, item.query and item.actual_output are auto-populated from the last turn (based on conversation_extraction_strategy).
  2. Multi-Turn: If multi_turn_strategy='all_turns', the metric iterates through the entire item.conversation and evaluates every HumanMessage -> AIMessage pair. The aggregation method is controlled by multi_turn_aggregation.

Initialize the criteria-based answer criteria metric.

Parameters:

  • criteria_key (str, default: 'Complete' ) –

    The key in additional_input or conversation.rubrics to find the criteria text (default: 'Complete').

  • scoring_strategy (Literal['concept', 'aspect', 'weighted'], default: 'concept' ) –

    The scoring method: 'concept', 'aspect', or 'weighted' (default: 'concept').

  • check_for_contradictions (bool, default: False ) –

    If True, uses a stricter prompt to penalize contradictions (default: False).

  • weighted_concept_score_weight (float, default: 0.7 ) –

    The weight for the concept score in 'weighted' strategy (default: 0.7).

  • multi_turn_strategy (Literal['last_turn', 'all_turns'], default: 'last_turn' ) –

    How to handle multi-turn conversations. 'last_turn' (default): Evaluates only the last turn. 'all_turns': Evaluates all Human->AI turns in the conversation.

  • multi_turn_aggregation (Literal['cumulative', 'average'], default: 'cumulative' ) –

    Aggregation method for 'all_turns' strategy. 'cumulative' (default): Scores unique aspects covered across all turns. 'average': Scores average aspect coverage per turn.

  • **kwargs

    Additional arguments passed to parent class

execute async

execute(item: DatasetItem, **kwargs) -> MetricEvaluationResult

Compute the score based on criteria. Automatically handles single-turn or multi-turn evaluation based on self.multi_turn_strategy and item.conversation.

get_signals

get_signals(result: AnswerCriteriaResult) -> List[SignalDescriptor[AnswerCriteriaResult]]

Generates a list of detailed signals from the evaluation result that explain the scoring.

ContextualRelevancy

axion.metrics.ContextualRelevancy

ContextualRelevancy(mode: EvaluationMode = GRANULAR, **kwargs)

Bases: BaseMetric

Measures the relevancy of retrieval (how much retrieved content is relevant).

get_signals

get_signals(result: ContextualRelevancyResult) -> List[SignalDescriptor[ContextualRelevancyResult]]

Defines the explainable signals for the ContextualRelevancy metric.

ContextualPrecision

axion.metrics.ContextualPrecision

ContextualPrecision(mode: EvaluationMode = GRANULAR, **kwargs)

Bases: BaseMetric

Measures the quality of retrieval ranking using Mean Average Precision (MAP).

get_signals

get_signals(result: ContextualPrecisionResult) -> List[SignalDescriptor[ContextualPrecisionResult]]

Defines the explainable signals for the ContextualRanking metric.

ContextualRecall

axion.metrics.ContextualRecall

ContextualRecall(mode: EvaluationMode = GRANULAR, **kwargs)

Bases: BaseMetric

Measures recall of retrieval (how much of the expected answer is in context).

get_signals

get_signals(result: ContextualRecallResult) -> List[SignalDescriptor[ContextualRecallResult]]

Defines the explainable signals for the ContextualRecall metric.


Heuristic Metrics

ExactStringMatch

axion.metrics.ExactStringMatch

ExactStringMatch(model_name: Optional[str] = None, llm: Optional[LLMRunnable] = None, embed_model_name: Optional[str] = None, embed_model: Optional[EmbeddingRunnable] = None, threshold: float = None, llm_provider: Optional[str] = None, required_fields: Optional[List[str]] = None, optional_fields: Optional[List[str]] = None, metric_name: Optional[str] = None, metric_description: Optional[str] = None, name: Optional[str] = None, field_mapping: Optional[Dict[str, str]] = None, metric_category: Optional[MetricCategory] = None, **kwargs: Any)

Bases: BaseMetric

execute async

execute(item: DatasetItem, **kwargs) -> MetricEvaluationResult

Returns 1.0 if the actual output exactly matches the expected output (after stripping). Returns 0.0 otherwise.

CitationPresence

axion.metrics.CitationPresence

CitationPresence(mode: str = 'any_citation', strict: bool = False, embed_model: Optional[EmbeddingRunnable] = None, use_semantic_search: bool = False, resource_similarity_threshold: float = 0.8, custom_resource_phrases: Optional[List[str]] = None, **kwargs)

Bases: BaseMetric

A metric to evaluate if the response includes properly formatted citations, supporting single-turn or multi-turn conversations.

Initialize the Citation Presence metric.

Parameters:

  • mode (str, default: 'any_citation' ) –

    Evaluation mode - "any_citation" or "resource_section".

  • strict (bool, default: False ) –

    If True, validates that found URLs are live by making a HEAD request.

  • embed_model (Optional[EmbeddingRunnable], default: None ) –

    Embedding model for semantic similarity.

  • use_semantic_search (bool, default: False ) –

    If True, uses the embedding model as a fallback.

  • resource_similarity_threshold (float, default: 0.8 ) –

    Threshold for semantic similarity.

  • custom_resource_phrases (Optional[List[str]], default: None ) –

    Custom phrases to look for when detecting resource sections.

  • **kwargs

    Additional arguments passed to BaseMetric.

execute async

execute(item: DatasetItem, **kwargs) -> MetricEvaluationResult

Evaluate citation presence for the entire conversation or single-turn response.

get_signals

get_signals(result: CitationPresenceResult) -> List[SignalDescriptor[CitationPresenceResult]]

Generates detailed signals from the presence evaluation.

Latency

axion.metrics.Latency

Latency(normalize: bool = False, normalization_method: str = 'exponential', **kwargs)

Bases: BaseMetric

Initialize the Latency metric.

Parameters:

  • normalize (bool, default: False ) –

    If True, normalize latency scores to [0, 1] range.

  • normalization_method (str, default: 'exponential' ) –

    Method for normalization. Options:

    • 'exponential': exp(-latency/threshold) for smooth decay
    • 'sigmoid': 1/(1 + exp((latency-threshold)/scale)) for an S-curve
    • 'reciprocal': threshold/(threshold + latency) for hyperbolic decay
    • 'linear': max(0, 1 - latency/threshold) for linear decay

  • **kwargs

    Additional arguments passed to the base metric.
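The four decay curves translate verbatim from the formulas above; `normalize_latency` is an illustrative helper, and the `scale` parameter (used only by the sigmoid) is assumed here as a keyword:

```python
import math

def normalize_latency(latency: float, threshold: float,
                      method: str = 'exponential', scale: float = 1.0) -> float:
    """Map a raw latency to [0, 1] using the documented decay curves."""
    if method == 'exponential':
        return math.exp(-latency / threshold)
    if method == 'sigmoid':
        return 1.0 / (1.0 + math.exp((latency - threshold) / scale))
    if method == 'reciprocal':
        return threshold / (threshold + latency)
    if method == 'linear':
        return max(0.0, 1.0 - latency / threshold)
    raise ValueError(f'unknown method: {method}')
```

Note that at latency == threshold the sigmoid and reciprocal curves both give 0.5, while linear gives 0.0 only at or beyond the threshold; the exponential never quite reaches zero.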

execute async

execute(item: DatasetItem, **kwargs) -> MetricEvaluationResult

Returns the latency recorded for a given test case as the metric score.

This metric assumes the latency field is already populated on the DatasetItem and returns it as-is or normalized based on the initialization parameters.

Parameters:

  • item (DatasetItem) –

    The evaluation data point containing latency information.

Returns:

  • MetricEvaluationResult –

    The result object containing the latency as the score.


Retrieval Metrics

HitRateAtK

axion.metrics.HitRateAtK

HitRateAtK(k: Union[int, List[int]] = 10, main_k: Optional[int] = None, **kwargs)

Bases: _RetrievalMetric

Evaluates whether any relevant document was retrieved in the top K results. Score is 1 if a hit is found, 0 otherwise. Supports multiple K values.

Initialize the Hit Rate @ K metric.

Parameters:

  • k (Union[int, List[int]], default: 10 ) –

    The number of top results to consider, or a list of K values.

  • main_k (Optional[int], default: None ) –

    The K value to use for the main metric score (defaults to the max K in k_list).
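The per-K score reduces to a membership test over the top-k results; a minimal sketch (`hit_rate_at_k` is an illustrative name, not the metric's internal function):

```python
def hit_rate_at_k(retrieved_ids, relevant_ids, k: int) -> int:
    """Return 1 if any relevant document appears in the top-k results, else 0."""
    relevant = set(relevant_ids)
    return int(any(doc_id in relevant for doc_id in retrieved_ids[:k]))
```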

get_signals

get_signals(result: MultiKResult) -> List[SignalDescriptor]

Generates signals detailing the hit rate calculation for all K values.

MeanReciprocalRank

axion.metrics.MeanReciprocalRank

MeanReciprocalRank(model_name: Optional[str] = None, llm: Optional[LLMRunnable] = None, embed_model_name: Optional[str] = None, embed_model: Optional[EmbeddingRunnable] = None, threshold: float = None, llm_provider: Optional[str] = None, required_fields: Optional[List[str]] = None, optional_fields: Optional[List[str]] = None, metric_name: Optional[str] = None, metric_description: Optional[str] = None, name: Optional[str] = None, field_mapping: Optional[Dict[str, str]] = None, metric_category: Optional[MetricCategory] = None, **kwargs: Any)

Bases: _RetrievalMetric

Calculates the Mean Reciprocal Rank (MRR). Score is 1 / (rank of first relevant item). This metric is K-independent.
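The formula translates directly; `reciprocal_rank` is an illustrative helper for a single item (MRR averages this value over a dataset):

```python
def reciprocal_rank(retrieved_ids, relevant_ids) -> float:
    """Return 1 / rank of the first relevant item (1-indexed); 0.0 if none found."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```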

get_signals

get_signals(result: MeanReciprocalRankResult) -> List[SignalDescriptor]

Generates signals detailing the Mean Reciprocal Rank calculation.


Conversational Metrics

GoalCompletion

axion.metrics.GoalCompletion

GoalCompletion(goal_key: str = 'goal', completion_weight: float = 0.6, efficiency_weight: float = 0.4, bottleneck_threshold: int = 5, max_clarification_penalty: float = 0.3, clarification_penalty_rate: float = 0.1, goal_drift_threshold: float = 0.3, outcome_threshold_achieved: float = 0.8, outcome_threshold_partial: float = 0.4, **kwargs)

Bases: BaseMetric

Composite metric that analyzes:

  1. Goal achievement (did we complete the task?)
  2. Conversation efficiency (how well did we complete it?)
  3. Temporal tracking (when/how did we achieve sub-goals?)

Reuses conversation analysis from shared components to minimize LLM calls. The unified analysis approach reduces LLM calls from ~3+N to ~3 total.

Initialize the Goal Completion metric.

Parameters:

  • goal_key (str, default: 'goal' ) –

    Key in additional_input containing the user's goal

  • completion_weight (float, default: 0.6 ) –

    Weight for pure goal achievement (default: 0.6). Rationale: goal achievement is slightly more important than efficiency.

  • efficiency_weight (float, default: 0.4 ) –

    Weight for conversation efficiency (default: 0.4). Rationale: efficiency matters, but achieving the goal is primary.

  • bottleneck_threshold (int, default: 5 ) –

    Number of turns before a sub-goal is flagged as a bottleneck (default: 5). Rationale: most sub-goals should resolve within 3-5 turns in efficient conversations.

  • max_clarification_penalty (float, default: 0.3 ) –

    Maximum penalty for clarifications (default: 0.3). Rationale: excessive clarifications can reduce efficiency by up to 30%.

  • clarification_penalty_rate (float, default: 0.1 ) –

    Penalty per clarification (default: 0.1). Rationale: each clarification represents a 10% efficiency loss.

  • goal_drift_threshold (float, default: 0.3 ) –

    Fraction of unmapped moments that triggers drift detection (default: 0.3). Rationale: if >30% of the conversation is unrelated to the goal, it indicates drift.

  • outcome_threshold_achieved (float, default: 0.8 ) –

    Minimum score for an "achieved" outcome (default: 0.8). Rationale: 80%+ completion indicates successful goal achievement.

  • outcome_threshold_partial (float, default: 0.4 ) –

    Minimum score for a "partially_achieved" outcome (default: 0.4). Rationale: 40-80% completion indicates partial success.
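A sketch of how the documented weights and penalties might combine. The exact arithmetic is an assumption based only on the parameter descriptions above; both helper names are hypothetical:

```python
def goal_completion_score(completion: float, efficiency: float,
                          num_clarifications: int = 0,
                          completion_weight: float = 0.6,
                          efficiency_weight: float = 0.4,
                          clarification_penalty_rate: float = 0.1,
                          max_clarification_penalty: float = 0.3) -> float:
    """Weighted blend of achievement and efficiency, with a capped clarification penalty."""
    penalty = min(num_clarifications * clarification_penalty_rate,
                  max_clarification_penalty)
    efficiency = max(0.0, efficiency - penalty)
    return completion_weight * completion + efficiency_weight * efficiency

def outcome(score: float) -> str:
    """Bucket a score using the documented outcome thresholds."""
    if score >= 0.8:
        return 'achieved'
    if score >= 0.4:
        return 'partially_achieved'
    return 'not_achieved'
```

Under this reading, five clarifications hit the 0.3 penalty cap, so a fully achieved goal with perfect efficiency would still score 0.6 + 0.4 x 0.7 = 0.88.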

execute async

execute(item: DatasetItem, cache: Optional[AnalysisCache] = None) -> MetricEvaluationResult

Execute goal completion analysis using unified approach.

get_signals

get_signals(result: GoalCompletionResult) -> List[SignalDescriptor[GoalCompletionResult]]

Generate comprehensive signals showing multi-layered analysis.

ConversationFlow

axion.metrics.ConversationFlow

ConversationFlow(config: Optional[FlowConfig] = None, **kwargs)

Bases: BaseMetric

Refactored conversation flow metric with modular, testable components.

Improvements over v1:

  • Configurable penalties and thresholds
  • Enum-based issue types (no string matching)
  • Separate, testable detector classes
  • Transparent score decomposition
  • Better statistical methods
  • Comprehensive signal generation

execute async

execute(item: DatasetItem, cache: Optional[AnalysisCache] = None) -> MetricEvaluationResult

Execute comprehensive conversation flow analysis.

get_signals

get_signals(result: ConversationFlowResult) -> List[SignalDescriptor]

Generate comprehensive signals showing score calculation.