
Metrics API Reference

Evaluation metrics for AI agents — composable, LLM-powered and heuristic scoring.

from axion import metric_registry
from axion.metrics.base import BaseMetric, MetricEvaluationResult, metric
from axion.metrics import (
    Faithfulness, AnswerRelevancy, FactualAccuracy,
    AnswerCompleteness, AnswerCriteria,
    ContextualRelevancy, ContextualPrecision, ContextualRecall,
    ExactStringMatch, CitationPresence, Latency,
    HitRateAtK, MeanReciprocalRank,
    GoalCompletion, ConversationFlow,
)

BaseMetric

Base class for all metrics. Provides LLM integration, field validation, structured I/O, and the execute() contract.


MetricRegistry

Global registry for storing, retrieving, and discovering metric classes by key, tag, or compatible fields.


@metric

Decorator that attaches config (name, fields, threshold, tags) and auto-registers the class.

30+ Built-in Metrics

Composite (LLM-judged), heuristic, retrieval, and conversational metrics ready to use out of the box.


BaseMetric

axion.metrics.base.BaseMetric

BaseMetric(model_name: Optional[str] = None, llm: Optional[LLMRunnable] = None, embed_model_name: Optional[str] = None, embed_model: Optional[EmbeddingRunnable] = None, threshold: float = None, llm_provider: Optional[str] = None, required_fields: Optional[List[str]] = None, optional_fields: Optional[List[str]] = None, metric_name: Optional[str] = None, metric_description: Optional[str] = None, name: Optional[str] = None, field_mapping: Optional[Dict[str, str]] = None, metric_category: Optional[MetricCategory] = None, **kwargs: Any)

Bases: LLMHandler, Generic[InputModel, OutputModel]

Base class for all metric evaluation classes, inheriting from LLMHandler.

Initialize the metric with optional LLM and embedding model.

Parameters:

  • model_name (Optional[str], default: None ) –

    Name of the LLM model to use

  • llm (Optional[LLMRunnable], default: None ) –

    A pre-configured LLM model. If not provided, a default is loaded from the registry.

  • embed_model_name (Optional[str], default: None ) –

    Name of the embedding model to use

  • embed_model (Optional[EmbeddingRunnable], default: None ) –

    A pre-configured embedding model handler (if needed).

  • threshold (float, default: None ) –

    The threshold to consider a score as 'passing'. Will overwrite default.

  • llm_provider (Optional[str], default: None ) –

    The LLM provider to use

  • required_fields (Optional[List[str]], default: None ) –

    List of required field names for evaluation

  • optional_fields (Optional[List[str]], default: None ) –

    List of optional field names for evaluation

  • metric_name (Optional[str], default: None ) –

    Optional name for the metric instance (alias: name)

  • metric_description (Optional[str], default: None ) –

    Optional description for the metric instance

  • name (Optional[str], default: None ) –

    Alias for metric_name (for convenience)

  • field_mapping (Optional[Dict[str, str]], default: None ) –

    Optional mapping from canonical field names to source paths. e.g., {'actual_output': 'additional_output.summary'} will resolve 'actual_output' from item.additional_output['summary'].

  • metric_category (Optional[MetricCategory], default: None ) –

    The category of metric output (SCORE, ANALYSIS, CLASSIFICATION). If not provided, falls back to class config or defaults to SCORE.

  • **kwargs (Any, default: {} ) –

    Additional keyword arguments passed to the parent LLMHandler (e.g., logger config).

name property

name: str

Return the name of the metric from instance, config, or fallback to class name.

description property

description: str

Return the description of the metric from instance, config, or fallback to class name.

threshold property

threshold

Metric passing threshold.

input_item property

input_item

Access the final DatasetItem passed to the metric.

required_fields property writable

required_fields: list

Returns the required fields for evaluation.

Falls back to configuration if instance-level fields are not explicitly set.

optional_fields property writable

optional_fields: list

Returns the optional fields for evaluation.

Falls back to configuration if instance-level fields are not explicitly set.

metric_category property writable

metric_category: MetricCategory

Returns the metric category for this metric.

Falls back to configuration if instance-level value is not explicitly set. Defaults to MetricCategory.SCORE if not defined anywhere.

execute async

execute(item: Union[DatasetItem, dict], callbacks: Callbacks = None, **kwargs) -> MetricEvaluationResult

Execute the metric evaluation for a single dataset item.

Parameters:

  • item (Union[DatasetItem, dict]) –

    Input dataset item containing necessary fields for evaluation.

  • callbacks (Callbacks, default: None ) –

    Optional callback handler for events/logging.

Returns:

  • MetricEvaluationResult

    An evaluation result conforming to the output model.

get_field

get_field(item: DatasetItem, field_name: str, default: Any = None) -> Any

Resolve a field value from DatasetItem, respecting field_mapping overrides.

If a mapping is defined for the given field_name, this method resolves the value from the mapped source path. Otherwise, it returns the attribute directly from the item.

Parameters:

  • item (DatasetItem) –

    The DatasetItem to extract from

  • field_name (str) –

    Canonical field name (e.g., 'actual_output')

  • default (Any, default: None ) –

    Value to return if field is not found

Returns:

  • Any

    The resolved field value

Example

With field_mapping={'actual_output': 'additional_output.summary'}

value = self.get_field(item, 'actual_output') # Gets item.additional_output['summary']
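The dotted-path lookup behind field_mapping can be sketched in plain Python. This is an illustration of the mechanism only, not the library's implementation; `resolve_path` and `Item` are hypothetical names:

```python
def resolve_path(obj, path, default=None):
    """Walk a dotted path, trying attribute access then dict lookup at each step."""
    current = obj
    for part in path.split('.'):
        if isinstance(current, dict):
            current = current.get(part, default)
        else:
            current = getattr(current, part, default)
        if current is default:
            return default
    return current

class Item:
    # Stand-in for a DatasetItem with an additional_output dict.
    additional_output = {'summary': 'Paris is the capital of France.'}

value = resolve_path(Item(), 'additional_output.summary')
```

With field_mapping={'actual_output': 'additional_output.summary'}, get_field('actual_output') would follow the same shape of traversal.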

get_mapped_fields

get_mapped_fields(item: DatasetItem) -> Dict[str, Any]

Return all required and optional fields with resolved values.

This convenience method resolves all configured fields (both required and optional) from the DatasetItem, applying any field mappings.

Parameters:

  • item (DatasetItem) –

    The DatasetItem to extract fields from

Returns:

  • Dict[str, Any]

    Dictionary mapping field names to their resolved values

get_evaluation_fields

get_evaluation_fields(item: Union[DatasetItem, dict]) -> Union[DatasetItem, InputModel]

Extracts the appropriate evaluation fields from the dataset item.

Priority is given to explicitly set required and optional fields on the instance. If not defined, configuration-based fields are used. If none are available, the item's default evaluation fields are returned.

Parameters:

  • item (Union[DatasetItem, dict]) –

    The dataset item to extract evaluation fields from.

Returns:

  • Union[DatasetItem, InputModel]

    A dataset item containing only the relevant fields for evaluation.

set_instruction

set_instruction(instruction: str)

Set a new instruction string for the metric.

Parameters:

  • instruction (str) –

    The updated task instruction that guides the metric’s behavior or LLM prompt.

set_examples

set_examples(examples: List[Tuple[DatasetItem, MetricEvaluationResult]])

Replace all current examples with a new set.

Parameters:

  • examples (List[Tuple[DatasetItem, MetricEvaluationResult]]) –

    A list of example input-output pairs used for few-shot prompting or metric calibration.

Example

[
    (
        DatasetItem(
            expected_output='...',
            actual_output='...',
        ),
        MetricEvaluationResult(
            score=...,
            explanation='...',
        ),
    ),
]

add_examples

add_examples(examples: List[Tuple[DatasetItem, MetricEvaluationResult]])

Add new example input-output pairs to the existing list of examples.

Parameters:

  • examples (List[Tuple[DatasetItem, MetricEvaluationResult]]) –

    One or more examples to add to the current list, extending few-shot prompting context.

Example

[
    (
        DatasetItem(
            expected_output='...',
            actual_output='...',
        ),
        MetricEvaluationResult(
            score=...,
            explanation='...',
        ),
    ),
]

compute_cost_estimate

compute_cost_estimate(sub_models: List[BaseMetric])

Computes the total estimated cost from sub-models for this execution.

Parameters:

  • sub_models (List[BaseMetric]) –

    List of sub-models that may have a cost_estimate.

display_prompt

display_prompt(item: Union[dict, InputModel] = None, **kwargs)

Displays the fully constructed prompt that will be sent to the LLM.

Parameters:

  • item (Union[dict, InputModel], default: None ) –

    The input data to be included in the prompt. If None, a placeholder is used. Defaults to None.

get_sub_metrics

get_sub_metrics(result: MetricEvaluationResult) -> List[SubMetricResult]

Override to define how results explode into sub-metrics.

This method is called when is_multi_metric=True to extract individual sub-metric scores from a single evaluation result. The default implementation returns an empty list, meaning no explosion occurs.

Parameters:

  • result (MetricEvaluationResult) –

    The evaluation result from execute() containing signals and metadata.

Returns:

  • List[SubMetricResult]

    List of SubMetricResult objects representing individual sub-metrics. Returns an empty list by default (no explosion).

Example

def get_sub_metrics(self, result: MetricEvaluationResult) -> List[SubMetricResult]:
    signals = result.signals
    if not signals:
        return []
    return [
        SubMetricResult(
            name='engagement',
            score=signals.engagement_score,
            group='behavioral',
        ),
        SubMetricResult(
            name='sentiment',
            score=signals.sentiment_score,
            group='sentiment',
            threshold=0.5,
        ),
    ]

MetricRegistry

axion.metrics.MetricRegistry

Registry for storing and retrieving metric classes.

register

register(metric_class: Type[BaseMetric]) -> None

Register a metric class into the registry.

Parameters:

  • metric_class (Type[BaseMetric]) –

    A class inheriting from BaseMetric with a valid config.

get

get(key: str, error: bool = True) -> Optional[Type[BaseMetric]]

Retrieve a registered metric class by key.

Parameters:

  • key (str) –

    The unique key of the metric.

  • error (bool, default: True ) –

    If True, raise an error if the key is not found. If False, return None instead.

Returns:

  • Optional[Type[BaseMetric]]

    The registered metric class, or None if not found and error=False.

find

find(query: str) -> List[Type[BaseMetric]]

Search for metrics whose name, description, or tags match a query.

Parameters:

  • query (str) –

    Case-insensitive search string.

Returns:

  • List[Type[BaseMetric]]

    A list of matching metric classes.

get_compatible_metrics

get_compatible_metrics(item: DatasetItem) -> List[Type[BaseMetric]]

Return all metrics compatible with a given DatasetItem.

Parameters:

  • item (DatasetItem) –

    The dataset item to test against.

Returns:

  • List[Type[BaseMetric]]

    A list of compatible metric classes.

get_metric_descriptions

get_metric_descriptions() -> Dict[str, str]

Return {metric_name: description} from the registry.

display

display(show_examples: bool = False) -> None

Display a summary of all registered metrics.

Parameters:

  • show_examples (bool, default: False ) –

    Show custom LLM examples

display_table

display_table() -> None

Display a formatted table of all registered metrics.


@metric decorator

axion.metrics.base.metric

metric(name: str, description: str, required_fields: List[str], optional_fields: Optional[List[str]] = None, key: Optional[str] = None, metric_category: MetricCategory = SCORE, default_threshold: Optional[float] = 0.5, score_range: Optional[tuple[Union[int, float], Union[int, float]]] = (0, 1), tags: Optional[List[str]] = None) -> Callable[[Type[BaseMetric]], Type[BaseMetric]]

Decorator to define and register a metric class with declarative configuration.

Parameters:

  • name (str) –

    Human-readable name of the metric.

  • description (str) –

    Description of what the metric measures.

  • required_fields (List[str]) –

    Fields that must be present in the DatasetItem to evaluate this metric.

  • optional_fields (Optional[List[str]], default: None ) –

    Optional fields the metric may use if available.

  • key (Optional[str], default: None ) –

    Optional. A unique programmatic identifier for the metric. If not provided, it's generated from the name.

  • metric_category (MetricCategory, default: SCORE ) –

    The category of metric output: SCORE (numeric), ANALYSIS (structured insights), or CLASSIFICATION (labels). Defaults to SCORE.

  • default_threshold (Optional[float], default: 0.5 ) –

    The default threshold to consider a score as 'passing'. Optional for ANALYSIS metrics.

  • score_range (Optional[tuple[Union[int, float], Union[int, float]]], default: (0, 1) ) –

    Tuple representing the valid score range for this metric. Optional for ANALYSIS metrics.

  • tags (Optional[List[str]], default: None ) –

    Searchable tags to group or filter metrics.

Returns:

  • Callable[[Type[BaseMetric]], Type[BaseMetric]]

    A class decorator that attaches config and registers the metric in the MetricRegistry.

Raises:

  • TypeError

    If the decorated class is not a subclass of BaseMetric.


Composite Metrics

Faithfulness

axion.metrics.Faithfulness

Faithfulness(mode: EvaluationMode = GRANULAR, strict_mode: bool = False, verdict_scores: Optional[Dict[str, float]] = None, **kwargs)

Bases: BaseMetric

Measures how faithful the generated answer is to the retrieved context. Default Scoring: Fully Supported (+1.0), Partially Supported (+0.5), No Evidence (0.0), Contradictory (-1.0)

Initializes the Faithfulness metric.

Parameters:

  • mode (EvaluationMode, default: GRANULAR ) –

    The evaluation mode for the underlying RAG analyzer.

  • strict_mode (bool, default: False ) –

    If True, sets 'NO_EVIDENCE' to -1.0, penalizing uncited claims (hallucinations) as heavily as contradictions. Overridden by verdict_scores if provided.

  • verdict_scores (Optional[Dict[str, float]], default: None ) –

    A dictionary overriding the default scoring weights (e.g., {"CONTRADICTORY": -2.0, "PARTIALLY_SUPPORTED": 0.75}). If provided, this takes precedence over strict_mode.

  • **kwargs

    Additional keyword arguments passed to the RAGAnalyzer.
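Given the default weights above, a plausible aggregation averages per-claim verdict scores. The sketch below is an assumption about that arithmetic; the verdict labels and `faithfulness_score` helper are illustrative, not the library's API:

```python
# Default weights as documented: Fully Supported (+1.0), Partially
# Supported (+0.5), No Evidence (0.0), Contradictory (-1.0).
DEFAULT_VERDICT_SCORES = {
    'FULLY_SUPPORTED': 1.0,
    'PARTIALLY_SUPPORTED': 0.5,
    'NO_EVIDENCE': 0.0,
    'CONTRADICTORY': -1.0,
}

def faithfulness_score(verdicts, verdict_scores=None, strict_mode=False):
    """Average per-claim verdict weights over all claims."""
    scores = dict(DEFAULT_VERDICT_SCORES)
    if strict_mode:
        # Treat uncited claims as harshly as contradictions.
        scores['NO_EVIDENCE'] = -1.0
    if verdict_scores:
        # Explicit overrides take precedence over strict_mode.
        scores.update(verdict_scores)
    if not verdicts:
        return 0.0
    return sum(scores[v] for v in verdicts) / len(verdicts)
```

For example, one fully supported claim plus one with no evidence would average to 0.5 under the defaults, but to 0.0 under strict_mode.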

AnswerRelevancy

axion.metrics.AnswerRelevancy

AnswerRelevancy(relevancy_mode: Literal['strict', 'task'] = 'task', penalize_ambiguity: bool = False, mode: EvaluationMode = GRANULAR, multi_turn_strategy: Literal['last_turn', 'all_turns'] = 'last_turn', **kwargs)

Bases: BaseMetric

Computes answer relevancy scores by analyzing how well the response addresses the input query. Supports both single-turn and multi-turn (all turns) evaluation.

Initialize the Answer Relevancy metric.

Parameters:

  • relevancy_mode (Literal['strict', 'task'], default: 'task' ) –

    The mode for judging relevancy. 'strict': Only directly answering statements are relevant. 'task': Closely related, helpful statements are also relevant (default).

  • penalize_ambiguity (bool, default: False ) –

    If True, 'idk' verdicts are scored as 0.0 (irrelevant). If False (default), 'idk' verdicts are scored as 1.0 (relevant).

  • mode (EvaluationMode, default: GRANULAR ) –

    The evaluation mode for the internal RAGAnalyzer.

  • multi_turn_strategy (Literal, default: 'last_turn' ) –

    How to handle multi-turn conversations. 'last_turn' (default): Evaluates only the last turn. 'all_turns': Evaluates all Human->AI turns in the conversation.

  • **kwargs

    Additional arguments passed to parent class.

execute async

execute(item: DatasetItem, cache: Optional[AnalysisCache] = None) -> MetricEvaluationResult

Compute the score based on criteria. Automatically handles single-turn or multi-turn evaluation based on self.multi_turn_strategy and item.conversation.

FactualAccuracy

axion.metrics.FactualAccuracy

FactualAccuracy(**kwargs)

Bases: BaseMetric

Factual Accuracy Metric.

Process:

  1. Decompose 'actual_output' into atomic statements.
  2. Perform a binary check (1/0) for each statement against 'expected_output'.
  3. Score = (sum of 1s) / (total statements).
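The scoring step follows directly from the formula; the LLM produces the per-statement 1/0 verdicts, and the arithmetic reduces to a ratio (`factual_accuracy` is a hypothetical helper name):

```python
def factual_accuracy(checks):
    """checks: list of 1/0 verdicts, one per atomic statement."""
    return sum(checks) / len(checks) if checks else 0.0
```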

get_signals staticmethod

get_signals(report: FactualityReport) -> List[SignalDescriptor]

Display the binary checklist in the UI.

AnswerCompleteness

axion.metrics.AnswerCompleteness

AnswerCompleteness(use_expected_output: bool = True, **kwargs)

Bases: BaseMetric

Evaluates the completeness of product-related responses using one of two approaches:

  1. Aspect-based evaluation (when expected_aspects are provided)
  2. Sub-question based evaluation (when expected_aspects are not provided)

Initialize the answer completeness metric with required prompts for both approaches.

Parameters:

  • use_expected_output (bool, default: True ) –

    If True, use the expected answer if available; otherwise decompose the query.

execute async

execute(item: DatasetItem, **kwargs) -> MetricEvaluationResult

Compute the completeness score, returning a structured result object in the signals field.

get_signals staticmethod

get_signals(result: AnswerCompletenessResult) -> List[SignalDescriptor[AnswerCompletenessResult]]

Generates a list of detailed signals from the evaluation result.

AnswerCriteria

axion.metrics.AnswerCriteria

AnswerCriteria(criteria_key: str = 'Complete', scoring_strategy: Literal['concept', 'aspect', 'weighted'] = 'concept', check_for_contradictions: bool = False, weighted_concept_score_weight: float = 0.7, multi_turn_strategy: Literal['last_turn', 'all_turns'] = 'last_turn', multi_turn_aggregation: Literal['cumulative', 'average'] = 'cumulative', **kwargs)

Bases: BaseMetric

Evaluates responses based on specified criteria. This metric extracts key aspects from the criteria and checks whether each aspect is adequately and accurately addressed in the response.

It supports two modes of operation:

  1. Single-Turn / Last-Turn (default): Evaluates item.query vs. item.actual_output. If item.conversation is present, item.query and item.actual_output are auto-populated from the last turn (based on conversation_extraction_strategy).
  2. Multi-Turn: If multi_turn_strategy='all_turns', the metric iterates through the entire item.conversation and evaluates every HumanMessage -> AIMessage pair. The aggregation method is controlled by multi_turn_aggregation.

Initialize the criteria-based answer criteria metric.

Parameters:

  • criteria_key (str, default: 'Complete' ) –

    The key in additional_input or conversation.rubrics to find the criteria text (default: 'Complete').

  • scoring_strategy (Literal['concept', 'aspect', 'weighted'], default: 'concept' ) –

    The scoring method: 'concept', 'aspect', or 'weighted' (default: 'concept').

  • check_for_contradictions (bool, default: False ) –

    If True, uses a stricter prompt to penalize contradictions (default: False).

  • weighted_concept_score_weight (float, default: 0.7 ) –

    The weight for the concept score in 'weighted' strategy (default: 0.7).

  • multi_turn_strategy (Literal['last_turn', 'all_turns'], default: 'last_turn' ) –

    How to handle multi-turn conversations. 'last_turn' (default): Evaluates only the last turn. 'all_turns': Evaluates all Human->AI turns in the conversation.

  • multi_turn_aggregation (Literal['cumulative', 'average'], default: 'cumulative' ) –

    Aggregation method for 'all_turns' strategy. 'cumulative' (default): Scores unique aspects covered across all turns. 'average': Scores average aspect coverage per turn.

  • **kwargs

    Additional arguments passed to parent class

execute async

execute(item: DatasetItem, **kwargs) -> MetricEvaluationResult

Compute the score based on criteria. Automatically handles single-turn or multi-turn evaluation based on self.multi_turn_strategy and item.conversation.

get_signals

get_signals(result: AnswerCriteriaResult) -> List[SignalDescriptor[AnswerCriteriaResult]]

Generates a list of detailed signals from the evaluation result that explain the scoring.

ContextualRelevancy

axion.metrics.ContextualRelevancy

ContextualRelevancy(mode: EvaluationMode = GRANULAR, **kwargs)

Bases: BaseMetric

Measures the relevancy of retrieval (how much retrieved content is relevant).

get_signals

get_signals(result: ContextualRelevancyResult) -> List[SignalDescriptor[ContextualRelevancyResult]]

Defines the explainable signals for the ContextualRelevancy metric.

ContextualPrecision

axion.metrics.ContextualPrecision

ContextualPrecision(mode: EvaluationMode = GRANULAR, **kwargs)

Bases: BaseMetric

Measures the quality of retrieval ranking using Mean Average Precision (MAP).

get_signals

get_signals(result: ContextualPrecisionResult) -> List[SignalDescriptor[ContextualPrecisionResult]]

Defines the explainable signals for the ContextualRanking metric.

ContextualRecall

axion.metrics.ContextualRecall

ContextualRecall(mode: EvaluationMode = GRANULAR, **kwargs)

Bases: BaseMetric

Measures recall of retrieval (how much of the expected answer is in context).

get_signals

get_signals(result: ContextualRecallResult) -> List[SignalDescriptor[ContextualRecallResult]]

Defines the explainable signals for the ContextualRecall metric.


Heuristic Metrics

ExactStringMatch

axion.metrics.ExactStringMatch

ExactStringMatch(model_name: Optional[str] = None, llm: Optional[LLMRunnable] = None, embed_model_name: Optional[str] = None, embed_model: Optional[EmbeddingRunnable] = None, threshold: float = None, llm_provider: Optional[str] = None, required_fields: Optional[List[str]] = None, optional_fields: Optional[List[str]] = None, metric_name: Optional[str] = None, metric_description: Optional[str] = None, name: Optional[str] = None, field_mapping: Optional[Dict[str, str]] = None, metric_category: Optional[MetricCategory] = None, **kwargs: Any)

Bases: BaseMetric

execute async

execute(item: DatasetItem, **kwargs) -> MetricEvaluationResult

Returns 1.0 if the actual output exactly matches the expected output (after stripping). Returns 0.0 otherwise.

CitationPresence

axion.metrics.CitationPresence

CitationPresence(mode: str = 'any_citation', strict: bool = False, embed_model: Optional[EmbeddingRunnable] = None, use_semantic_search: bool = False, resource_similarity_threshold: float = 0.8, custom_resource_phrases: Optional[List[str]] = None, **kwargs)

Bases: BaseMetric

A metric to evaluate if the response includes properly formatted citations, supporting single-turn or multi-turn conversations.

Initialize the Citation Presence metric.

Parameters:

  • mode (str, default: 'any_citation' ) –

    Evaluation mode - "any_citation" or "resource_section".

  • strict (bool, default: False ) –

    If True, validates that found URLs are live by making a HEAD request.

  • embed_model (Optional[EmbeddingRunnable], default: None ) –

    Embedding model for semantic similarity.

  • use_semantic_search (bool, default: False ) –

    If True, uses the embedding model as a fallback.

  • resource_similarity_threshold (float, default: 0.8 ) –

    Threshold for semantic similarity.

  • custom_resource_phrases (Optional[List[str]], default: None ) –

    Custom phrases to look for when detecting resource sections.

  • **kwargs

    Additional arguments passed to BaseMetric.

execute async

execute(item: DatasetItem, **kwargs) -> MetricEvaluationResult

Evaluate citation presence for the entire conversation or single-turn response.

get_signals

get_signals(result: CitationPresenceResult) -> List[SignalDescriptor[CitationPresenceResult]]

Generates detailed signals from the presence evaluation.

Latency

axion.metrics.Latency

Latency(normalize: bool = False, normalization_method: str = 'exponential', **kwargs)

Bases: BaseMetric

Initialize the Latency metric.

Parameters:

  • normalize (bool, default: False ) –

    If True, normalize latency scores to [0, 1] range.

  • normalization_method (str, default: 'exponential' ) –

    Method for normalization. Options:

    • 'exponential': exp(-latency/threshold) for smooth decay
    • 'sigmoid': 1/(1 + exp((latency-threshold)/scale)) for an S-curve
    • 'reciprocal': threshold/(threshold + latency) for hyperbolic decay
    • 'linear': max(0, 1 - latency/threshold) for linear decay

  • **kwargs

    Additional arguments passed to the base metric.
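The four decay curves translate verbatim from the formulas above; `normalize_latency` is an illustrative helper, and the `scale` parameter (used only by the sigmoid) is assumed here as a keyword:

```python
import math

def normalize_latency(latency: float, threshold: float,
                      method: str = 'exponential', scale: float = 1.0) -> float:
    """Map a raw latency to [0, 1] using the documented decay curves."""
    if method == 'exponential':
        return math.exp(-latency / threshold)
    if method == 'sigmoid':
        return 1.0 / (1.0 + math.exp((latency - threshold) / scale))
    if method == 'reciprocal':
        return threshold / (threshold + latency)
    if method == 'linear':
        return max(0.0, 1.0 - latency / threshold)
    raise ValueError(f'unknown method: {method}')
```

Note that at latency == threshold the sigmoid and reciprocal curves both give 0.5, while linear gives 0.0 only at or beyond the threshold; the exponential never quite reaches zero.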

execute async

execute(item: DatasetItem, **kwargs) -> MetricEvaluationResult

Returns the latency recorded for a given test case as the metric score.

This metric assumes the latency field is already populated on the DatasetItem and returns it as-is or normalized based on the initialization parameters.

Parameters:

  • item (DatasetItem) –

    The evaluation data point containing latency information.

Returns:

  • MetricEvaluationResult –

    The result object containing the latency as the score.


Retrieval Metrics

HitRateAtK

axion.metrics.HitRateAtK

HitRateAtK(k: Union[int, List[int]] = 10, main_k: Optional[int] = None, **kwargs)

Bases: _RetrievalMetric

Evaluates whether any relevant document was retrieved in the top K results. Score is 1 if a hit is found, 0 otherwise. Supports multiple K values.

Initialize the Hit Rate @ K metric.

Parameters:

  • k (Union[int, List[int]], default: 10 ) –

    The number of top results to consider, or a list of K values.

  • main_k (Optional[int], default: None ) –

    The K value to use for the main metric score (defaults to the max K in k_list).
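The per-K score reduces to a membership test over the top-k results; a minimal sketch (`hit_rate_at_k` is an illustrative name, not the metric's internal function):

```python
def hit_rate_at_k(retrieved_ids, relevant_ids, k: int) -> int:
    """Return 1 if any relevant document appears in the top-k results, else 0."""
    relevant = set(relevant_ids)
    return int(any(doc_id in relevant for doc_id in retrieved_ids[:k]))
```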

get_signals

get_signals(result: MultiKResult) -> List[SignalDescriptor]

Generates signals detailing the hit rate calculation for all K values.

MeanReciprocalRank

axion.metrics.MeanReciprocalRank

MeanReciprocalRank(model_name: Optional[str] = None, llm: Optional[LLMRunnable] = None, embed_model_name: Optional[str] = None, embed_model: Optional[EmbeddingRunnable] = None, threshold: float = None, llm_provider: Optional[str] = None, required_fields: Optional[List[str]] = None, optional_fields: Optional[List[str]] = None, metric_name: Optional[str] = None, metric_description: Optional[str] = None, name: Optional[str] = None, field_mapping: Optional[Dict[str, str]] = None, metric_category: Optional[MetricCategory] = None, **kwargs: Any)

Bases: _RetrievalMetric

Calculates the Mean Reciprocal Rank (MRR). Score is 1 / (rank of first relevant item). This metric is K-independent.
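The formula translates directly; `reciprocal_rank` is an illustrative helper for a single item (MRR averages this value over a dataset):

```python
def reciprocal_rank(retrieved_ids, relevant_ids) -> float:
    """Return 1 / rank of the first relevant item (1-indexed); 0.0 if none found."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```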

get_signals

get_signals(result: MeanReciprocalRankResult) -> List[SignalDescriptor]

Generates signals detailing the Mean Reciprocal Rank calculation.


Conversational Metrics

GoalCompletion

axion.metrics.GoalCompletion

GoalCompletion(goal_key: str = 'goal', completion_weight: float = 0.6, efficiency_weight: float = 0.4, bottleneck_threshold: int = 5, max_clarification_penalty: float = 0.3, clarification_penalty_rate: float = 0.1, goal_drift_threshold: float = 0.3, outcome_threshold_achieved: float = 0.8, outcome_threshold_partial: float = 0.4, **kwargs)

Bases: BaseMetric

Composite metric that analyzes:

  1. Goal achievement (did we complete the task?)
  2. Conversation efficiency (how well did we complete it?)
  3. Temporal tracking (when/how did we achieve sub-goals?)

Reuses conversation analysis from shared components to minimize LLM calls. The unified analysis approach reduces LLM calls from ~3+N to ~3 total.

Initialize the Goal Completion metric.

Parameters:

  • goal_key (str, default: 'goal' ) –

    Key in additional_input containing the user's goal

  • completion_weight (float, default: 0.6 ) –

    Weight for pure goal achievement (default: 0.6). Rationale: goal achievement is slightly more important than efficiency.

  • efficiency_weight (float, default: 0.4 ) –

    Weight for conversation efficiency (default: 0.4). Rationale: efficiency matters, but achieving the goal is primary.

  • bottleneck_threshold (int, default: 5 ) –

    Number of turns before a sub-goal is flagged as a bottleneck (default: 5). Rationale: most sub-goals should resolve within 3-5 turns in efficient conversations.

  • max_clarification_penalty (float, default: 0.3 ) –

    Maximum penalty for clarifications (default: 0.3). Rationale: excessive clarifications can reduce efficiency by up to 30%.

  • clarification_penalty_rate (float, default: 0.1 ) –

    Penalty per clarification (default: 0.1). Rationale: each clarification represents a 10% efficiency loss.

  • goal_drift_threshold (float, default: 0.3 ) –

    Fraction of unmapped moments that triggers drift detection (default: 0.3). Rationale: if >30% of the conversation is unrelated to the goal, it indicates drift.

  • outcome_threshold_achieved (float, default: 0.8 ) –

    Minimum score for an "achieved" outcome (default: 0.8). Rationale: 80%+ completion indicates successful goal achievement.

  • outcome_threshold_partial (float, default: 0.4 ) –

    Minimum score for a "partially_achieved" outcome (default: 0.4). Rationale: 40-80% completion indicates partial success.
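A sketch of how the documented weights and penalties might combine. The exact arithmetic is an assumption based only on the parameter descriptions above; both helper names are hypothetical:

```python
def goal_completion_score(completion: float, efficiency: float,
                          num_clarifications: int = 0,
                          completion_weight: float = 0.6,
                          efficiency_weight: float = 0.4,
                          clarification_penalty_rate: float = 0.1,
                          max_clarification_penalty: float = 0.3) -> float:
    """Weighted blend of achievement and efficiency, with a capped clarification penalty."""
    penalty = min(num_clarifications * clarification_penalty_rate,
                  max_clarification_penalty)
    efficiency = max(0.0, efficiency - penalty)
    return completion_weight * completion + efficiency_weight * efficiency

def outcome(score: float) -> str:
    """Bucket a score using the documented outcome thresholds."""
    if score >= 0.8:
        return 'achieved'
    if score >= 0.4:
        return 'partially_achieved'
    return 'not_achieved'
```

Under this reading, five clarifications hit the 0.3 penalty cap, so a fully achieved goal with perfect efficiency would still score 0.6 + 0.4 x 0.7 = 0.88.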

execute async

execute(item: DatasetItem, cache: Optional[AnalysisCache] = None) -> MetricEvaluationResult

Execute goal completion analysis using unified approach.

get_signals

get_signals(result: GoalCompletionResult) -> List[SignalDescriptor[GoalCompletionResult]]

Generate comprehensive signals showing multi-layered analysis.

ConversationFlow

axion.metrics.ConversationFlow

ConversationFlow(config: Optional[FlowConfig] = None, **kwargs)

Bases: BaseMetric

Refactored conversation flow metric with modular, testable components.

Improvements over v1:

  • Configurable penalties and thresholds
  • Enum-based issue types (no string matching)
  • Separate, testable detector classes
  • Transparent score decomposition
  • Better statistical methods
  • Comprehensive signal generation

execute async

execute(item: DatasetItem, cache: Optional[AnalysisCache] = None) -> MetricEvaluationResult

Execute comprehensive conversation flow analysis.

get_signals

get_signals(result: ConversationFlowResult) -> List[SignalDescriptor]

Generate comprehensive signals showing score calculation.