Metrics API Reference¶
Evaluation metrics for AI agents — composable, LLM-powered and heuristic scoring.
```python
from axion import metric_registry
from axion.metrics.base import BaseMetric, MetricEvaluationResult, metric
from axion.metrics import (
    Faithfulness, AnswerRelevancy, FactualAccuracy,
    AnswerCompleteness, AnswerCriteria,
    ContextualRelevancy, ContextualPrecision, ContextualRecall,
    ExactStringMatch, CitationPresence, Latency,
    HitRateAtK, MeanReciprocalRank,
    GoalCompletion, ConversationFlow,
)
```
- BaseMetric – Base class for all metrics. Provides LLM integration, field validation, structured I/O, and the execute() contract.
- MetricRegistry – Global registry for storing, retrieving, and discovering metric classes by key, tag, or compatible fields.
- @metric – Decorator that attaches config (name, fields, threshold, tags) and auto-registers the class.
- Built-in Metrics – Composite (LLM-judged), heuristic, retrieval, and conversational metrics ready to use out of the box.
BaseMetric¶
axion.metrics.base.BaseMetric ¶
BaseMetric(model_name: Optional[str] = None, llm: Optional[LLMRunnable] = None, embed_model_name: Optional[str] = None, embed_model: Optional[EmbeddingRunnable] = None, threshold: float = None, llm_provider: Optional[str] = None, required_fields: Optional[List[str]] = None, optional_fields: Optional[List[str]] = None, metric_name: Optional[str] = None, metric_description: Optional[str] = None, name: Optional[str] = None, field_mapping: Optional[Dict[str, str]] = None, metric_category: Optional[MetricCategory] = None, **kwargs: Any)
Bases: LLMHandler, Generic[InputModel, OutputModel]
Base class for all metric evaluation classes, inheriting from LLMHandler.
Initialize the metric with optional LLM and embedding model.
Parameters:
- model_name (Optional[str], default: None) – Name of the LLM model to use.
- llm (Optional[LLMRunnable], default: None) – A pre-configured LLM model. If not provided, a default is loaded from the registry.
- embed_model_name (Optional[str], default: None) – Name of the embedding model to use.
- embed_model (Optional[EmbeddingRunnable], default: None) – A pre-configured embedding model handler (if needed).
- threshold (float, default: None) – The threshold to consider a score as 'passing'. Overrides the default.
- llm_provider (Optional[str], default: None) – The LLM provider to use.
- required_fields (Optional[List[str]], default: None) – List of required field names for evaluation.
- optional_fields (Optional[List[str]], default: None) – List of optional field names for evaluation.
- metric_name (Optional[str], default: None) – Optional name for the metric instance (alias: name).
- metric_description (Optional[str], default: None) – Optional description for the metric instance.
- name (Optional[str], default: None) – Alias for metric_name (for convenience).
- field_mapping (Optional[Dict[str, str]], default: None) – Optional mapping from canonical field names to source paths, e.g. {'actual_output': 'additional_output.summary'} resolves 'actual_output' from item.additional_output['summary'].
- metric_category (Optional[MetricCategory], default: None) – The category of metric output (SCORE, ANALYSIS, CLASSIFICATION). If not provided, falls back to class config or defaults to SCORE.
- **kwargs (Any) – Additional keyword arguments passed to the parent LLMHandler (e.g., logger config).
name property ¶
Return the name of the metric from instance, config, or fallback to class name.
description property ¶
Return the description of the metric from instance, config, or fallback to class name.
required_fields property writable ¶
Returns the required fields for evaluation.
Falls back to configuration if instance-level fields are not explicitly set.
optional_fields property writable ¶
Returns the optional fields for evaluation.
Falls back to configuration if instance-level fields are not explicitly set.
metric_category property writable ¶
Returns the metric category for this metric.
Falls back to configuration if instance-level value is not explicitly set. Defaults to MetricCategory.SCORE if not defined anywhere.
execute async ¶
execute(item: Union[DatasetItem, dict], callbacks: Callbacks = None, **kwargs) -> MetricEvaluationResult
Execute the metric evaluation for a single dataset item.
Parameters:
- item (Union[DatasetItem, dict]) – Input dataset item containing the fields needed for evaluation.
- callbacks (Callbacks, default: None) – Optional callback handler for events/logging.
Returns:
- MetricEvaluationResult – An evaluation result conforming to the output model.
get_field ¶
get_field(item: DatasetItem, field_name: str, default: Any = None) -> Any
Resolve a field value from DatasetItem, respecting field_mapping overrides.
If a mapping is defined for the given field_name, this method resolves the value from the mapped source path. Otherwise, it returns the attribute directly from the item.
Parameters:
- item (DatasetItem) – The DatasetItem to extract from.
- field_name (str) – Canonical field name (e.g., 'actual_output').
- default (Any, default: None) – Value to return if the field is not found.
Returns:
- Any – The resolved field value.
Example

With field_mapping={'actual_output': 'additional_output.summary'}:

```python
value = self.get_field(item, 'actual_output')  # gets item.additional_output['summary']
```
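The dotted source-path resolution that field_mapping relies on can be sketched in plain Python. This is an illustrative stand-in, not axion's implementation; `resolve_path` and the SimpleNamespace item are hypothetical:

```python
from types import SimpleNamespace

def resolve_path(item, path):
    """Follow a dotted path, reading dict keys or object attributes at each step."""
    value = item
    for part in path.split('.'):
        value = value[part] if isinstance(value, dict) else getattr(value, part)
    return value

# A stand-in for a DatasetItem whose summary lives in additional_output
item = SimpleNamespace(additional_output={'summary': 'Paris is the capital of France.'})
print(resolve_path(item, 'additional_output.summary'))
```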
get_mapped_fields ¶
get_mapped_fields(item: DatasetItem) -> Dict[str, Any]
Return all required and optional fields with resolved values.
This convenience method resolves all configured fields (both required and optional) from the DatasetItem, applying any field mappings.
Parameters:
- item (DatasetItem) – The DatasetItem to extract fields from.
Returns:
- Dict[str, Any] – Dictionary mapping field names to their resolved values.
get_evaluation_fields ¶
get_evaluation_fields(item: Union[DatasetItem, dict]) -> Union[DatasetItem, InputModel]
Extracts the appropriate evaluation fields from the dataset item.
Priority is given to explicitly set required and optional fields on the instance. If not defined, configuration-based fields are used. If none are available, the item's default evaluation fields are returned.
Parameters:
- item (Union[DatasetItem, dict]) – The input dataset item.
Returns:
- Union[DatasetItem, InputModel] – A dataset item containing only the relevant fields for evaluation.
set_instruction ¶
set_instruction(instruction: str)
Set a new instruction string for the metric.
Parameters:
- instruction (str) – The updated task instruction that guides the metric's behavior or LLM prompt.
set_examples ¶
set_examples(examples: List[Tuple[DatasetItem, MetricEvaluationResult]])
Replace all current examples with a new set.
Parameters:
- examples (List[Tuple[DatasetItem, MetricEvaluationResult]]) – A list of example input-output pairs used for few-shot prompting or metric calibration.
Example

```python
[
    (
        DatasetItem(
            expected_output='....',
            actual_output='...',
        ),
        MetricEvaluationResult(
            score=...,
            explanation="...",
        ),
    ),
]
```
add_examples ¶
add_examples(examples: List[Tuple[DatasetItem, MetricEvaluationResult]])
Add new example input-output pairs to the existing list of examples.
Parameters:
- examples (List[Tuple[DatasetItem, MetricEvaluationResult]]) – One or more examples to add to the current list, extending the few-shot prompting context.
Example

```python
[
    (
        DatasetItem(
            expected_output='....',
            actual_output='...',
        ),
        MetricEvaluationResult(
            score=...,
            explanation="...",
        ),
    ),
]
```
compute_cost_estimate ¶
compute_cost_estimate(sub_models: List[BaseMetric])
Computes the total estimated cost from sub-models for this execution.
Parameters:
- sub_models (List[BaseMetric]) – List of sub-models that may have a cost_estimate.
display_prompt ¶
Displays the fully constructed prompt that will be sent to the LLM.
Parameters:
- item (Union[dict, InputModel], default: None) – The input data to be included in the prompt. If None, a placeholder is used.
get_sub_metrics ¶
Override to define how results explode into sub-metrics.
This method is called when is_multi_metric=True to extract individual
sub-metric scores from a single evaluation result. The default implementation
returns an empty list, meaning no explosion occurs.
Parameters:
- result (MetricEvaluationResult) – The evaluation result from execute() containing signals and metadata.
Returns:
- List[SubMetricResult] – SubMetricResult objects representing individual sub-metrics; empty list by default (no explosion).
Example

```python
def get_sub_metrics(self, result: MetricEvaluationResult) -> List[SubMetricResult]:
    signals = result.signals
    if not signals:
        return []
    return [
        SubMetricResult(
            name='engagement',
            score=signals.engagement_score,
            group='behavioral',
        ),
        SubMetricResult(
            name='sentiment',
            score=signals.sentiment_score,
            group='sentiment',
            threshold=0.5,
        ),
    ]
```
MetricRegistry¶
axion.metrics.MetricRegistry ¶
Registry for storing and retrieving metric classes.
register ¶
register(metric_class: Type[BaseMetric]) -> None
Register a metric class into the registry.
Parameters:
- metric_class (Type[BaseMetric]) – A class inheriting from BaseMetric with a valid config.
get ¶
get(key: str, error: bool = True) -> Optional[Type[BaseMetric]]
Retrieve a registered metric class by key.
Parameters:
- key (str) – The unique key of the metric.
- error (bool, default: True) – If True, raise an error if the key is not found; if False, return None instead.
Returns:
- Optional[Type[BaseMetric]] – The registered metric class, or None if not found and error=False.
find ¶
find(query: str) -> List[Type[BaseMetric]]
Search for metrics whose name, description, or tags match a query.
Parameters:
- query (str) – Case-insensitive search string.
Returns:
- List[Type[BaseMetric]] – A list of matching metric classes.
get_compatible_metrics ¶
get_compatible_metrics(item: DatasetItem) -> List[Type[BaseMetric]]
Return all metrics compatible with a given DatasetItem.
Parameters:
- item (DatasetItem) – The dataset item to test against.
Returns:
- List[Type[BaseMetric]] – A list of compatible metric classes.
get_metric_descriptions ¶
Return {metric_name: description} from the registry.
display ¶
Display a summary of all registered metrics.
Parameters:
- show_examples (bool, default: False) – Show custom LLM examples.
@metric decorator¶
axion.metrics.base.metric ¶
metric(name: str, description: str, required_fields: List[str], optional_fields: Optional[List[str]] = None, key: Optional[str] = None, metric_category: MetricCategory = SCORE, default_threshold: Optional[float] = 0.5, score_range: Optional[tuple[Union[int, float], Union[int, float]]] = (0, 1), tags: Optional[List[str]] = None) -> Callable[[Type[BaseMetric]], Type[BaseMetric]]
Decorator to define and register a metric class with declarative configuration.
Parameters:
- name (str) – Human-readable name of the metric.
- description (str) – Description of what the metric measures.
- required_fields (List[str]) – Fields that must be present in the DatasetItem to evaluate this metric.
- optional_fields (Optional[List[str]], default: None) – Optional fields the metric may use if available.
- key (Optional[str], default: None) – A unique programmatic identifier for the metric. If not provided, it's generated from the name.
- metric_category (MetricCategory, default: SCORE) – The category of metric output: SCORE (numeric), ANALYSIS (structured insights), or CLASSIFICATION (labels).
- default_threshold (Optional[float], default: 0.5) – The default threshold to consider a score as 'passing'. Optional for ANALYSIS metrics.
- score_range (Optional[tuple[Union[int, float], Union[int, float]]], default: (0, 1)) – Tuple representing the valid score range for this metric. Optional for ANALYSIS metrics.
- tags (Optional[List[str]], default: None) – Searchable tags to group or filter metrics.
Returns:
- Callable[[Type[BaseMetric]], Type[BaseMetric]] – A class decorator that attaches config and registers the metric in the MetricRegistry.
Raises:
- TypeError – If the decorated class is not a subclass of BaseMetric.
Composite Metrics¶
Faithfulness¶
axion.metrics.Faithfulness ¶
Faithfulness(mode: EvaluationMode = GRANULAR, strict_mode: bool = False, verdict_scores: Optional[Dict[str, float]] = None, **kwargs)
Bases: BaseMetric
Measures how faithful the generated answer is to the retrieved context.
Default scoring: Fully Supported (+1.0), Partially Supported (+0.5), No Evidence (0.0), Contradictory (-1.0).
Initializes the Faithfulness metric.
Parameters:
- mode (EvaluationMode, default: GRANULAR) – The evaluation mode for the underlying RAG analyzer.
- strict_mode (bool, default: False) – If True, sets 'NO_EVIDENCE' to -1.0, penalizing uncited claims (hallucinations) as heavily as contradictions. Overridden by verdict_scores if provided.
- verdict_scores (Optional[Dict[str, float]], default: None) – A dictionary overriding the default scoring weights (e.g., {"CONTRADICTORY": -2.0, "PARTIALLY_SUPPORTED": 0.75}). If provided, this takes precedence over strict_mode.
- **kwargs – Additional keyword arguments passed to the RAGAnalyzer.
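The documented verdict weights imply a simple averaging scheme, which can be sketched in plain Python. This is an illustrative reimplementation of the scoring rules above, not the library's code; the verdict labels are assumed to match the documented categories:

```python
# Default weights from the documented scoring table.
DEFAULT_SCORES = {
    'FULLY_SUPPORTED': 1.0,
    'PARTIALLY_SUPPORTED': 0.5,
    'NO_EVIDENCE': 0.0,
    'CONTRADICTORY': -1.0,
}

def faithfulness_score(verdicts, strict_mode=False, verdict_scores=None):
    """Average the per-claim verdict weights, mirroring the documented defaults."""
    scores = dict(DEFAULT_SCORES)
    if strict_mode:
        scores['NO_EVIDENCE'] = -1.0  # hallucinations penalized like contradictions
    if verdict_scores:
        scores.update(verdict_scores)  # explicit overrides take precedence
    if not verdicts:
        return 0.0
    return sum(scores[v] for v in verdicts) / len(verdicts)

print(faithfulness_score(['FULLY_SUPPORTED', 'NO_EVIDENCE']))                    # 0.5
print(faithfulness_score(['FULLY_SUPPORTED', 'NO_EVIDENCE'], strict_mode=True))  # 0.0
```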
AnswerRelevancy¶
axion.metrics.AnswerRelevancy ¶
AnswerRelevancy(relevancy_mode: Literal['strict', 'task'] = 'task', penalize_ambiguity: bool = False, mode: EvaluationMode = GRANULAR, multi_turn_strategy: Literal['last_turn', 'all_turns'] = 'last_turn', **kwargs)
Bases: BaseMetric
Computes answer relevancy scores by analyzing how well the response addresses the input query. Supports both single-turn and multi-turn (all turns) evaluation.
Initialize the Answer Relevancy metric.
Parameters:
- relevancy_mode (Literal['strict', 'task'], default: 'task') – The mode for judging relevancy. 'strict': only directly answering statements are relevant. 'task': closely related, helpful statements are also relevant (default).
- penalize_ambiguity (bool, default: False) – If True, 'idk' verdicts are scored as 0.0 (irrelevant). If False (default), 'idk' verdicts are scored as 1.0 (relevant).
- mode (EvaluationMode, default: GRANULAR) – The evaluation mode for the internal RAGAnalyzer.
- multi_turn_strategy (Literal['last_turn', 'all_turns'], default: 'last_turn') – How to handle multi-turn conversations. 'last_turn' (default): evaluates only the last turn. 'all_turns': evaluates all Human->AI turns in the conversation.
- **kwargs – Additional arguments passed to the parent class.
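The per-statement verdict weighting described above can be sketched as a plain function. This is an illustrative reimplementation of the documented 'idk' handling, not the library's code; the 'yes'/'no'/'idk' labels are assumptions based on the parameter descriptions:

```python
def answer_relevancy_score(verdicts, penalize_ambiguity=False):
    """Fraction of output statements judged relevant to the query.
    'idk' verdicts count as relevant (1.0) unless penalize_ambiguity is True."""
    weights = {'yes': 1.0, 'no': 0.0,
               'idk': 0.0 if penalize_ambiguity else 1.0}
    if not verdicts:
        return 0.0
    return sum(weights[v] for v in verdicts) / len(verdicts)

print(round(answer_relevancy_score(['yes', 'idk', 'no']), 3))        # 0.667
print(round(answer_relevancy_score(['yes', 'idk', 'no'], True), 3))  # 0.333
```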
execute async ¶
execute(item: DatasetItem, cache: Optional[AnalysisCache] = None) -> MetricEvaluationResult
Compute the relevancy score.
Automatically handles single-turn or multi-turn evaluation based on
self.multi_turn_strategy and item.conversation.
FactualAccuracy¶
axion.metrics.FactualAccuracy ¶
Bases: BaseMetric
Factual Accuracy Metric.
Process:
1. Decompose 'actual_output' into atomic statements.
2. Perform a binary check (1/0) for each statement against 'expected_output'.
3. Score = (number of supported statements) / (total statements).
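Step 3 of the process is simple arithmetic and can be sketched directly (an illustrative helper, not the library's code; the binary checks themselves come from the LLM in step 2):

```python
def factual_accuracy_score(checks):
    """checks: one binary verdict per atomic statement (1 = supported, 0 = not)."""
    return sum(checks) / len(checks) if checks else 0.0

print(factual_accuracy_score([1, 1, 0, 1]))  # 0.75
```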
get_signals staticmethod ¶
Display the binary checklist in the UI.
AnswerCompleteness¶
axion.metrics.AnswerCompleteness ¶
Bases: BaseMetric
Evaluates the completeness of product-related responses using one of two approaches:
1. Aspect-based evaluation (when expected_aspects are provided)
2. Sub-question based evaluation (when expected_aspects are not provided)
Initialize the answer completeness metric with required prompts for both approaches.
Parameters:
- use_expected_output (bool, default: True) – If True, use the expected answer when available; otherwise decompose the query.
execute async ¶
execute(item: DatasetItem, **kwargs) -> MetricEvaluationResult
Compute the completeness score, returning a structured result object
in the signals field.
get_signals staticmethod ¶
Generates a list of detailed signals from the evaluation result.
AnswerCriteria¶
axion.metrics.AnswerCriteria ¶
AnswerCriteria(criteria_key: str = 'Complete', scoring_strategy: Literal['concept', 'aspect', 'weighted'] = 'concept', check_for_contradictions: bool = False, weighted_concept_score_weight: float = 0.7, multi_turn_strategy: Literal['last_turn', 'all_turns'] = 'last_turn', multi_turn_aggregation: Literal['cumulative', 'average'] = 'cumulative', **kwargs)
Bases: BaseMetric
Evaluates responses based on specified criteria. This metric extracts key aspects from the criteria and checks whether each aspect is adequately and accurately addressed in the response.
It supports two modes of operation:
1. Single-Turn / Last-Turn (default): Evaluates item.query vs. item.actual_output.
If item.conversation is present, item.query and item.actual_output are
auto-populated from the last turn (based on conversation_extraction_strategy).
2. Multi-Turn: If multi_turn_strategy='all_turns', this metric will iterate
through the entire item.conversation and evaluate every HumanMessage -> AIMessage
pair. The aggregation method is controlled by multi_turn_aggregation.
Initialize the criteria-based answer metric.
Parameters:
- criteria_key (str, default: 'Complete') – The key in additional_input or conversation.rubrics to find the criteria text.
- scoring_strategy (Literal['concept', 'aspect', 'weighted'], default: 'concept') – The scoring method: 'concept', 'aspect', or 'weighted'.
- check_for_contradictions (bool, default: False) – If True, uses a stricter prompt to penalize contradictions.
- weighted_concept_score_weight (float, default: 0.7) – The weight for the concept score in the 'weighted' strategy.
- multi_turn_strategy (Literal['last_turn', 'all_turns'], default: 'last_turn') – How to handle multi-turn conversations. 'last_turn' (default): evaluates only the last turn. 'all_turns': evaluates all Human->AI turns in the conversation.
- multi_turn_aggregation (Literal['cumulative', 'average'], default: 'cumulative') – Aggregation method for the 'all_turns' strategy. 'cumulative' (default): scores unique aspects covered across all turns. 'average': scores average aspect coverage per turn.
- **kwargs – Additional arguments passed to the parent class.
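How the 'weighted' strategy combines its two component scores is an assumption inferred from the weighted_concept_score_weight parameter; the library may blend them differently. A plausible sketch, assuming a convex combination of concept-level and aspect-level coverage:

```python
def weighted_criteria_score(concept_score, aspect_score, concept_weight=0.7):
    """Hypothetical blend for the 'weighted' strategy: concept_weight mirrors
    weighted_concept_score_weight (assumed convex combination)."""
    return concept_weight * concept_score + (1 - concept_weight) * aspect_score

print(round(weighted_criteria_score(0.9, 0.5), 2))  # 0.78
```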
execute async ¶
execute(item: DatasetItem, **kwargs) -> MetricEvaluationResult
Compute the score based on criteria.
Automatically handles single-turn or multi-turn evaluation based on
self.multi_turn_strategy and item.conversation.
get_signals ¶
Generates a list of detailed signals from the evaluation result that explain the scoring.
ContextualRelevancy¶
axion.metrics.ContextualRelevancy ¶
Bases: BaseMetric
Measures the relevancy of retrieval (how much retrieved content is relevant).
get_signals ¶
Defines the explainable signals for the ContextualRelevancy metric.
ContextualPrecision¶
axion.metrics.ContextualPrecision ¶
Bases: BaseMetric
Measures the quality of retrieval ranking using Mean Average Precision (MAP).
get_signals ¶
Defines the explainable signals for the ContextualPrecision metric.
ContextualRecall¶
axion.metrics.ContextualRecall ¶
Bases: BaseMetric
Measures recall of retrieval (how much of the expected answer is in context).
get_signals ¶
Defines the explainable signals for the ContextualRecall metric.
Heuristic Metrics¶
ExactStringMatch¶
axion.metrics.ExactStringMatch ¶
ExactStringMatch(model_name: Optional[str] = None, llm: Optional[LLMRunnable] = None, embed_model_name: Optional[str] = None, embed_model: Optional[EmbeddingRunnable] = None, threshold: float = None, llm_provider: Optional[str] = None, required_fields: Optional[List[str]] = None, optional_fields: Optional[List[str]] = None, metric_name: Optional[str] = None, metric_description: Optional[str] = None, name: Optional[str] = None, field_mapping: Optional[Dict[str, str]] = None, metric_category: Optional[MetricCategory] = None, **kwargs: Any)
Bases: BaseMetric
execute async ¶
execute(item: DatasetItem, **kwargs) -> MetricEvaluationResult
Returns 1.0 if the actual output exactly matches the expected output (after stripping). Returns 0.0 otherwise.
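The comparison rule above is easy to mirror in plain Python (an illustrative stand-in for the metric's core check, not the library's code):

```python
def exact_string_match(actual, expected):
    """1.0 when the whitespace-stripped strings are identical, else 0.0."""
    return 1.0 if actual.strip() == expected.strip() else 0.0

print(exact_string_match('  42\n', '42'))  # 1.0
print(exact_string_match('42', '43'))      # 0.0
```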
CitationPresence¶
axion.metrics.CitationPresence ¶
CitationPresence(mode: str = 'any_citation', strict: bool = False, embed_model: Optional[EmbeddingRunnable] = None, use_semantic_search: bool = False, resource_similarity_threshold: float = 0.8, custom_resource_phrases: Optional[List[str]] = None, **kwargs)
Bases: BaseMetric
A metric to evaluate if the response includes properly formatted citations, supporting single-turn or multi-turn conversations.
Initialize the Citation Presence metric.
Parameters:
- mode (str, default: 'any_citation') – Evaluation mode: "any_citation" or "resource_section".
- strict (bool, default: False) – If True, validates that found URLs are live by making a HEAD request.
- embed_model (Optional[EmbeddingRunnable], default: None) – Embedding model for semantic similarity.
- use_semantic_search (bool, default: False) – If True, uses the embedding model as a fallback.
- resource_similarity_threshold (float, default: 0.8) – Threshold for semantic similarity.
- custom_resource_phrases (Optional[List[str]], default: None) – Custom phrases to look for when detecting resource sections.
- **kwargs – Additional arguments passed to BaseMetric.
execute async ¶
execute(item: DatasetItem, **kwargs) -> MetricEvaluationResult
Evaluate citation presence for the entire conversation or single-turn response.
get_signals ¶
Generates detailed signals from the presence evaluation.
Latency¶
axion.metrics.Latency ¶
Bases: BaseMetric
Initialize the Latency metric.
Parameters:
- normalize (bool, default: False) – If True, normalize latency scores to the [0, 1] range.
- normalization_method (str, default: 'exponential') – Method for normalization. Options:
  - 'exponential': exp(-latency/threshold) for smooth decay
  - 'sigmoid': 1/(1 + exp((latency-threshold)/scale)) for an S-curve
  - 'reciprocal': threshold/(threshold + latency) for hyperbolic decay
  - 'linear': max(0, 1 - latency/threshold) for linear decay
- **kwargs – Additional arguments passed to the base metric.
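The four normalization curves above translate directly into code. A self-contained sketch of the documented formulas (the `scale` parameter for the sigmoid is shown with an assumed default of 1.0, since its default is not documented here):

```python
import math

def normalize_latency(latency, threshold, method='exponential', scale=1.0):
    """Map a raw latency into [0, 1] using the documented decay curves."""
    if method == 'exponential':
        return math.exp(-latency / threshold)
    if method == 'sigmoid':
        return 1.0 / (1.0 + math.exp((latency - threshold) / scale))
    if method == 'reciprocal':
        return threshold / (threshold + latency)
    if method == 'linear':
        return max(0.0, 1.0 - latency / threshold)
    raise ValueError(f'unknown normalization method: {method}')

print(normalize_latency(0.0, 2.0))                       # 1.0 (instant response)
print(normalize_latency(1.0, 2.0, method='linear'))      # 0.5
print(normalize_latency(2.0, 2.0, method='reciprocal'))  # 0.5
```

Note that 'exponential', 'reciprocal', and 'sigmoid' never reach exactly 0, while 'linear' clamps to 0 once latency exceeds the threshold.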
execute async ¶
execute(item: DatasetItem, **kwargs) -> MetricEvaluationResult
Returns the latency recorded for a given test case as the metric score.
This metric assumes the latency field is already populated on the DatasetItem
and returns it as-is or normalized based on the initialization parameters.
Parameters:
- item (DatasetItem) – The evaluation data point containing latency information.
Returns:
- MetricEvaluationResult – The result object containing the latency as the score.
Retrieval Metrics¶
HitRateAtK¶
axion.metrics.HitRateAtK ¶
Bases: _RetrievalMetric
Evaluates whether any relevant document was retrieved in the top K results. Score is 1 if a hit is found, 0 otherwise. Supports multiple K values.
Initialize the Hit Rate @ K metric.
Parameters:
- k – The number of top results to consider, or a list of K values.
- main_k – The K value to use for the main metric score (defaults to the max K in k_list).
get_signals ¶
Generates signals detailing the hit rate calculation for all K values.
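The hit-rate check itself is a one-liner, sketched here as an illustrative stand-in (not the library's code; document IDs are assumed as inputs):

```python
def hit_rate_at_k(retrieved_ids, relevant_ids, k):
    """1.0 if any of the top-k retrieved documents is relevant, else 0.0."""
    return 1.0 if any(doc in relevant_ids for doc in retrieved_ids[:k]) else 0.0

retrieved = ['d3', 'd7', 'd1', 'd9']
print(hit_rate_at_k(retrieved, {'d1'}, 2))  # 0.0 (first relevant doc is at rank 3)
print(hit_rate_at_k(retrieved, {'d1'}, 3))  # 1.0
```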
MeanReciprocalRank¶
axion.metrics.MeanReciprocalRank ¶
MeanReciprocalRank(model_name: Optional[str] = None, llm: Optional[LLMRunnable] = None, embed_model_name: Optional[str] = None, embed_model: Optional[EmbeddingRunnable] = None, threshold: float = None, llm_provider: Optional[str] = None, required_fields: Optional[List[str]] = None, optional_fields: Optional[List[str]] = None, metric_name: Optional[str] = None, metric_description: Optional[str] = None, name: Optional[str] = None, field_mapping: Optional[Dict[str, str]] = None, metric_category: Optional[MetricCategory] = None, **kwargs: Any)
Bases: _RetrievalMetric
Calculates the Mean Reciprocal Rank (MRR). Score is 1 / (rank of first relevant item). This metric is K-independent.
get_signals ¶
Generates signals detailing the Mean Reciprocal Rank calculation.
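The reciprocal-rank computation described above can be sketched directly (an illustrative stand-in, not the library's code):

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / (1-based rank of the first relevant document); 0.0 if none retrieved."""
    for rank, doc in enumerate(retrieved_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

print(round(reciprocal_rank(['d3', 'd7', 'd1'], {'d1'}), 3))  # 0.333 (first hit at rank 3)
print(reciprocal_rank(['d1', 'd3'], {'d1'}))                  # 1.0
```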
Conversational Metrics¶
GoalCompletion¶
axion.metrics.GoalCompletion ¶
GoalCompletion(goal_key: str = 'goal', completion_weight: float = 0.6, efficiency_weight: float = 0.4, bottleneck_threshold: int = 5, max_clarification_penalty: float = 0.3, clarification_penalty_rate: float = 0.1, goal_drift_threshold: float = 0.3, outcome_threshold_achieved: float = 0.8, outcome_threshold_partial: float = 0.4, **kwargs)
Bases: BaseMetric
Composite metric that analyzes:
1. Goal achievement (did we complete the task?)
2. Conversation efficiency (how well did we complete it?)
3. Temporal tracking (when/how did we achieve sub-goals?)
Reuses conversation analysis from shared components to minimize LLM calls. Uses unified analysis approach to reduce LLM calls from ~3+N to ~3 total.
Initialize the Goal Completion metric.
Parameters:
- goal_key (str, default: 'goal') – Key in additional_input containing the user's goal.
- completion_weight (float, default: 0.6) – Weight for pure goal achievement. Rationale: goal achievement is slightly more important than efficiency.
- efficiency_weight (float, default: 0.4) – Weight for conversation efficiency. Rationale: efficiency matters, but achieving the goal is primary.
- bottleneck_threshold (int, default: 5) – Number of turns before a sub-goal is flagged as a bottleneck. Rationale: most sub-goals should resolve within 3-5 turns in efficient conversations.
- max_clarification_penalty (float, default: 0.3) – Maximum penalty for clarifications. Rationale: excessive clarifications can reduce efficiency by up to 30%.
- clarification_penalty_rate (float, default: 0.1) – Penalty per clarification. Rationale: each clarification represents a 10% efficiency loss.
- goal_drift_threshold (float, default: 0.3) – Fraction of unmapped moments to trigger drift detection. Rationale: if >30% of the conversation is unrelated to the goal, it indicates drift.
- outcome_threshold_achieved (float, default: 0.8) – Minimum score for an "achieved" outcome. Rationale: 80%+ completion indicates successful goal achievement.
- outcome_threshold_partial (float, default: 0.4) – Minimum score for a "partially_achieved" outcome. Rationale: 40-80% completion indicates partial success.
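A plausible composition of these parameters can be sketched as follows. The exact formula is an assumption inferred from the documented weights and penalty rates, not the library's implementation; in particular, applying the clarification penalty to the efficiency term is a guess:

```python
def goal_completion_score(completion, efficiency, n_clarifications,
                          completion_weight=0.6, efficiency_weight=0.4,
                          clarification_penalty_rate=0.1,
                          max_clarification_penalty=0.3):
    """Hypothetical composition: weighted blend of achievement and efficiency,
    with a capped per-clarification penalty applied to the efficiency term."""
    penalty = min(n_clarifications * clarification_penalty_rate,
                  max_clarification_penalty)
    return (completion_weight * completion
            + efficiency_weight * max(0.0, efficiency - penalty))

print(round(goal_completion_score(1.0, 1.0, 0), 2))  # 1.0
print(round(goal_completion_score(1.0, 1.0, 5), 2))  # 0.88 (penalty capped at 0.3)
```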
execute async ¶
execute(item: DatasetItem, cache: Optional[AnalysisCache] = None) -> MetricEvaluationResult
Execute goal completion analysis using unified approach.
get_signals ¶
Generate comprehensive signals showing multi-layered analysis.
ConversationFlow¶
axion.metrics.ConversationFlow ¶
Bases: BaseMetric
Refactored conversation flow metric with modular, testable components.
Improvements over v1:
- Configurable penalties and thresholds
- Enum-based issue types (no string matching)
- Separate, testable detector classes
- Transparent score decomposition
- Better statistical methods
- Comprehensive signal generation
execute async ¶
execute(item: DatasetItem, cache: Optional[AnalysisCache] = None) -> MetricEvaluationResult
Execute comprehensive conversation flow analysis.
get_signals ¶
Generate comprehensive signals showing score calculation.