Runners API Reference¶
Evaluation execution engines for parallel and batch processing.
- evaluation_runner – High-level function for running complete evaluations across datasets with multiple metrics in parallel.
- MetricRunner – Lower-level runner for executing metrics across dataset items with concurrency control.
evaluation_runner¶
axion.runners.evaluation_runner ¶
evaluation_runner(evaluation_inputs: Union[Dataset, List[DatasetItem], DataFrame], evaluation_name: str, scoring_config: Optional[Union[List[Any], Dict[str, Any], str]] = None, scoring_metrics: Optional[List[Any]] = None, scoring_strategy: Optional[Union[BaseScoringStrategy, str, ScoringStrategyType]] = None, task: Optional[Union[Callable, BaseAPIRunner]] = None, scoring_key_mapping: Optional[Dict[str, str]] = None, evaluation_description: Optional[str] = None, evaluation_metadata: Optional[Dict[str, Any]] = None, max_concurrent: int = 5, throttle_delay: float = 0.0, summary_generator: Optional[BaseSummary] = None, cache_config: Optional[CacheConfig] = None, error_config: Optional[ErrorConfig] = None, enable_internal_caching: bool = True, thresholds: Optional[Dict[str, float]] = None, show_progress: bool = True, dataset_name: Optional[str] = None, run_id: Optional[str] = None, trace_granularity: Union[TraceGranularity, str] = SINGLE_TRACE, flush_per_metric: bool = False, enable_prompt_caching: bool = False) -> Optional[EvaluationResult]
Synchronously runs an evaluation experiment to evaluate metrics over a given dataset, supporting both flat and hierarchical scoring structures.
Parameters:
- evaluation_inputs (Union[Dataset, List[DatasetItem], DataFrame]) – The input dataset to evaluate.
- evaluation_name (str) – A unique name to identify the evaluation.
- scoring_config (Optional[Union[List[Any], Dict[str, Any], str]], default: None) – The scoring configuration. Can be:
    - A list of metrics for flat evaluation
    - A dictionary with a 'metric' key for flat evaluation (when scoring_strategy='flat')
    - A dictionary for hierarchical (EvalTree) evaluation (with model, weights, etc.)
    - A string file path to a YAML configuration file
- scoring_metrics (Optional[List[Any]], default: None) – An alternative, more intuitive parameter for passing a flat list of metrics.
- scoring_strategy (Optional[Union[BaseScoringStrategy, str, ScoringStrategyType]], default: None) – Defines the scoring method. Can be a pre-initialized strategy instance or a string/Enum alias ('flat' or 'tree'). Overrides auto-detection.
- task (Optional[Union[Callable, BaseAPIRunner]], default: None) – A custom function to generate model outputs before scoring.
- scoring_key_mapping (Optional[Dict[str, str]], default: None) – Maps metric input names to dataset column names.
- evaluation_description (Optional[str], default: None) – A human-readable description of the evaluation.
- evaluation_metadata (Optional[Dict[str, Any]], default: None) – Additional metadata to include in the evaluation trace.
- max_concurrent (int, default: 5) – Maximum number of concurrent evaluations.
- throttle_delay (float, default: 0.0) – Time in seconds to pause after each individual task execution. Acts as a client-side throttle to help prevent API rate-limit errors when processing a large number of items.
- summary_generator (Optional[BaseSummary], default: None) – A summary generator for high-level results.
- cache_config (Optional[CacheConfig], default: None) – Configuration for caching results to avoid recomputation.
- error_config (Optional[ErrorConfig], default: None) – Configuration for handling errors during evaluation.
- enable_internal_caching (bool, default: True) – Enables a per-item cache for metrics that share expensive internal computations.
- thresholds (Optional[Dict[str, float]], default: None) – Performance thresholds for each metric.
- show_progress (bool, default: True) – Whether to show a progress bar.
- dataset_name (Optional[str], default: None) – Optional name of the dataset.
- run_id (Optional[str], default: None) – An optional identifier for this specific run.
- trace_granularity (Union[TraceGranularity, str], default: SINGLE_TRACE) – Controls trace granularity during evaluation. Accepts enum or string values:
    - 'single_trace' / 'single' / SINGLE_TRACE (default): All evaluations run under one parent trace
    - 'separate' / SEPARATE: Each metric execution gets its own independent trace
- flush_per_metric (bool, default: False) – When trace_granularity='separate', controls whether each metric trace is flushed immediately (slower, but more "live" in the UI) or batched (faster).
- enable_prompt_caching (bool, default: False) – Enables provider-level prompt caching for all metrics. When True, propagates to all metrics and their sub-judges to mark system/few-shot prefixes as cacheable. Supports Anthropic (explicit caching) and OpenAI (automatic).
Returns:
- EvaluationResult (Optional[EvaluationResult]) – An object containing detailed metric scores, a summary, and metadata.
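The max_concurrent and throttle_delay parameters together describe a common client-side rate-limiting pattern: bound in-flight work with a semaphore and pause briefly after each completed task. The sketch below is a stdlib-only illustration of that pattern; the function and item names are hypothetical and not part of axion's API.

```python
import asyncio

async def run_with_limits(items, worker, max_concurrent=5, throttle_delay=0.0):
    """Run `worker` over `items`, at most `max_concurrent` at a time,
    pausing `throttle_delay` seconds after each task completes."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def _run(item):
        async with semaphore:
            result = await worker(item)
            if throttle_delay:
                # Client-side throttle: slow the request rate to avoid
                # provider rate limits on large datasets.
                await asyncio.sleep(throttle_delay)
            return result

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(_run(i) for i in items))

async def demo():
    async def score(item):  # hypothetical stand-in for a metric call
        await asyncio.sleep(0.01)
        return item * 2

    return await run_with_limits(range(10), score, max_concurrent=3)

results = asyncio.run(demo())
```

With throttle_delay left at 0.0 the semaphore alone limits concurrency; a small positive delay additionally spaces out requests over time.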
EvaluationRunner¶
axion.runners.EvaluationRunner ¶
EvaluationRunner(config: EvaluationConfig, tracer: Optional[BaseTraceHandler] = None)
Bases: RunnerMixin
Orchestrates the execution of evaluation experiments, managing task execution, metric scoring, and configuration. Automatically determines and initializes the appropriate scoring strategy (flat or hierarchical).
summary (property) ¶
Returns the summary from the active scoring strategy. For hierarchical ('tree') strategies, this provides the detailed tree summary.
tree (property) ¶
Returns the underlying EvalTree instance for inspection, if the 'tree' strategy is active. Raises an AttributeError for other strategies.
execute (async) ¶
execute() -> EvaluationResult
Executes the entire evaluation and returns the final result.
For SINGLE_TRACE mode, wraps execution in a trace span. For PER_ITEM and SEPARATE modes, skips the wrapper span to allow each item/metric to create its own independent trace.
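EvaluationRunner follows a config-object pattern: build a configuration object, hand it to the runner, then await execute() for the final result. The toy classes below sketch only that shape with stdlib code; they are stand-ins, not axion's EvaluationConfig or EvaluationRunner.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class ToyConfig:  # hypothetical stand-in for EvaluationConfig
    evaluation_name: str
    items: list = field(default_factory=list)

class ToyRunner:  # hypothetical stand-in for EvaluationRunner
    def __init__(self, config):
        self.config = config

    async def execute(self):
        # Score every item, then return a result summary.
        scores = [await self._score(item) for item in self.config.items]
        return {"name": self.config.evaluation_name,
                "mean": sum(scores) / len(scores)}

    async def _score(self, item):
        return float(item)

result = asyncio.run(ToyRunner(ToyConfig("demo", [1, 2, 3])).execute())
```

The config-object design keeps the runner's constructor small and makes a run reproducible from a single serializable object.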
EvaluationConfig¶
axion.runners.EvaluationConfig (dataclass) ¶
EvaluationConfig(evaluation_name: str, evaluation_inputs: Union[Dataset, List[DatasetItem], DataFrame], scoring_config: Optional[Union[List[Any], Dict[str, Any], str]] = None, scoring_metrics: Optional[List[Any]] = None, scoring_strategy: Optional[Union[BaseScoringStrategy, str, ScoringStrategyType]] = None, task: Optional[Union[Callable, BaseAPIRunner]] = None, scoring_key_mapping: Optional[Dict[str, str]] = None, evaluation_description: Optional[str] = None, evaluation_metadata: Optional[Dict[str, Any]] = None, max_concurrent: int = 5, throttle_delay: Optional[float] = 0.0, summary_generator: Optional[BaseSummary] = MetricSummary(), cache_config: CacheConfig = CacheConfig(), error_config: ErrorConfig = ErrorConfig(), thresholds: Optional[Dict[str, float]] = None, show_progress: bool = True, dataset_name: Optional[str] = None, run_id: Optional[str] = None, enable_internal_caching: bool = True, trace_granularity: Union[TraceGranularity, str] = SINGLE_TRACE, flush_per_metric: bool = False, enable_prompt_caching: bool = False)
Configuration for an evaluation run.
Attributes:
- evaluation_inputs (Union[Dataset, List[DatasetItem], DataFrame]) – The input dataset to evaluate. Can be a high-level Dataset object, a list of individual DatasetItem objects, or a preloaded pandas.DataFrame.
- scoring_config (Optional[Union[List[Any], Dict[str, Any], str]]) – The scoring configuration. Can be:
    - A list of metrics for flat evaluation
    - A dictionary with a 'metric' key for flat evaluation (when scoring_strategy='flat')
    - A dictionary for hierarchical (EvalTree) evaluation (with model, weights, etc.)
    - A string file path to a YAML configuration file
- scoring_metrics (List[Any]) – A list of metric objects or callables used to score each item in the dataset.
- scoring_strategy (Optional[Union[BaseScoringStrategy, str, ScoringStrategyType]]) – Defines the scoring method. Can be a pre-initialized strategy instance or a string/Enum alias ('flat' or 'tree'). Overrides auto-detection.
- evaluation_name (str) – A unique name to identify the evaluation. Used in trace logging and result storage.
- task (Optional[Union[Callable, BaseAPIRunner]]) – A custom function to generate predictions or transform inputs. If provided, it is run before scoring to produce the model output for each dataset item.
- scoring_key_mapping (Optional[Dict[str, str]]) – An optional dictionary mapping metric input names to dataset column names. Useful for adapting metrics to different schema formats.
- evaluation_description (Optional[str]) – A human-readable description of the evaluation for documentation and trace metadata.
- evaluation_metadata (Optional[Dict[str, Any]]) – Additional metadata to include in the evaluation trace (e.g., model version, data slice info, tags).
- max_concurrent (int) – Maximum number of metric evaluations to run concurrently. Defaults to 5.
- throttle_delay (float) – Time in seconds to pause after each individual task execution. Acts as a client-side throttle to help prevent API rate-limit errors when processing a large number of items. Defaults to 0.0 (no delay).
- summary_generator (Optional[BaseSummary]) – Optional summary generator used to produce a high-level summary after the evaluation. If not provided, a default MetricSummary is used.
- cache_config (CacheConfig) – Configuration for caching metric results to avoid recomputation. Enables both read and write caching.
- error_config (ErrorConfig) – Configuration for how errors are handled during evaluation. Allows skipping metrics or suppressing failures.
- thresholds (Optional[Dict[str, float]]) – Optional threshold values for each metric. Used to flag items or datasets that fall below a given performance level.
- show_progress (bool) – Whether to show a progress bar during evaluation. Defaults to True.
- dataset_name (Optional[str]) – Optional name of the dataset being evaluated. Used for display and trace logging.
- run_id (Optional[str]) – An optional identifier for this specific run. Useful for repeatability and audit logging.
- enable_internal_caching (bool) – Enables a per-item cache for metrics that share expensive internal computations. Defaults to True.
- trace_granularity (Union[TraceGranularity, str]) – Controls trace granularity during evaluation. Accepts enum or string values:
    - 'single_trace' / 'single' / SINGLE_TRACE (default): All evaluations run under one parent trace
    - 'separate' / SEPARATE: Each metric execution gets its own independent trace
- flush_per_metric (bool) – When trace_granularity='separate', controls whether each metric trace is flushed immediately (slower, but more "live" in the UI) or batched (faster). Defaults to False.
- enable_prompt_caching (bool) – Enables provider-level prompt caching for all metrics. When True, propagates to all metrics and their sub-judges to mark system/few-shot prefixes as cacheable. Supports Anthropic (explicit caching) and OpenAI (automatic). Defaults to False.
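The scoring_key_mapping attribute adapts a dataset's column names to the input names a metric expects. The snippet below is a stdlib-only illustration of that documented remapping; the helper, column, and field names are made up for the example and the library's internal behavior may differ in detail.

```python
def apply_key_mapping(row, scoring_key_mapping):
    """Rename dataset columns to the metric's expected input names.
    Keys are metric input names; values are dataset column names."""
    if not scoring_key_mapping:
        return dict(row)
    remapped = dict(row)
    for metric_key, dataset_column in scoring_key_mapping.items():
        if dataset_column in remapped:
            # Move the value from the dataset's column name to the
            # name the metric expects.
            remapped[metric_key] = remapped.pop(dataset_column)
    return remapped

# A dataset row whose schema differs from what the metric expects.
row = {"question": "What is 2+2?", "model_answer": "4"}
mapped = apply_key_mapping(row, {"input": "question", "output": "model_answer"})
```

This kind of mapping lets one metric definition run against datasets with different schemas without renaming the underlying columns.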
MetricRunner¶
axion.runners.MetricRunner (dataclass) ¶
MetricRunner(metrics: List[Any], name: str = 'MetricRunner', description: str = 'Orchestrates evaluation metrics', max_concurrent: int = 5, thresholds: Optional[Dict[str, float]] = None, summary_generator: Optional[BaseSummary] = MetricSummary(), cache_manager: Optional[CacheManager] = None, error_config: ErrorConfig = ErrorConfig(), tracer: Optional[BaseTraceHandler] = None, dataset_name: Optional[str] = 'Metric Runner Dataset', enable_internal_caching: bool = True, trace_granularity: TraceGranularity = SEPARATE, flush_per_metric: bool = False)
Bases: RunnerMixin
Orchestrates the evaluation of multiple metrics against a dataset.
available_types (property) ¶
Returns a list of available (registered) metric runner types.
elapsed_time (property) ¶
Returns the total execution time for the last batch run.
execute_batch (async) ¶
execute_batch(evaluation_inputs: Union[Dataset, List[DatasetItem], DataFrame], *, show_progress: bool = True) -> List[TestResult]
Executes all configured metrics against the provided dataset.
Trace granularity behavior:
- SINGLE_TRACE: All metrics run under one parent trace
- SEPARATE: Each metric execution gets its own independent trace (the MetricRunner default, per the constructor signature above)
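Conceptually, a batch execution crosses every configured metric with every dataset item and can flag scores against per-metric thresholds. The sketch below models only that loop with stdlib asyncio; the metric function, result keys, and threshold handling are illustrative assumptions, not MetricRunner's actual TestResult schema.

```python
import asyncio

async def execute_batch(items, metrics, thresholds=None):
    """Toy model of a metric runner's batch loop: evaluate every metric
    against every item and flag scores below the metric's threshold."""
    thresholds = thresholds or {}

    async def run_one(item, name, metric):
        score = await metric(item)
        passed = score >= thresholds.get(name, 0.0)
        return {"item": item, "metric": name, "score": score, "passed": passed}

    tasks = [run_one(item, name, metric)
             for item in items
             for name, metric in metrics.items()]
    return await asyncio.gather(*tasks)

async def exact_match(item):  # hypothetical metric
    return 1.0 if item["output"] == item["expected"] else 0.0

items = [{"output": "4", "expected": "4"},
         {"output": "5", "expected": "4"}]
results = asyncio.run(execute_batch(items, {"exact_match": exact_match},
                                    thresholds={"exact_match": 0.5}))
```

Results come back in item order, one entry per (item, metric) pair, which mirrors the List[TestResult] return described above.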
CacheManager¶
axion._core.cache.CacheManager ¶
CacheManager(config: CacheConfig = None)
Manages cache operations for both memory and disk, abstracting the backend.
CacheConfig¶
axion._core.cache.CacheConfig (dataclass) ¶
CacheConfig(use_cache: bool = True, write_cache: bool = True, cache_type: str = 'memory', cache_dir: Optional[str] = '.cache', cache_task: bool = True)
Configuration class for controlling caching behavior of metric evaluations.
Attributes:
- use_cache (bool) – If True, attempts to read previously computed results from cache to avoid redundant computation.
- write_cache (bool) – If True, writes newly computed metric results to cache for future use. Has no effect if use_cache is False.
- cache_type (str) – Type of caching backend to use:
    - 'memory': Uses an in-memory dictionary for caching (fast, but non-persistent).
    - 'disk': Writes cache to disk (persistent across runs).
- cache_dir (Optional[str]) – Directory path where disk cache files will be stored. Only used when cache_type='disk'. Defaults to '.cache'.
- cache_task (bool) – If True, enables caching at the task level (e.g., for full evaluation runs). If False, caching applies only at the metric level.
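The interaction of use_cache, write_cache, and cache_type can be modeled in a few lines. The class below is a stdlib-only toy that follows the semantics documented above (reads gated by use_cache, writes gated by write_cache, backend chosen by cache_type); it is not CacheManager's implementation, and the JSON-file layout is an assumption for the example.

```python
import json
import tempfile
from pathlib import Path

class SimpleCache:
    """Toy model of CacheConfig semantics."""
    def __init__(self, use_cache=True, write_cache=True,
                 cache_type="memory", cache_dir=".cache"):
        self.use_cache = use_cache
        self.write_cache = write_cache
        self.cache_type = cache_type
        self.cache_dir = Path(cache_dir)
        self._memory = {}

    def _path(self, key):
        return self.cache_dir / f"{key}.json"

    def get(self, key):
        if not self.use_cache:
            return None  # reads disabled entirely
        if self.cache_type == "memory":
            return self._memory.get(key)
        path = self._path(key)
        return json.loads(path.read_text()) if path.exists() else None

    def set(self, key, value):
        # Per the docs, write_cache has no effect when use_cache is False.
        if not (self.use_cache and self.write_cache):
            return
        if self.cache_type == "memory":
            self._memory[key] = value
        else:
            self.cache_dir.mkdir(parents=True, exist_ok=True)
            self._path(key).write_text(json.dumps(value))

mem = SimpleCache(cache_type="memory")
mem.set("item-1", {"score": 0.9})

disk = SimpleCache(cache_type="disk", cache_dir=tempfile.mkdtemp())
disk.set("item-1", {"score": 0.9})
```

The memory backend is fastest for a single process, while the disk backend lets repeated runs over the same dataset skip recomputation.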