Runners API Reference¶
Evaluation execution engines for parallel and batch processing.
- evaluation_runner – High-level function for running complete evaluations across datasets with multiple metrics in parallel.
- MetricRunner – Lower-level runner for executing metrics across dataset items with concurrency control.
evaluation_runner¶
axion.runners.evaluation_runner ¶
evaluation_runner(evaluation_inputs: Union[Dataset, List[DatasetItem], DataFrame], evaluation_name: str, scoring_config: Optional[Union[List[Any], Dict[str, Any], str]] = None, scoring_metrics: Optional[List[Any]] = None, scoring_strategy: Optional[Union[BaseScoringStrategy, str, ScoringStrategyType]] = None, task: Optional[Union[Callable, BaseAPIRunner]] = None, scoring_key_mapping: Optional[Dict[str, str]] = None, evaluation_description: Optional[str] = None, evaluation_metadata: Optional[Dict[str, Any]] = None, max_concurrent: int = 5, throttle_delay: float = 0.0, summary_generator: Optional[BaseSummary] = None, cache_config: Optional[CacheConfig] = None, error_config: Optional[ErrorConfig] = None, enable_internal_caching: bool = True, thresholds: Optional[Dict[str, float]] = None, show_progress: bool = True, dataset_name: Optional[str] = None, run_id: Optional[str] = None, trace_granularity: Union[TraceGranularity, str] = SINGLE_TRACE, flush_per_metric: bool = False, enable_prompt_caching: bool = False) -> Optional[EvaluationResult]
Synchronously runs an evaluation experiment to evaluate metrics over a given dataset, supporting both flat and hierarchical scoring structures.
Parameters:
- evaluation_inputs (Union[Dataset, List[DatasetItem], DataFrame]) – The input dataset to evaluate.
- evaluation_name (str) – A unique name to identify the evaluation.
- scoring_config (Optional[Union[List[Any], Dict[str, Any], str]], default: None) – The scoring configuration. Can be:
    - A list of metrics for flat evaluation
    - A dictionary with a 'metric' key for flat evaluation (when scoring_strategy='flat')
    - A dictionary for hierarchical (EvalTree) evaluation (with model, weights, etc.)
    - A string file path to a YAML configuration file
- scoring_metrics (Optional[List[Any]], default: None) – An alternative, more intuitive parameter for passing a flat list of metrics.
- scoring_strategy (Optional[Union[BaseScoringStrategy, str, ScoringStrategyType]], default: None) – Defines the scoring method. Can be a pre-initialized strategy instance or a string/Enum alias ('flat' or 'tree'). Overrides auto-detection.
- task (Optional[Union[Callable, BaseAPIRunner]], default: None) – A custom function to generate model outputs before scoring.
- scoring_key_mapping (Optional[Dict[str, str]], default: None) – Maps metric input names to dataset column names.
- evaluation_description (Optional[str], default: None) – A human-readable description of the evaluation.
- evaluation_metadata (Optional[Dict[str, Any]], default: None) – Additional metadata to include in the evaluation trace.
- max_concurrent (int, default: 5) – Maximum number of concurrent evaluations.
- throttle_delay (float, default: 0.0) – Time in seconds to pause after each individual task execution. Acts as a client-side throttle to help prevent API rate-limit errors when processing a large number of items.
- summary_generator (Optional[BaseSummary], default: None) – A summary generator for high-level results.
- cache_config (Optional[CacheConfig], default: None) – Configuration for caching results to avoid recomputation.
- error_config (Optional[ErrorConfig], default: None) – Configuration for handling errors during evaluation.
- enable_internal_caching (bool, default: True) – Enables a per-item cache for metrics that share expensive internal computations.
- thresholds (Optional[Dict[str, float]], default: None) – Performance thresholds for each metric.
- show_progress (bool, default: True) – Whether to show a progress bar.
- dataset_name (Optional[str], default: None) – Optional name of the dataset.
- run_id (Optional[str], default: None) – An optional identifier for this specific run.
- trace_granularity (Union[TraceGranularity, str], default: SINGLE_TRACE) – Controls trace granularity during evaluation. Accepts enum or string values:
    - 'single_trace' / 'single' / SINGLE_TRACE (default): All evaluations run under one parent trace
    - 'separate' / SEPARATE: Each metric execution gets its own independent trace
- flush_per_metric (bool, default: False) – When trace_granularity='separate', controls whether each metric trace is flushed immediately (slower, but more "live" in the UI) or batched (faster).
- enable_prompt_caching (bool, default: False) – Enables provider-level prompt caching for all metrics. When True, propagates to all metrics and their sub-judges to mark system/few-shot prefixes as cacheable. Supports Anthropic (explicit caching) and OpenAI (automatic).
Returns:
- EvaluationResult (Optional[EvaluationResult]) – An object containing detailed metric scores, a summary, and metadata.
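The max_concurrent and throttle_delay parameters together describe a common client-side rate-limiting pattern: bound in-flight work with a semaphore and pause briefly after each completed task. The sketch below is a stdlib-only illustration of that pattern; the function and item names are hypothetical and not part of axion's API.

```python
import asyncio

async def run_with_limits(items, worker, max_concurrent=5, throttle_delay=0.0):
    """Run `worker` over `items`, at most `max_concurrent` at a time,
    pausing `throttle_delay` seconds after each task completes."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def _run(item):
        async with semaphore:
            result = await worker(item)
            if throttle_delay:
                # Client-side throttle: slow the request rate to avoid
                # provider rate limits on large datasets.
                await asyncio.sleep(throttle_delay)
            return result

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(_run(i) for i in items))

async def demo():
    async def score(item):  # hypothetical stand-in for a metric call
        await asyncio.sleep(0.01)
        return item * 2

    return await run_with_limits(range(10), score, max_concurrent=3)

results = asyncio.run(demo())
```

With throttle_delay left at 0.0 the semaphore alone limits concurrency; a small positive delay additionally spaces out requests over time.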
EvaluationRunner¶
axion.runners.EvaluationRunner ¶
EvaluationRunner(config: EvaluationConfig, tracer: Optional[BaseTraceHandler] = None)
Bases: RunnerMixin
Orchestrates the execution of evaluation experiments, managing task execution, metric scoring, and configuration. Automatically determines and initializes the appropriate scoring strategy (flat or hierarchical).
summary (property) ¶
Returns the summary from the active scoring strategy. For hierarchical ('tree') strategies, this provides the detailed tree summary.
tree (property) ¶
Returns the underlying EvalTree instance for inspection, if the 'tree' strategy is active. Raises an AttributeError for other strategies.
execute (async) ¶
execute() -> EvaluationResult
Executes the entire evaluation and returns the final result.
For SINGLE_TRACE mode, wraps execution in a trace span. For PER_ITEM and SEPARATE modes, skips the wrapper span to allow each item/metric to create its own independent trace.
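EvaluationRunner follows a config-object pattern: build a configuration object, hand it to the runner, then await execute() for the final result. The toy classes below sketch only that shape with stdlib code; they are stand-ins, not axion's EvaluationConfig or EvaluationRunner.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class ToyConfig:  # hypothetical stand-in for EvaluationConfig
    evaluation_name: str
    items: list = field(default_factory=list)

class ToyRunner:  # hypothetical stand-in for EvaluationRunner
    def __init__(self, config):
        self.config = config

    async def execute(self):
        # Score every item, then return a result summary.
        scores = [await self._score(item) for item in self.config.items]
        return {"name": self.config.evaluation_name,
                "mean": sum(scores) / len(scores)}

    async def _score(self, item):
        return float(item)

result = asyncio.run(ToyRunner(ToyConfig("demo", [1, 2, 3])).execute())
```

The config-object design keeps the runner's constructor small and makes a run reproducible from a single serializable object.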
EvaluationConfig¶
axion.runners.EvaluationConfig (dataclass) ¶
EvaluationConfig(evaluation_name: str, evaluation_inputs: Union[Dataset, List[DatasetItem], DataFrame], scoring_config: Optional[Union[List[Any], Dict[str, Any], str]] = None, scoring_metrics: Optional[List[Any]] = None, scoring_strategy: Optional[Union[BaseScoringStrategy, str, ScoringStrategyType]] = None, task: Optional[Union[Callable, BaseAPIRunner]] = None, scoring_key_mapping: Optional[Dict[str, str]] = None, evaluation_description: Optional[str] = None, evaluation_metadata: Optional[Dict[str, Any]] = None, max_concurrent: int = 5, throttle_delay: Optional[float] = 0.0, summary_generator: Optional[BaseSummary] = MetricSummary(), cache_config: CacheConfig = CacheConfig(), error_config: ErrorConfig = ErrorConfig(), thresholds: Optional[Dict[str, float]] = None, show_progress: bool = True, dataset_name: Optional[str] = None, run_id: Optional[str] = None, enable_internal_caching: bool = True, trace_granularity: Union[TraceGranularity, str] = SINGLE_TRACE, flush_per_metric: bool = False, enable_prompt_caching: bool = False)
Configuration for an evaluation run.
Attributes:
- evaluation_inputs (Union[Dataset, List[DatasetItem], DataFrame]) – The input dataset to evaluate. Can be a high-level Dataset object, a list of individual DatasetItem objects, or a preloaded pandas.DataFrame.
- scoring_config (Optional[Union[List[Any], Dict[str, Any], str]]) – The scoring configuration. Can be:
    - A list of metrics for flat evaluation
    - A dictionary with a 'metric' key for flat evaluation (when scoring_strategy='flat')
    - A dictionary for hierarchical (EvalTree) evaluation (with model, weights, etc.)
    - A string file path to a YAML configuration file
- scoring_metrics (List[Any]) – A list of metric objects or callables used to score each item in the dataset.
- scoring_strategy (Optional[Union[BaseScoringStrategy, str, ScoringStrategyType]]) – Defines the scoring method. Can be a pre-initialized strategy instance or a string/Enum alias ('flat' or 'tree'). Overrides auto-detection.
- evaluation_name (str) – A unique name to identify the evaluation. Used in trace logging and result storage.
- task (Optional[Union[Callable, BaseAPIRunner]]) – A custom function to generate predictions or transform inputs. If provided, it is run before scoring to produce the model output for each dataset item.
- scoring_key_mapping (Optional[Dict[str, str]]) – An optional dictionary mapping metric input names to dataset column names. Useful for adapting metrics to different schema formats.
- evaluation_description (Optional[str]) – A human-readable description of the evaluation for documentation and trace metadata.
- evaluation_metadata (Optional[Dict[str, Any]]) – Additional metadata to include in the evaluation trace (e.g., model version, data slice info, tags).
- max_concurrent (int) – Maximum number of metric evaluations to run concurrently. Defaults to 5.
- throttle_delay (float) – Time in seconds to pause after each individual task execution. Acts as a client-side throttle to help prevent API rate-limit errors when processing a large number of items. Defaults to 0.0 (no delay).
- summary_generator (Optional[BaseSummary]) – Optional summary generator used to produce a high-level summary after the evaluation. If not provided, a default MetricSummary is used.
- cache_config (CacheConfig) – Configuration for caching metric results to avoid recomputation. Enables both read and write caching.
- error_config (ErrorConfig) – Configuration for how errors are handled during evaluation. Allows skipping metrics or suppressing failures.
- thresholds (Optional[Dict[str, float]]) – Optional threshold values for each metric. Used to flag items or datasets that fall below a given performance level.
- show_progress (bool) – Whether to show a progress bar during evaluation. Defaults to True.
- dataset_name (Optional[str]) – Optional name of the dataset being evaluated. Used for display and trace logging.
- run_id (Optional[str]) – An optional identifier for this specific run. Useful for repeatability and audit logging.
- enable_internal_caching (bool) – Enables a per-item cache for metrics that share expensive internal computations. Defaults to True.
- trace_granularity (Union[TraceGranularity, str]) – Controls trace granularity during evaluation. Accepts enum or string values:
    - 'single_trace' / 'single' / SINGLE_TRACE (default): All evaluations run under one parent trace
    - 'separate' / SEPARATE: Each metric execution gets its own independent trace
- flush_per_metric (bool) – When trace_granularity='separate', controls whether each metric trace is flushed immediately (slower, but more "live" in the UI) or batched (faster). Defaults to False.
- enable_prompt_caching (bool) – Enables provider-level prompt caching for all metrics. When True, propagates to all metrics and their sub-judges to mark system/few-shot prefixes as cacheable. Supports Anthropic (explicit caching) and OpenAI (automatic). Defaults to False.
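The scoring_key_mapping attribute adapts a dataset's column names to the input names a metric expects. The snippet below is a stdlib-only illustration of that documented remapping; the helper, column, and field names are made up for the example and the library's internal behavior may differ in detail.

```python
def apply_key_mapping(row, scoring_key_mapping):
    """Rename dataset columns to the metric's expected input names.
    Keys are metric input names; values are dataset column names."""
    if not scoring_key_mapping:
        return dict(row)
    remapped = dict(row)
    for metric_key, dataset_column in scoring_key_mapping.items():
        if dataset_column in remapped:
            # Move the value from the dataset's column name to the
            # name the metric expects.
            remapped[metric_key] = remapped.pop(dataset_column)
    return remapped

# A dataset row whose schema differs from what the metric expects.
row = {"question": "What is 2+2?", "model_answer": "4"}
mapped = apply_key_mapping(row, {"input": "question", "output": "model_answer"})
```

This kind of mapping lets one metric definition run against datasets with different schemas without renaming the underlying columns.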
MetricRunner¶
axion.runners.MetricRunner (dataclass) ¶
MetricRunner(metrics: List[Any], name: str = 'MetricRunner', description: str = 'Orchestrates evaluation metrics', max_concurrent: int = 5, thresholds: Optional[Dict[str, float]] = None, summary_generator: Optional[BaseSummary] = MetricSummary(), cache_manager: Optional[CacheManager] = None, error_config: ErrorConfig = ErrorConfig(), tracer: Optional[BaseTraceHandler] = None, dataset_name: Optional[str] = 'Metric Runner Dataset', enable_internal_caching: bool = True, trace_granularity: TraceGranularity = SEPARATE, flush_per_metric: bool = False)
Bases: RunnerMixin
Orchestrates the evaluation of multiple metrics against a dataset.
available_types (property) ¶
Returns a list of available (registered) metric runner types.
elapsed_time (property) ¶
Returns the total execution time for the last batch run.
execute_batch (async) ¶
execute_batch(evaluation_inputs: Union[Dataset, List[DatasetItem], DataFrame], *, show_progress: bool = True) -> List[TestResult]
Executes all configured metrics against the provided dataset.
Trace granularity behavior:
- SINGLE_TRACE: All metrics run under one parent trace
- SEPARATE: Each metric execution gets its own independent trace (the MetricRunner default, per the constructor signature above)
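Conceptually, a batch execution crosses every configured metric with every dataset item and can flag scores against per-metric thresholds. The sketch below models only that loop with stdlib asyncio; the metric function, result keys, and threshold handling are illustrative assumptions, not MetricRunner's actual TestResult schema.

```python
import asyncio

async def execute_batch(items, metrics, thresholds=None):
    """Toy model of a metric runner's batch loop: evaluate every metric
    against every item and flag scores below the metric's threshold."""
    thresholds = thresholds or {}

    async def run_one(item, name, metric):
        score = await metric(item)
        passed = score >= thresholds.get(name, 0.0)
        return {"item": item, "metric": name, "score": score, "passed": passed}

    tasks = [run_one(item, name, metric)
             for item in items
             for name, metric in metrics.items()]
    return await asyncio.gather(*tasks)

async def exact_match(item):  # hypothetical metric
    return 1.0 if item["output"] == item["expected"] else 0.0

items = [{"output": "4", "expected": "4"},
         {"output": "5", "expected": "4"}]
results = asyncio.run(execute_batch(items, {"exact_match": exact_match},
                                    thresholds={"exact_match": 0.5}))
```

Results come back in item order, one entry per (item, metric) pair, which mirrors the List[TestResult] return described above.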
CacheManager¶
axion._core.cache.CacheManager ¶
CacheManager(config: CacheConfig = None)
Manages cache operations for both memory and disk, abstracting the backend.
CacheConfig¶
axion._core.cache.CacheConfig (dataclass) ¶
CacheConfig(use_cache: bool = True, write_cache: bool = True, cache_type: str = 'memory', cache_dir: Optional[str] = '.cache', cache_task: bool = True)
Configuration class for controlling caching behavior of metric evaluations.
Attributes:
- use_cache (bool) – If True, attempts to read previously computed results from cache to avoid redundant computation.
- write_cache (bool) – If True, writes newly computed metric results to cache for future use. Has no effect if use_cache is False.
- cache_type (str) – Type of caching backend to use:
    - 'memory': Uses an in-memory dictionary for caching (fast, but non-persistent).
    - 'disk': Writes cache to disk (persistent across runs).
- cache_dir (Optional[str]) – Directory path where disk cache files will be stored. Only used when cache_type='disk'. Defaults to '.cache'.
- cache_task (bool) – If True, enables caching at the task level (e.g., for full evaluation runs). If False, caching applies only at the metric level.
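The interaction of use_cache, write_cache, and cache_type can be modeled in a few lines. The class below is a stdlib-only toy that follows the semantics documented above (reads gated by use_cache, writes gated by write_cache, backend chosen by cache_type); it is not CacheManager's implementation, and the JSON-file layout is an assumption for the example.

```python
import json
import tempfile
from pathlib import Path

class SimpleCache:
    """Toy model of CacheConfig semantics."""
    def __init__(self, use_cache=True, write_cache=True,
                 cache_type="memory", cache_dir=".cache"):
        self.use_cache = use_cache
        self.write_cache = write_cache
        self.cache_type = cache_type
        self.cache_dir = Path(cache_dir)
        self._memory = {}

    def _path(self, key):
        return self.cache_dir / f"{key}.json"

    def get(self, key):
        if not self.use_cache:
            return None  # reads disabled entirely
        if self.cache_type == "memory":
            return self._memory.get(key)
        path = self._path(key)
        return json.loads(path.read_text()) if path.exists() else None

    def set(self, key, value):
        # Per the docs, write_cache has no effect when use_cache is False.
        if not (self.use_cache and self.write_cache):
            return
        if self.cache_type == "memory":
            self._memory[key] = value
        else:
            self.cache_dir.mkdir(parents=True, exist_ok=True)
            self._path(key).write_text(json.dumps(value))

mem = SimpleCache(cache_type="memory")
mem.set("item-1", {"score": 0.9})

disk = SimpleCache(cache_type="disk", cache_dir=tempfile.mkdtemp())
disk.set("item-1", {"score": 0.9})
```

The memory backend is fastest for a single process, while the disk backend lets repeated runs over the same dataset skip recomputation.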