Runners API Reference

Evaluation execution engines for parallel and batch processing.

from axion.runners import (
    evaluation_runner,
    EvaluationRunner,
    EvaluationConfig,
    MetricRunner,
)
from axion._core.cache import CacheManager, CacheConfig

evaluation_runner

High-level function for running complete evaluations across datasets with multiple metrics in parallel.


MetricRunner

Lower-level runner for executing a single metric across dataset items with concurrency control.


evaluation_runner

axion.runners.evaluation_runner

evaluation_runner(
    evaluation_inputs: Union[Dataset, List[DatasetItem], DataFrame],
    evaluation_name: str,
    scoring_config: Optional[Union[List[Any], Dict[str, Any], str]] = None,
    scoring_metrics: Optional[List[Any]] = None,
    scoring_strategy: Optional[Union[BaseScoringStrategy, str, ScoringStrategyType]] = None,
    task: Optional[Union[Callable, BaseAPIRunner]] = None,
    scoring_key_mapping: Optional[Dict[str, str]] = None,
    evaluation_description: Optional[str] = None,
    evaluation_metadata: Optional[Dict[str, Any]] = None,
    max_concurrent: int = 5,
    throttle_delay: float = 0.0,
    summary_generator: Optional[BaseSummary] = None,
    cache_config: Optional[CacheConfig] = None,
    error_config: Optional[ErrorConfig] = None,
    enable_internal_caching: bool = True,
    thresholds: Optional[Dict[str, float]] = None,
    show_progress: bool = True,
    dataset_name: Optional[str] = None,
    run_id: Optional[str] = None,
    trace_granularity: Union[TraceGranularity, str] = SINGLE_TRACE,
    flush_per_metric: bool = False,
    enable_prompt_caching: bool = False,
) -> Optional[EvaluationResult]

Synchronously runs an evaluation experiment, scoring metrics over a given dataset. Supports both flat and hierarchical scoring structures.

Parameters:

  • evaluation_inputs (Union[Dataset, List[DatasetItem], DataFrame]) –

    The input dataset to evaluate.

  • evaluation_name (str) –

    A unique name to identify the evaluation.

  • scoring_config (Optional[Union[List[Any], Dict[str, Any], str]], default: None ) –

    The scoring configuration. Can be:

      - A list of metrics for flat evaluation
      - A dictionary with a 'metric' key for flat evaluation (when scoring_strategy='flat')
      - A dictionary for hierarchical (EvalTree) evaluation (with model, weights, etc.)
      - A string file path to a YAML configuration file

  • scoring_metrics (Optional[List[Any]], default: None ) –

    An alternative, more intuitive parameter for passing a flat list of metrics.

  • scoring_strategy (Optional[Union[BaseScoringStrategy, str, ScoringStrategyType]], default: None ) –

    Defines the scoring method. Can be a pre-initialized strategy instance or a string/Enum alias ('flat' or 'tree'). Overrides auto-detection.

  • task (Optional[Union[Callable, BaseAPIRunner]], default: None ) –

    A custom function to generate model outputs before scoring.

  • scoring_key_mapping (Optional[Dict[str, str]], default: None ) –

    Maps metric input names to dataset column names.

  • evaluation_description (Optional[str], default: None ) –

    A human-readable description of the evaluation.

  • evaluation_metadata (Optional[Dict[str, Any]], default: None ) –

    Additional metadata to include in the evaluation trace.

  • max_concurrent (int, default: 5 ) –

    Maximum number of concurrent evaluations. Defaults to 5.

  • throttle_delay (float, default: 0.0 ) –

    Specifies the time in seconds to pause after each individual task execution. This is used as a client-side throttle to help prevent API rate limit errors when processing a large number of items. Defaults to 0.0 (no delay).

  • summary_generator (Optional[BaseSummary], default: None ) –

    A summary generator for high-level results.

  • cache_config (Optional[CacheConfig], default: None ) –

    Configuration for caching results to avoid recomputation.

  • error_config (Optional[ErrorConfig], default: None ) –

    Configuration for handling errors during evaluation.

  • enable_internal_caching (bool, default: True ) –

    Enables a per-item cache for metrics that share expensive internal computations. Defaults to True.

  • thresholds (Optional[Dict[str, float]], default: None ) –

    Performance thresholds for each metric.

  • show_progress (bool, default: True ) –

    Whether to show a progress bar. Defaults to True.

  • dataset_name (Optional[str], default: None ) –

    Optional name of the dataset.

  • run_id (Optional[str], default: None ) –

    An optional identifier for this specific run.

  • trace_granularity (Union[TraceGranularity, str], default: SINGLE_TRACE ) –

    Controls trace granularity during evaluation. Accepts enum or string values:

      - 'single_trace' / 'single' / SINGLE_TRACE (default): all evaluations run under one parent trace
      - 'separate' / SEPARATE: each metric execution gets its own independent trace

  • flush_per_metric (bool, default: False ) –

    When trace_granularity='separate', controls whether each metric trace is flushed immediately (slower, but more "live" in the UI) vs batched (faster). Defaults to False.

  • enable_prompt_caching (bool, default: False ) –

    Enables provider-level prompt caching for all metrics. When True, propagates to all metrics and their sub-judges to mark system/few-shot prefixes as cacheable. Supports Anthropic (explicit caching) and OpenAI (automatic). Defaults to False.

Returns:

  • EvaluationResult (Optional[EvaluationResult]) –

    An object containing detailed metric scores, summary, and metadata.
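A minimal usage sketch follows. The dict rows (standing in for DatasetItem objects), the empty metric list, and the 'answer_relevancy' threshold key are illustrative assumptions, not names defined in this reference; a real run would pass metric instances in scoring_metrics.

```python
# Illustrative rows: the `response` column is remapped to the metric
# input name `actual_output` via scoring_key_mapping.
items = [
    {"input": "What is the capital of France?", "response": "Paris"},
    {"input": "What is 2 + 2?", "response": "4"},
]
scoring_key_mapping = {"actual_output": "response"}

def run_eval():
    # Imported lazily so the sketch can be read without axion installed.
    from axion.runners import evaluation_runner

    return evaluation_runner(
        evaluation_inputs=items,
        evaluation_name="qa-regression-v1",    # unique run name
        scoring_metrics=[],                    # metric instances in a real run
        scoring_key_mapping=scoring_key_mapping,
        max_concurrent=5,
        throttle_delay=0.5,                    # client-side rate-limit throttle
        thresholds={"answer_relevancy": 0.7},  # hypothetical metric key
    )
```

Because evaluation_runner is synchronous, no event loop management is needed; it returns the EvaluationResult directly (or None).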


EvaluationRunner

axion.runners.EvaluationRunner

EvaluationRunner(config: EvaluationConfig, tracer: Optional[BaseTraceHandler] = None)

Bases: RunnerMixin

Orchestrates the execution of evaluation experiments, managing task execution, metric scoring, and configuration. Automatically determines and initializes the appropriate scoring strategy (flat or hierarchical).

summary property

summary: Union[Dict[str, Any], None]

Returns the summary from the active scoring strategy. For hierarchical ('tree') strategies, this provides the detailed tree summary.

tree property

tree: Any

Returns the underlying EvalTree instance for inspection, if the 'tree' strategy is active. Raises an AttributeError for other strategies.

execute async

execute() -> EvaluationResult

Executes the entire evaluation and returns the final result.

For SINGLE_TRACE mode, wraps execution in a trace span. For PER_ITEM and SEPARATE modes, skips the wrapper span to allow each item/metric to create its own independent trace.

display classmethod

display()

Displays the usage documentation for this runner.
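For finer control than the evaluation_runner convenience function, the runner can be driven directly. This sketch assumes the EvaluationConfig fields documented below; since execute is async, it is awaited via asyncio.run. The items argument is an assumed dataset.

```python
import asyncio

def make_runner(items):
    # Lazy imports keep this sketch importable without axion installed.
    from axion.runners import EvaluationRunner, EvaluationConfig

    config = EvaluationConfig(
        evaluation_name="qa-regression-v1",
        evaluation_inputs=items,
        scoring_metrics=[],   # metric instances in a real run
        max_concurrent=10,
    )
    return EvaluationRunner(config)

def run(items):
    runner = make_runner(items)
    result = asyncio.run(runner.execute())  # execute() is a coroutine
    print(runner.summary)                   # summary from the active strategy
    return result
```

With the 'tree' strategy active, `runner.tree` would expose the underlying EvalTree for inspection after execution.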


EvaluationConfig

axion.runners.EvaluationConfig dataclass

EvaluationConfig(
    evaluation_name: str,
    evaluation_inputs: Union[Dataset, List[DatasetItem], DataFrame],
    scoring_config: Optional[Union[List[Any], Dict[str, Any], str]] = None,
    scoring_metrics: Optional[List[Any]] = None,
    scoring_strategy: Optional[Union[BaseScoringStrategy, str, ScoringStrategyType]] = None,
    task: Optional[Union[Callable, BaseAPIRunner]] = None,
    scoring_key_mapping: Optional[Dict[str, str]] = None,
    evaluation_description: Optional[str] = None,
    evaluation_metadata: Optional[Dict[str, Any]] = None,
    max_concurrent: int = 5,
    throttle_delay: Optional[float] = 0.0,
    summary_generator: Optional[BaseSummary] = MetricSummary(),
    cache_config: CacheConfig = CacheConfig(),
    error_config: ErrorConfig = ErrorConfig(),
    thresholds: Optional[Dict[str, float]] = None,
    show_progress: bool = True,
    dataset_name: Optional[str] = None,
    run_id: Optional[str] = None,
    enable_internal_caching: bool = True,
    trace_granularity: Union[TraceGranularity, str] = SINGLE_TRACE,
    flush_per_metric: bool = False,
    enable_prompt_caching: bool = False,
)

Configuration for an evaluation run.

Attributes:

  • evaluation_inputs (Union[Dataset, List[DatasetItem], DataFrame]) –

    The input dataset to evaluate. Can be a high-level Dataset object, a list of individual DatasetItem objects, or a preloaded pandas.DataFrame.

  • scoring_config (Optional[Union[List[Any], Dict[str, Any], str]]) –

    The scoring configuration. Can be:

      - A list of metrics for flat evaluation
      - A dictionary with a 'metric' key for flat evaluation (when scoring_strategy='flat')
      - A dictionary for hierarchical (EvalTree) evaluation (with model, weights, etc.)
      - A string file path to a YAML configuration file

  • scoring_metrics (List[Any]) –

    A list of metric objects or callables used to score each item in the dataset.

  • scoring_strategy (Optional[Union[BaseScoringStrategy, str, ScoringStrategyType]]) –

    Defines the scoring method. Can be a pre-initialized strategy instance or a string/Enum alias ('flat' or 'tree'). Overrides auto-detection.

  • evaluation_name (str) –

    A unique name to identify the evaluation. Used in trace logging and result storage.

  • task (Optional[Union[Callable, BaseAPIRunner]]) –

    A custom function to generate predictions or transform inputs. If provided, it will be run before scoring to produce the model output for each dataset item.

  • scoring_key_mapping (Optional[Dict[str, str]]) –

    An optional dictionary mapping metric input names to dataset column names. Useful for adapting metrics to different schema formats.

  • evaluation_description (Optional[str]) –

    A human-readable description of the evaluation for documentation and trace metadata.

  • evaluation_metadata (Optional[Dict[str, Any]]) –

    Additional metadata to include in the evaluation trace (e.g., model version, data slice info, tags).

  • max_concurrent (int) –

    Maximum number of metric evaluations to run concurrently. Default is 5.

  • throttle_delay (float) –

    Specifies the time in seconds to pause after each individual task execution. This is used as a client-side throttle to help prevent API rate limit errors when processing a large number of items. Defaults to 0.0 (no delay).

  • summary_generator (Optional[BaseSummary]) –

    Optional summary generator used to produce a high-level summary after the evaluation. If not provided, a default MetricSummary is used.

  • cache_config (CacheConfig) –

    Configuration for caching metric results to avoid recomputation. Enables both read and write caching.

  • error_config (ErrorConfig) –

    Configuration for how errors are handled during evaluation. Allows skipping metrics or suppressing failures.

  • thresholds (Optional[Dict[str, float]]) –

    Optional threshold values for each metric. Used to flag items or datasets that fall below a given performance level.

  • show_progress (bool) –

    Whether to show a progress bar during evaluation. Defaults to True.

  • dataset_name (Optional[str]) –

    Optional name of the dataset being evaluated. Used for display and trace logging.

  • run_id (Optional[str]) –

    An optional identifier for this specific run. Useful for repeatability and audit logging.

  • trace_granularity (Union[TraceGranularity, str]) –

    Controls trace granularity during evaluation. Accepts enum or string values:

      - 'single_trace' / 'single' / SINGLE_TRACE (default): all evaluations run under one parent trace
      - 'separate' / SEPARATE: each metric execution gets its own independent trace

  • enable_prompt_caching (bool) –

    Enables provider-level prompt caching for all metrics. When True, propagates to all metrics and their sub-judges to mark system/few-shot prefixes as cacheable. Supports Anthropic (explicit caching) and OpenAI (automatic). Defaults to False.


MetricRunner

axion.runners.MetricRunner dataclass

MetricRunner(
    metrics: List[Any],
    name: str = 'MetricRunner',
    description: str = 'Orchestrates evaluation metrics',
    max_concurrent: int = 5,
    thresholds: Optional[Dict[str, float]] = None,
    summary_generator: Optional[BaseSummary] = MetricSummary(),
    cache_manager: Optional[CacheManager] = None,
    error_config: ErrorConfig = ErrorConfig(),
    tracer: Optional[BaseTraceHandler] = None,
    dataset_name: Optional[str] = 'Metric Runner Dataset',
    enable_internal_caching: bool = True,
    trace_granularity: TraceGranularity = SEPARATE,
    flush_per_metric: bool = False,
)

Bases: RunnerMixin

Orchestrates the evaluation of multiple metrics against a dataset.

available_types property

available_types: List[str]

Returns a list of available (registered) metric runner types.

elapsed_time property

elapsed_time: Union[float, None]

Returns the total execution time for the last batch run.

summary property

summary: Union[Dict[str, Any], None]

Returns the summary of the last batch run.

execute_batch async

execute_batch(evaluation_inputs: Union[Dataset, List[DatasetItem], DataFrame], *, show_progress: bool = True) -> List[TestResult]

Executes all configured metrics against the provided dataset.

Trace granularity behavior:

  - SINGLE_TRACE: all metrics run under one parent trace (default)
  - SEPARATE: each metric execution gets its own independent trace
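A usage sketch for the lower-level runner, assuming axion metric instances and dataset items are supplied by the caller; the 'faithfulness' threshold key is a hypothetical metric name. execute_batch is async, so it is awaited via asyncio.run.

```python
import asyncio

def score_items(metrics, items):
    # Lazy import so the sketch stays readable without axion installed.
    from axion.runners import MetricRunner

    runner = MetricRunner(
        metrics=metrics,                     # e.g. axion metric instances
        max_concurrent=5,
        thresholds={"faithfulness": 0.8},    # hypothetical metric key
    )
    results = asyncio.run(runner.execute_batch(items, show_progress=False))
    print(runner.elapsed_time, runner.summary)  # populated after the batch run
    return results
```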


CacheManager

axion._core.cache.CacheManager

CacheManager(config: Optional[CacheConfig] = None)

Manages cache operations for both memory and disk, abstracting the backend.

get

get(key: str) -> Optional[Any]

Gets an item from the cache if use_cache is True.

set

set(key: str, value: Any)

Sets an item in the cache if write_cache is True.

close

close()

Closes the cache connection if applicable (for diskcache).
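The get/set gating described above can be illustrated with a stripped-down, pure-Python stand-in (not axion's actual implementation): reads only happen when use_cache is enabled, and writes only happen when write_cache is enabled.

```python
class TinyCache:
    """Illustrative stand-in for CacheManager's read/write gating."""

    def __init__(self, use_cache=True, write_cache=True):
        self.use_cache = use_cache
        self.write_cache = write_cache
        self._store = {}

    def get(self, key):
        # Reads are skipped entirely when use_cache is False.
        if not self.use_cache:
            return None
        return self._store.get(key)

    def set(self, key, value):
        # Writes are skipped entirely when write_cache is False.
        if self.write_cache:
            self._store[key] = value

cache = TinyCache(use_cache=True, write_cache=False)
cache.set("metric:item-1", 0.9)     # ignored: write_cache is False
print(cache.get("metric:item-1"))   # -> None
```

This mirrors why write_cache has no effect on reads: the two flags gate the two operations independently.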


CacheConfig

axion._core.cache.CacheConfig dataclass

CacheConfig(use_cache: bool = True, write_cache: bool = True, cache_type: str = 'memory', cache_dir: Optional[str] = '.cache', cache_task: bool = True)

Configuration class for controlling caching behavior of metric evaluations.

Attributes:

  • use_cache (bool) –

    If True, attempts to read previously computed results from cache to avoid redundant computation.

  • write_cache (bool) –

    If True, writes newly computed metric results to cache for future use. Has no effect if use_cache is False.

  • cache_type (str) –

    Type of caching backend to use:

      - 'memory': uses an in-memory dictionary (fast, but non-persistent)
      - 'disk': writes cache to disk (persistent across runs)

  • cache_dir (Optional[str]) –

    Directory path where disk cache files will be stored. Only used when cache_type='disk'. Defaults to '.cache'.

  • cache_task (bool) –

    If True, enables caching at the task level (e.g., for full evaluation runs). If False, caching applies only at the metric level.
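A persistent disk-cache configuration, sketched from the attributes above; the cache directory path is an arbitrary example.

```python
def make_disk_cache_config():
    # Lazy import so the sketch stays readable without axion installed.
    from axion._core.cache import CacheConfig

    return CacheConfig(
        use_cache=True,             # read previously computed results
        write_cache=True,           # persist newly computed results
        cache_type="disk",          # 'memory' (default) or 'disk'
        cache_dir=".cache/evals",   # only used when cache_type='disk'
        cache_task=True,            # also cache at the task level
    )
```

The resulting config can be passed as cache_config to evaluation_runner or EvaluationConfig so results survive across runs.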