Schema API Reference¶
Result types and evaluation schemas returned by runners and metrics.
MetricScore¶
axion.schema.MetricScore ¶
Bases: RichBaseModel
Standardized data model for a single metric evaluation result. Captures the computed score, the logic behind it, thresholds used, and any metadata useful for debugging or reporting.
EvaluationResult¶
axion.schema.EvaluationResult (dataclass) ¶
EvaluationResult(run_id: str, evaluation_name: Optional[str], timestamp: str, results: List[TestResult], summary: Dict[str, Any] = dict(), metadata: Dict[str, Any] = dict())
Captures the full result of an evaluation run across multiple test cases and metrics.
Attributes:

- run_id (str) – A unique identifier for this specific evaluation run. Typically generated per execution.
- evaluation_name (Optional[str]) – Optional name for the experiment or test campaign (e.g., "Lead Scoring v2 A/B").
- timestamp (str) – ISO-formatted timestamp indicating when the evaluation was run. Can be used for sorting or audit logging.
- results (List[TestResult]) – A list of TestResult objects, each representing the evaluation output for a single test case across one or more metrics.
- summary (Dict[str, Any]) – Summary of the TestResult objects, representing the evaluation output across each metric.
- metadata (Dict[str, Any]) – Arbitrary metadata such as configuration info, evaluator identity, model version, dataset name, or custom flags for internal use.
to_dataframe ¶
to_dataframe(by_alias: bool = True, id_as_index: bool = False, include_test_case: bool = True, include_run_metadata: bool = True, column_order: Optional[List[str]] = None, rename_columns: bool = True) -> DataFrame
Flattens the entire evaluation result into a single pandas DataFrame.
Parameters:

- by_alias (bool, default: True) – Whether to use field aliases in the output. When True, MetricScore fields use aliases (id -> metric_id, metadata -> metric_metadata).
- id_as_index (bool, default: False) – If True, sets the test_case id as the DataFrame index.
- include_test_case (bool, default: True) – Whether to include test_case fields in the output.
- include_run_metadata (bool, default: True) – Whether to include run-level metadata.
- column_order (list, default: None) – Output column ordering.
- rename_columns (bool, default: True) – Rename columns to match model_arena format.
Returns:

- DataFrame – Flattened view of all metrics with test case and run context.
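To make the flattened shape concrete, here is a minimal sketch in plain Python: one output row per (test_case, metric) pair, with the dataset fields repeated on every metric row. The data and field names below are hypothetical stand-ins, not axion internals.

```python
# Illustrative sketch of the shape to_dataframe() produces: one row per
# (test_case, metric) pair. Data and field names here are hypothetical.
test_results = [
    {"id": "tc-1", "query": "What is RAG?", "scores": [
        {"metric_name": "faithfulness", "metric_score": 0.9},
        {"metric_name": "relevancy", "metric_score": 0.8},
    ]},
]

rows = []
for tr in test_results:
    for score in tr["scores"]:
        # Dataset fields (id, query) are duplicated onto each metric row.
        rows.append({"id": tr["id"], "query": tr["query"], **score})

print(len(rows))  # 2: two metric rows for the single test case
```

This duplication is what the normalized variant below avoids.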
to_normalized_dataframes ¶
to_normalized_dataframes(by_alias: bool = True, include_run_metadata: bool = True, dataset_column_order: Optional[List[str]] = None, metrics_column_order: Optional[List[str]] = None, rename_columns: bool = True, include_computed_fields: bool = True) -> NormalizedDataFrames
Returns two normalized DataFrames following data engineering best practices.
Unlike to_dataframe(), which creates one row per (test_case, metric) combination with duplicated dataset fields, this method returns:

1. Dataset Items Table: One row per DatasetItem (inputs/ground truth)
2. Metric Results Table: One row per MetricScore, with an FK to the dataset item
Parameters:

- by_alias (bool, default: True) – Whether to use field aliases in the output. When True, MetricScore fields use aliases (id -> metric_id, metadata -> metric_metadata).
- include_run_metadata (bool, default: True) – Whether to include run-level metadata in the metrics table.
- dataset_column_order (list, default: None) – Output column ordering for the dataset items table.
- metrics_column_order (list, default: None) – Output column ordering for the metric results table.
- rename_columns (bool, default: True) – Rename columns to match model_arena format (name -> metric_name, score -> metric_score, type -> metric_type).
- include_computed_fields (bool, default: True) – Whether to include computed fields from DatasetItem.

Returns:

- NormalizedDataFrames – A named tuple containing:
  - dataset_items: DataFrame with one row per DatasetItem
  - metric_results: DataFrame with one row per MetricScore, with an 'id' FK
Note
You can merge the two DataFrames to get a denormalized view similar to to_dataframe(). The FK column name depends on the by_alias parameter:

# With by_alias=False
merged_df = metrics_df.merge(dataset_df, on='id', how='left')

# With by_alias=True (default)
merged_df = metrics_df.merge(dataset_df, on='dataset_id', how='left')
The merged result has the same columns as to_dataframe(). Both metadata fields use aliases to avoid conflicts: DatasetItem's metadata uses 'dataset_metadata' and MetricScore's metadata uses 'metric_metadata'. The only difference is column order. Use how='left' to keep metric rows even when test_case was None.
The normalized approach is better when you want to avoid data duplication or need to work with the data in a relational/normalized way.
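The FK merge described above can be sketched without pandas, using plain dicts: the dataset table holds each item once, and every metric row joins back through its 'id' key. Field names and data here are illustrative, not axion's internals.

```python
# Minimal sketch of the normalized layout and the left-join on the 'id' FK,
# equivalent in spirit to metrics_df.merge(dataset_df, on='id', how='left').
dataset_items = [
    {"id": "tc-1", "query": "What is RAG?"},
    {"id": "tc-2", "query": "Define FK."},
]
metric_results = [
    {"id": "tc-1", "metric_name": "faithfulness", "metric_score": 0.9},
    {"id": "tc-1", "metric_name": "relevancy", "metric_score": 0.8},
    {"id": "tc-2", "metric_name": "faithfulness", "metric_score": 1.0},
]

# Index dataset items by id, then attach item fields to each metric row.
items_by_id = {item["id"]: item for item in dataset_items}
merged = [{**m, **items_by_id.get(m["id"], {})} for m in metric_results]

print(len(merged))  # 3: one row per metric, as in to_dataframe()
```

Note how each dataset field is stored once but appears on every matching metric row after the join, which is exactly the duplication the normalized form avoids.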
to_latency_plot ¶
to_latency_plot(col_name: str = 'latency', id_col: str = 'id', bins: int = 30, show_legend: bool = True, show_stats_panel: bool = True, figsize: Tuple[int, int] = (16, 9), return_plot: bool = False, show_plot: bool = True, output_path: Optional[str] = None, plot_title: str = 'Latency Distribution', color_palette: Optional[Dict[str, str]] = None) -> Union[DataFrame, Tuple[DataFrame, Optional[Any], Optional[Any]]]
Analyzes and visualizes latency distribution.
Parameters:

- col_name (str, default: 'latency') – Name of the column containing latency values.
- id_col (str, default: 'id') – Unique identifier for test cases (used to deduplicate latency).
- bins (int, default: 30) – Number of histogram bins.
- show_legend (bool, default: True) – If True, show the plot legend.
- show_stats_panel (bool, default: True) – If True, show the stats panel.
- figsize (Tuple[int, int], default: (16, 9)) – Size of the matplotlib figure.
- return_plot (bool, default: False) – If True, returns (stats_df, fig, ax). If False, returns stats_df.
- show_plot (bool, default: True) – Whether to render the plot using plt.show() (or display in a notebook).
- output_path (Optional[str], default: None) – If provided, saves the plot to this file path.
- plot_title (str, default: 'Latency Distribution') – Descriptive title for the latency plot.
- color_palette (Optional[Dict[str, str]], default: None) – Custom colors for the LatencyAnalyzer.

Returns:

- Union[DataFrame, Tuple[DataFrame, Figure, Axes]] – stats_df alone, or (stats_df, fig, ax) when return_plot=True.
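The role of id_col can be illustrated with a small pure-Python sketch: because the flattened frame repeats each test case's latency once per metric row, stats must be computed over one latency value per unique id. The data and dedup logic below are illustrative assumptions, not axion's implementation.

```python
import statistics

# Hedged sketch of the deduplication step implied by id_col: keep one
# latency value per unique test case before computing distribution stats.
rows = [
    {"id": "tc-1", "metric": "faithfulness", "latency": 1.2},
    {"id": "tc-1", "metric": "relevancy", "latency": 1.2},  # repeated per metric row
    {"id": "tc-2", "metric": "faithfulness", "latency": 0.8},
]

# Keep the first latency seen per id (akin to drop_duplicates on id_col).
latency_by_id = {}
for row in rows:
    latency_by_id.setdefault(row["id"], row["latency"])

latencies = list(latency_by_id.values())
print(statistics.mean(latencies))  # ~1.0, not skewed by repeated metric rows
```

Without the dedup step, tc-1's latency would be counted twice and the mean would be biased toward test cases with more metrics.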
to_scorecard ¶
to_scorecard(llm: Optional[LLMRunnable] = None, metric_definitions: dict = None, explanation_callback: Optional[Callable] = None, instruction: Optional[str] = None, max_concurrent: int = 10, output_path: Optional[str] = None, display_in_notebook: bool = False, return_html: bool = False, return_styled_df: bool = True, id_col: str = 'metric_name', parent_col: str = 'parent', value_cols: List[str] = None, group_meta_cols: List[str] = None) -> Union[str, DataFrame, None]
Generates a hierarchical scorecard report using the evaluation results.
This method creates a visual performance breakdown. It can display the report interactively in a notebook, save it as an HTML file, or return the styled object/HTML string for custom use.
Parameters:

- llm (Optional[LLMRunnable], default: None) – Custom LLM instance to use for generating qualitative explanations.
- metric_definitions (dict, default: None) – Dictionary mapping metric names to static descriptions or templates.
- explanation_callback (callable, default: None) – Custom function f(name, score, type) to generate explanations manually.
- instruction (Optional[str], default: None) – System prompt override for the explanation-generation LLM.
- max_concurrent (int, default: 10) – Maximum number of parallel LLM calls for batch-processing explanations.
- output_path (Optional[str], default: None) – File path to save the generated HTML report.
- display_in_notebook (bool, default: False) – If True, renders the styled DataFrame (or HTML) directly in Jupyter/IPython.
- return_html (bool, default: False) – If True, returns the raw HTML string of the report.
- return_styled_df (bool, default: True) – If True, returns the pandas Styler object for further customization.
- id_col (str, default: 'metric_name') – Column name representing the unique node identifier.
- parent_col (str, default: 'parent') – Column name representing the parent node identifier.
- value_cols (List[str], default: None) – List of columns to aggregate values for (e.g., ['metric_score', 'weight']).
- group_meta_cols (List[str], default: None) – List of metadata columns to include in grouping (e.g., ['metric_type']).

Returns:

- Union[str, Styler, None] –
  - The HTML string if return_html=True.
  - The pandas Styler object if return_styled_df=True.
  - None otherwise (the default behavior is just to display or save).
publish_to_observability ¶
publish_to_observability(loader: Optional[Any] = None, observation_id_field: Optional[str] = 'observation_id', flush: bool = True, tags: Optional[List[str]] = None, metric_names: Optional[List[str]] = None) -> Dict[str, int]
Publish evaluation scores to an observability platform.
Uses a trace loader to publish scores. By default, uses LangfuseTraceLoader.
Parameters:

- loader (Optional[Any], default: None) – A trace loader instance (e.g., LangfuseTraceLoader, OpikTraceLoader). If None, creates a new LangfuseTraceLoader using environment variables.
- observation_id_field (Optional[str], default: 'observation_id') – Field name on DatasetItem containing the observation/span ID. If provided, scores attach to that specific observation within the trace. If None, scores attach to the trace itself.
- flush (bool, default: True) – Whether to flush the client after uploading.
- tags (Optional[List[str]], default: None) – Optional list of tags to attach to all scores as metadata. Falls back to the LANGFUSE_TAGS env var if not provided.
- metric_names (Optional[List[str]], default: None) – Optional list of metric names to upload. If provided, only scores whose metric name matches are uploaded.
Note
Environment cannot be set when pushing scores to existing traces. To set environment, configure it at client initialization when creating traces (via LANGFUSE_ENVIRONMENT or LANGFUSE_TRACING_ENVIRONMENT env vars or the environment parameter in LangfuseTracer).
Returns:

- Dict[str, int] – Dict with counts: {'uploaded': N, 'skipped': M}
  - uploaded: Number of scores successfully pushed
  - skipped: Number of scores skipped (missing trace_id or invalid score)
Example
from axion._core.tracing.loaders import LangfuseTraceLoader

# Using default Langfuse loader
stats = result.publish_to_observability()

# Using explicit loader with tags
loader = LangfuseTraceLoader()
stats = result.publish_to_observability(
    loader=loader,
    tags=['prod', 'v1.0']
)

# Attach scores to traces only (no observation)
stats = result.publish_to_observability(observation_id_field=None)
publish_as_experiment ¶
publish_as_experiment(loader: Optional[Any] = None, dataset_name: Optional[str] = None, run_name: Optional[str] = None, run_metadata: Optional[Dict[str, Any]] = None, flush: bool = True, tags: Optional[List[str]] = None, score_on_runtime_traces: bool = False, link_to_traces: bool = False, metric_names: Optional[List[str]] = None) -> Dict[str, Any]
Publish evaluation results to Langfuse as a dataset experiment.
This method creates a complete experiment in Langfuse with a dataset,
dataset items, experiment runs, and scores. Unlike publish_to_observability(),
it does not require existing traces - it creates everything from scratch.
Parameters:

- loader (Optional[Any], default: None) – A LangfuseTraceLoader instance. If None, creates a new one using environment variables.
- dataset_name (Optional[str], default: None) – Name for the Langfuse dataset. Defaults to evaluation_name or generates one based on run_id.
- run_name (Optional[str], default: None) – Name for the experiment run. Defaults to the "{dataset_name}-{run_id}" pattern.
- run_metadata (Optional[Dict[str, Any]], default: None) – Optional metadata to attach to the experiment run.
- flush (bool, default: True) – Whether to flush the client after uploading.
- tags (Optional[List[str]], default: None) – Optional list of tags to attach to all scores as metadata.
- score_on_runtime_traces (bool, default: False) – If True, skip creating per-item "Dataset run" traces and instead attach scores to existing runtime traces via trace_id/observation_id. Takes precedence over link_to_traces if both are True.
- link_to_traces (bool, default: False) – If True, link experiment runs to existing traces via the low-level API instead of creating new "Dataset run" traces. This allows experiment runs to appear linked to the original evaluation traces in the Langfuse UI. Falls back to creating new traces if trace_id is not available. Ignored if score_on_runtime_traces is True.
- metric_names (Optional[List[str]], default: None) – Optional list of metric names to upload. If provided, only scores whose metric name matches are uploaded.
Returns:

- Dict[str, Any] – Dict with statistics:
  - dataset_name: Name of the created/used dataset
  - run_name: Name of the experiment run
  - items_created: Number of dataset items created
  - runs_created: Number of experiment runs created
  - scores_uploaded: Number of scores attached
  - scores_skipped: Number of scores skipped (None/NaN values)
  - errors: List of error messages encountered
Example
from axion import evaluation_runner
from axion.metrics import Faithfulness, AnswerRelevancy

# Run evaluation
results = evaluation_runner(
    evaluation_inputs=dataset,
    scoring_config=config,
    evaluation_name="RAG Evaluation",
)

# Upload to Langfuse as experiment
stats = results.publish_as_experiment(
    dataset_name="my-rag-dataset",
    run_name="experiment-v1",
    tags=["production"]
)

print(f"Uploaded {stats['scores_uploaded']} scores to {stats['dataset_name']}")
expand_multi_metrics ¶
expand_multi_metrics(expansion_map: Optional[Dict[str, Callable[[MetricScore], List[MetricScore]]]] = None, in_place: bool = False) -> EvaluationResult
Expand multi-metric results using custom expansion functions.
This method allows post-hoc expansion of metric scores that contain nested data in their metadata or signals. It's useful when you have existing results that weren't exploded at runtime, or when you want to apply custom expansion logic.
Parameters:

- expansion_map (Optional[Dict[str, Callable[[MetricScore], List[MetricScore]]]], default: None) – Dict mapping metric names to expansion functions. Each function takes a MetricScore and returns List[MetricScore]. If None, no expansion is performed.
- in_place (bool, default: False) – If True, modifies this instance. If False, returns a new copy.
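An expansion function simply fans one composite score out into several. The sketch below shows the shape of such a function using plain dicts as stand-ins for MetricScore objects; the field names (metadata, sub_scores) and the helper itself are illustrative assumptions, not part of axion's API.

```python
# Hedged sketch of an expansion function of the kind expand_multi_metrics()
# expects: it takes one score and returns a list of derived scores.
# Plain dicts stand in for MetricScore objects here.
def expand_per_field(score):
    # Fan a composite score out into one score per nested sub-score.
    return [
        {"name": f"{score['name']}.{field}", "score": value}
        for field, value in score["metadata"]["sub_scores"].items()
    ]

composite = {
    "name": "rubric",
    "score": 0.85,
    "metadata": {"sub_scores": {"clarity": 0.9, "accuracy": 0.8}},
}

expanded = expand_per_field(composite)
print([s["name"] for s in expanded])  # ['rubric.clarity', 'rubric.accuracy']
```

With a real EvaluationResult, such a function would be registered under its metric's name in expansion_map, e.g. expansion_map={"rubric": expand_per_field}.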
TestResult¶
axion.schema.TestResult (dataclass) ¶
TestResult(test_case: Optional[DatasetItem], score_results: List[MetricScore], metadata: Optional[Dict[str, Any]] = dict())
Represents the result of evaluating a single test case using one or more evaluation metrics.
Attributes:

- test_case (Optional[DatasetItem]) – The input test case containing query, expected output, and other context. This is the basis on which all metrics are applied.
- score_results (List[MetricScore]) – A list of evaluation results returned from applying different metrics to this test case. Each MetricScore includes a score, explanation, and threshold comparison.
- metadata (Optional[Dict[str, Any]]) – Optional metadata for storing extra context such as timestamps, evaluator info, experiment variant, evaluation notes, or model config parameters.
NormalizedDataFrames¶
axion.schema.NormalizedDataFrames ¶
Bases: NamedTuple
Return type for to_normalized_dataframes() method.
Provides a normalized data model with two separate DataFrames: - dataset_items: One row per DatasetItem (inputs/ground truth) - metric_results: One row per MetricScore, with FK to dataset item
ErrorConfig¶
axion.schema.ErrorConfig (dataclass) ¶
Configuration class for controlling error handling during metric execution.
Attributes:

- ignore_errors (bool) – If True, any exceptions raised during metric execution will be caught and suppressed. The metric will return a placeholder result (e.g., None or NaN) instead of failing the entire evaluation. Use this to allow evaluations to proceed even if some metrics occasionally fail.
- skip_on_missing_params (bool) – If True, metrics will be skipped entirely when required input fields are missing from the data. This is useful when running multiple metrics over heterogeneous data where not all fields are always present. If False, the metric will raise an error if required inputs are missing.
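The two flags can be illustrated with a minimal runner loop. The loop, the local ErrorConfig stand-in, and the metric representation below are all hypothetical; only the flag semantics are taken from the attribute descriptions above.

```python
from dataclasses import dataclass

# Local stand-in mirroring the two documented flags (not axion.schema.ErrorConfig).
@dataclass
class ErrorConfig:
    ignore_errors: bool = False
    skip_on_missing_params: bool = False

def run_metric(metric, inputs, config):
    # skip_on_missing_params: skip the metric when required inputs are absent.
    missing = [p for p in metric["required"] if p not in inputs]
    if missing:
        if config.skip_on_missing_params:
            return "skipped"
        raise KeyError(f"missing params: {missing}")
    # ignore_errors: suppress metric exceptions and return a placeholder.
    try:
        return metric["fn"](inputs)
    except Exception:
        if config.ignore_errors:
            return None
        raise

metric = {"required": ["answer"], "fn": lambda x: 1 / 0}  # always raises
cfg = ErrorConfig(ignore_errors=True, skip_on_missing_params=True)

print(run_metric(metric, {}, cfg))                # 'skipped' (missing 'answer')
print(run_metric(metric, {"answer": "hi"}, cfg))  # None (error suppressed)
```

With both flags False, the same calls would raise instead, failing the evaluation at the first broken or under-specified metric.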