Dataset API Reference¶

Core data structures for building and managing evaluation datasets.

from axion import Dataset, DatasetItem

D

Dataset

Container for evaluation items. Supports JSON/CSV/DataFrame I/O, filtering, merging, and synthetic generation.

I

DatasetItem

Individual test case with query, expected/actual output, context, metadata, and conversation history.

Dataset¶

axion.dataset.Dataset `dataclass` ¶

Dataset(name: Optional[str] = None, description: str = '', version: str = '1.0', created_at: str = (lambda: current_datetime())(), metadata: Optional[str] = None, items: List[DatasetItem] = list(), _default_catch_all: str = ADDITIONAL_INPUT, _item_map: Dict[str, DatasetItem] = dict(), _synthetic_data: Optional[List[Dict[str, Any]]] = None)

Bases: RichSerializer

Represents a structured dataset for evaluation purposes, supporting both single and multi-turn items.

This class manages a collection of DatasetItem objects and provides functionality for loading, saving, filtering, and transforming datasets.

Attributes:

name (Optional[str]) –

Name of the dataset
description (str) –

Description of the dataset's purpose or contents
version (str) –

Version identifier
created_at (str) –

ISO format timestamp of creation
metadata (Optional[str]) –

Additional metadata (stored as JSON)
items (List[DatasetItem]) –

List of DatasetItem objects

items `class-attribute` `instance-attribute` ¶

items: List[DatasetItem] = field(default_factory=list)

create `classmethod` ¶

create(name: Optional[str] = None, items: Optional[List[Union[Dict[str, Any], str]]] = None, ignore_extra_keys: bool = False, **kwargs) -> Dataset

Creates a new dataset with initial items.

Parameters:

name (Optional[str], default: None ) –

Optional dataset name
items (Optional[List[Union[Dict[str, Any], str]]], default: None ) –

Optional list of items (dicts or strings)
ignore_extra_keys (bool, default: False ) –

If True, only use keys that match DatasetItem fields, ignoring any extra keys in dictionaries. Defaults to False.
**kwargs –

Additional parameters passed to the Dataset constructor.

add_item ¶

add_item(item: Union[DatasetItem, Dict[str, Any]], ignore_extra_keys: bool = False) -> DatasetItem

Add an item to the dataset, handling both single-turn and multi-turn items.

Parameters:

item (Union[DatasetItem, Dict[str, Any]]) –

Either a DatasetItem instance or a dictionary containing item data
ignore_extra_keys (bool, default: False ) –

If True, only use keys that match DatasetItem fields, ignoring any extra keys in the dictionary. Defaults to False.

Returns:

DatasetItem –

The DatasetItem instance that was added to the dataset

add_items ¶

add_items(items: List[Union[DatasetItem, Dict[str, Any]]], ignore_extra_keys: bool = False) -> List[DatasetItem]

Add multiple items to the dataset.

Parameters:

items (List[Union[DatasetItem, Dict[str, Any]]]) –

List of DatasetItem instances or dictionaries
ignore_extra_keys (bool, default: False ) –

If True, only use keys that match DatasetItem fields, ignoring any extra keys in dictionaries. Defaults to False.

Returns:

List[DatasetItem] –

List of added DatasetItem instances

get_item_by_id ¶

get_item_by_id(item_id: str) -> Optional[DatasetItem]

Retrieve an item by its ID.

Parameters:

item_id (str) –

ID of the item to find

Returns:

Optional[DatasetItem] –

DatasetItem if found, None otherwise

filter ¶

filter(condition: Callable[[DatasetItem], bool], dataset_name: Optional[str] = None) -> Dataset

Filters the dataset based on a condition and returns a new Dataset.

read_json `classmethod` ¶

read_json(file_path: Union[str, Path], name: Optional[str] = None, ignore_extra_keys: bool = False) -> Dataset

Creates a dataset from a JSON file, correctly parsing multi-turn conversations.

Parameters:

file_path (Union[str, Path]) –

Path to the JSON file
name (Optional[str], default: None ) –

Optional dataset name
ignore_extra_keys (bool, default: False ) –

If True, only use keys that match DatasetItem fields, ignoring any extra keys in dictionaries. Defaults to False.

read_csv `classmethod` ¶

read_csv(file_path: Union[str, Path], name: Optional[str] = None, column_mapping: Optional[Dict[str, str]] = None, ignore_extra_keys: bool = False, **kwargs) -> Dataset

Creates a dataset from a CSV file.

Parameters:

file_path (Union[str, Path]) –

Path to the CSV file
name (Optional[str], default: None ) –

Optional dataset name
column_mapping (Optional[Dict[str, str]], default: None ) –

Optional mapping to rename columns
ignore_extra_keys (bool, default: False ) –

If True, only use keys that match DatasetItem fields, ignoring any extra keys in dictionaries. Defaults to False.
**kwargs –

Additional parameters passed to read_dataframe.

read_dataframe `classmethod` ¶

read_dataframe(dataframe: DataFrame, name: Optional[str] = None, ignore_extra_keys: bool = False, **kwargs) -> Dataset

Creates a Dataset from a pandas DataFrame, safely deserializing JSON and Python literals. All fields must be included in DataFrame rows to correctly map to DatasetItem.

Parameters:

dataframe (DataFrame) –

Input DataFrame to read from.
name (Optional[str], default: None ) –

Optional dataset name.
ignore_extra_keys (bool, default: False ) –

If True, only use keys that match DatasetItem fields, ignoring any extra keys in dictionaries. Defaults to False.
**kwargs –

Additional parameters passed to the Dataset constructor.

Returns:

Dataset ( Dataset ) –

A populated Dataset instance.

to_json ¶

to_json(file_path: str) -> None

Save dataset to JSON file.

Parameters:

file_path (str) –

Path where to save the JSON file

to_csv ¶

to_csv(file_path: str, remove_aliased: bool = True) -> None

Save dataset to CSV file.

Parameters:

file_path (str) –

Path where to save the CSV file.
remove_aliased (bool, default: True ) –

If True remove aliased model field keys

to_dataframe ¶

to_dataframe(flatten_nested_json: bool = False, sep: str = '.', remove_aliased: bool = True) -> DataFrame

Converts the dataset to a pandas DataFrame, serializing complex fields to JSON strings.

Parameters:

flatten_nested_json (bool, default: False ) –

If True, nested objects will be flattened into separate columns. If False (default), they will be stored as JSON strings.
sep (str, default: '.' ) –

Separator for flattening.
remove_aliased (bool, default: True ) –

If True remove aliased model field keys.

Returns: A pandas DataFrame representing the dataset.

load_dataframe ¶

load_dataframe(dataframe: DataFrame) -> None

Load dataset items from a DataFrame.

Parameters:

dataframe (DataFrame) –

DataFrame containing dataset items.

get_summary ¶

get_summary() -> Optional[Dict[str, Any]]

Return summary statistics about the dataset.

get_summary_table ¶

get_summary_table(title: str = 'Dataset Summary', **kwargs) -> None

Return summary statistics about the dataset in rich table format.

Parameters:

title (str, default: 'Dataset Summary' ) –

Title for the log output.
**kwargs –

Additional arguments passed to the logging method.

execute_dataset_items_from_api ¶

execute_dataset_items_from_api(api_name: str, config: Union[str, Dict[str, Any], Path], max_concurrent: int = 5, show_progress: bool = True, retry_config: Optional[Union[Any, Dict[str, Any]]] = None, require_success: bool = False, additional_config: Optional[Dict[str, Any]] = None, **kwargs) -> None

Synchronously executes API calls using the specified API runner and attaches responses to the dataset items. Useful for batch-processing queries via a registered API.

Internally runs async code but exposes a sync interface to the user.

Parameters:

api_name (str) –

The name of the registered API to use for execution.
config (str | dict | Path) –

Config for authenticating with the API.
max_concurrent (int, default: 5 ) –

Max number of concurrent API requests. Defaults to 5.
show_progress (bool, default: True ) –

Whether to show progress bars using tqdm.
retry_config (RetryConfig | Dict, default: None ) –

Configuration for retrying logic.
require_success (bool, default: False ) –

(bool, optional): If True, remove items from dataset when response.status != 'success'.
additional_config (dict, default: None ) –

Extra configuration options for the runner.
**kwargs –

Extra arguments passed to the executor's execute_batch method.

merge_response_into_dataset_items `staticmethod` ¶

merge_response_into_dataset_items(items: List[DatasetItem], responses: List[RichBaseModel], require_success: bool = False) -> List[DatasetItem]

Updates DatasetItem instances with fields from corresponding APIResponseData.

Parameters:

items (List[DatasetItem]) –

List of DatasetItem objects to update.
responses (List[RichBaseModel]) –

List of APIResponseData objects with new runtime values.
require_success (bool, default: False ) –

If True, only keep items when response.status == 'success'.

Returns:

List[DatasetItem] –

List of DatasetItem objects that were successfully processed (if require_success=True)
List[DatasetItem] –

or all items (if require_success=False).

synthetic_generate_from_directory ¶

synthetic_generate_from_directory(directory_path: str, llm, params: GenerationParams, embed_model: None, max_concurrent: int = 3, show_progress: bool = True, **kwargs)

Generates synthetic QA data from a directory of documents.

This method uses the DocumentQAGenerator to process documents in the given directory, producing synthetic question-answer pairs using the provided language model (LLM) and generation parameters. The results are transformed into a format compatible with the dataset interface (query, expected_output) and added to the dataset.

Parameters:

directory_path (str) –

Path to the directory containing documents to process.
llm –

A language model instance that implements method for generation.
params (GenerationParams) –

A configuration object that defines generation settings such as number of QA pairs, difficulty, chunking behavior, etc.
embed_model (None) –

An embedding model used for semantic parsing.
max_concurrent (int, default: 3 ) –

The maximum number of documents to process concurrently. Defaults to 3.
show_progress (bool, default: True ) –

Whether to show progress bars using tqdm

DatasetItem¶

axion.dataset.DatasetItem ¶

DatasetItem(**data)

Bases: RichDatasetBaseModel

Represents a single evaluation data point, supporting both single-turn and multi-turn conversations.

This model is designed to store all relevant information required for evaluating LLM performance, including the input query, expected and actual outputs, retrieved context, evaluation criteria, and additional metadata. It supports both automated evaluation (binary judgments, critiques) and richer evaluation with tool usage tracking.

Attributes:

id (str) –

Unique identifier for the item (auto-generated if not provided).
query (Optional[str]) –

The input query or prompt for single-turn evaluation. Aliased as query for backward compatibility.
conversation (Optional[MultiTurnConversation]) –

Multi-turn conversation structure containing a sequence of messages. Aliased to conversation.
expected_output (Optional[str]) –

The reference/expected output for single-turn evaluation. Aliased to expected_output.
actual_output (Optional[str]) –

The system's generated response for the given query.
retrieved_content (Optional[List[str]]) –

A list of retrieved documents or contextual snippets used in generating the response.
latency (Optional[float]) –

Response time in seconds for generating the actual_output.
judgment (Optional[Union[str, int]]) –

A short, binary or categorical evaluation decision (e.g., 1/0, pass/fail, approve/decline).
critique (Optional[str]) –

A detailed explanation or rationale supporting the judgment.
conversation_extraction_strategy (Literal['first', 'last']) –

Defines whether to extract query and actual_output from the first or last messages in a multi-turn conversation. Defaults to 'last'.
acceptance_criteria (Optional[List[str]]) –

User-defined definitions of what qualifies as an acceptable/correct response.
additional_input (Dict[str, Any]) –

Arbitrary key-value pairs providing extra inputs for the evaluation context.
metadata (Optional[str]) –

Additional metadata as a JSON string for storing structured information.
trace (Optional[str]) –

Execution trace information, stored as a JSON string.
trace_id (Optional[str]) –

Optional[str]: Trace ID for the original observation from tracing provider.

observation_id: Optional[str]: Observation ID for the original observation from tracing provider. This is the ID of the specific observation that was evaluated. additional_output (Dict[str, Any]): Extra outputs generated by the system, useful for debugging or extended evaluation. tools_called (Optional[List[ToolCall]]): A list of tools the system actually invoked during response generation. expected_tools (Optional[List[ToolCall]]): A list of tools that should have been invoked according to the evaluation criteria. user_tags (List[str]): A list of custom tags to apply to all tool calls in the conversation.

id `class-attribute` `instance-attribute` ¶

id: str = Field(default_factory=lambda: str(uuid7()), alias='dataset_id')

query `property` `writable` ¶

query: Optional[str]

Provides a unified way to access the user's query based on the extraction strategy.

If the strategy is 'last' (default), it returns the last user message. If the strategy is 'first', it returns the first user message.

actual_output `class-attribute` `instance-attribute` ¶

actual_output: Optional[str] = None

expected_output `property` `writable` ¶

expected_output: Optional[str]

Provides a unified way to access the expected output.

If the item is a multi-turn conversation, this returns the reference_text if set. If it's a single-turn item, it returns the stored expected output.

Returns:

Optional[str] –

The expected output as a string, or None if not applicable.

conversation `property` ¶

conversation: Optional[MultiTurnConversation]

Provides direct access to the multi-turn conversation object.

retrieved_content `class-attribute` `instance-attribute` ¶

retrieved_content: Optional[List[str]] = None

latency `class-attribute` `instance-attribute` ¶

latency: Optional[float] = None

judgment `class-attribute` `instance-attribute` ¶

judgment: Optional[Union[str, int]] = Field(default=None, description='A short, binary decision on the output (e.g., 1/0, pass/fail, approve/decline).')

critique `class-attribute` `instance-attribute` ¶

critique: Optional[str] = Field(default=None, description='A detailed explanation or feedback supporting the judgment.')

acceptance_criteria `class-attribute` `instance-attribute` ¶

acceptance_criteria: Optional[List[str]] = None

additional_input `class-attribute` `instance-attribute` ¶

additional_input: Dict[str, Any] = Field(default_factory=dict)

additional_output `class-attribute` `instance-attribute` ¶

additional_output: Dict[str, Any] = Field(default_factory=dict)

metadata `class-attribute` `instance-attribute` ¶

metadata: Optional[str] = Field(None, alias='dataset_metadata')

trace `class-attribute` `instance-attribute` ¶

trace: Optional[str] = None

trace_id `class-attribute` `instance-attribute` ¶

trace_id: Optional[str] = None

observation_id `class-attribute` `instance-attribute` ¶

observation_id: Optional[str] = None

actual_ranking `class-attribute` `instance-attribute` ¶

actual_ranking: Optional[List[Dict[str, Any]]] = Field(default=None, description='Ordered list of retrieved items, e.g., [{"id": "doc1", "score": 0.9}, {"id": "doc2", "score": 0.8}]')

expected_ranking `class-attribute` `instance-attribute` ¶

expected_ranking: Optional[List[Dict[str, Any]]] = Field(default=None, description='Ground truth reference. For IR, e.g., [{"id": "doc1", "relevance": 1.0}, {"id": "doc_abc", "relevance": 0.5}]')

tools_called `class-attribute` `instance-attribute` ¶

tools_called: Optional[List[ToolCall]] = Field(default=None, description='Tools that were actually called by the system')

expected_tools `class-attribute` `instance-attribute` ¶

expected_tools: Optional[List[ToolCall]] = Field(default=None, description='Tools that should have been called')

user_tags `class-attribute` `instance-attribute` ¶

user_tags: List[str] = Field(default_factory=list, description='A list of custom tags to apply to all tool calls in the conversation.')

conversation_extraction_strategy `class-attribute` `instance-attribute` ¶

conversation_extraction_strategy: Literal['first', 'last'] = Field(default='last', description="Determines whether to extract 'query' and 'actual_output' from the 'first' or 'last' messages in a conversation.")

conversation_stats `property` ¶

conversation_stats: Optional[Dict[str, int]]

A dictionary of statistics about the conversation.

agent_trajectory `property` ¶

agent_trajectory: Optional[List[str]]

An ordered list of tool names called, representing the agent's execution path.

has_errors `property` ¶

has_errors: Union[bool, None]

Returns True if any tool message in the conversation is marked as an error.

to_transcript ¶

to_transcript() -> str

Converts the conversation messages into a human-readable string transcript.

If the item is not a multi-turn conversation, it returns an empty string.

Returns:

str –

A formatted string representing the entire conversation.

extract_by_tag ¶

extract_by_tag(tag: str) -> List[tuple[ToolCall, Optional[ToolMessage]]]

Extracts tool interactions from the conversation that match a specific tag.

Parameters:

tag (str) –

The tag to filter by (e.g., 'RAG', 'GUARDRAIL').

Returns:

List[tuple[ToolCall, Optional[ToolMessage]]] –

A list of tuples, where each tuple contains the tagged ToolCall
List[tuple[ToolCall, Optional[ToolMessage]]] –

and its corresponding ToolMessage (or None if not found).

get ¶

get(key: str, default: Any = None) -> Any

Get an attribute value by key, similar to dict.get(). This method correctly handles properties like 'query'.

Parameters:

key (str) –

The attribute name to retrieve.
default (Any, default: None ) –

Value to return if attribute doesn't exist.

Returns:

Any –

The attribute value or default if not found.

keys ¶

keys() -> List[str]

Return all public attribute names, including properties and aliases.

Returns:

List[str] –

A sorted list of public-facing field and property names.

values ¶

values() -> List[Any]

Return all public attribute values, corresponding to the .keys() method.

Returns:

List[Any] –

A list of values for the public-facing fields and properties.

items ¶

items() -> List[tuple]

Return all (key, value) pairs for public attributes.

Returns:

List[tuple] –

A list of (key, value) tuples for public-facing fields and properties.

subset ¶

subset(fields: List[str], keep_id: bool = True, copy_annotations: bool = False) -> DatasetItem

Create a new DatasetItem with only the specified fields, all others set to None/empty.

Parameters:

fields (List[str]) –

List of field names to keep (e.g., ['query', 'expected_output'])
keep_id (bool, default: True ) –

Whether to preserve the original ID (default: True)
copy_annotations (bool, default: False ) –

Whether to copy annotations (judgment and critique) to the new item (default: False)

Returns:

DatasetItem –

New DatasetItem instance with only specified fields populated

Example

Get item with only query and expected_output¶

subset_item = item.subset(['query', 'expected_output'])

Get item with query, actual_output, and preserve annotations¶

subset_item = item.subset(['query', 'actual_output'], copy_annotations=True)

evaluation_fields ¶

evaluation_fields() -> DatasetItem

Extract just the evaluation fields.

update ¶

update(other: Union[DatasetItem, Dict[str, Any]], overwrite: bool = True) -> DatasetItem

Update this DatasetItem with values from another DatasetItem or dictionary. This method correctly handles aliases and special merge logic for lists/dicts.

Parameters:

other (Union[DatasetItem, Dict[str, Any]]) –

Another DatasetItem instance or a dictionary to update from.
overwrite (bool, default: True ) –

If True, overwrite existing values. If False, only fill empty fields.

Returns:

DatasetItem –

The updated DatasetItem instance (self).

update_runtime ¶

update_runtime(**kwargs) -> DatasetItem

Update runtime-related fields such as actual_output or retrieved_content.

Parameters:

**kwargs –

Runtime fields to update.

Returns:

DatasetItem –

Updated DatasetItem (self).

merge_metadata ¶

merge_metadata(metadata: Union[str, Dict[str, Any]]) -> DatasetItem

Merge new metadata into the existing metadata field.

Parameters:

metadata (Union[str, Dict[str, Any]]) –

A dictionary or JSON string to merge.

Returns:

DatasetItem –

Updated DatasetItem (self).

from_dict `classmethod` ¶

from_dict(data: Dict[str, Any]) -> DatasetItem

Create a DatasetItem from a dictionary.

Parameters:

data (Dict[str, Any]) –

Dictionary containing item data

Returns:

DatasetItem –

New DatasetItem instance

Working with Datasets Guide Running Evaluations

Dataset API Reference¶

Dataset¶

axion.dataset.Dataset dataclass ¶

items class-attribute instance-attribute ¶

create classmethod ¶

add_item ¶

add_items ¶

get_item_by_id ¶

filter ¶

read_json classmethod ¶

read_csv classmethod ¶

read_dataframe classmethod ¶

to_json ¶

to_csv ¶

to_dataframe ¶

load_dataframe ¶

get_summary ¶

get_summary_table ¶

execute_dataset_items_from_api ¶

merge_response_into_dataset_items staticmethod ¶

synthetic_generate_from_directory ¶

DatasetItem¶

axion.dataset.DatasetItem ¶

id class-attribute instance-attribute ¶

query property writable ¶

actual_output class-attribute instance-attribute ¶

expected_output property writable ¶

conversation property ¶

retrieved_content class-attribute instance-attribute ¶

latency class-attribute instance-attribute ¶

judgment class-attribute instance-attribute ¶

critique class-attribute instance-attribute ¶

acceptance_criteria class-attribute instance-attribute ¶

additional_input class-attribute instance-attribute ¶

additional_output class-attribute instance-attribute ¶

metadata class-attribute instance-attribute ¶

trace class-attribute instance-attribute ¶

trace_id class-attribute instance-attribute ¶

observation_id class-attribute instance-attribute ¶

actual_ranking class-attribute instance-attribute ¶

expected_ranking class-attribute instance-attribute ¶

tools_called class-attribute instance-attribute ¶

expected_tools class-attribute instance-attribute ¶

user_tags class-attribute instance-attribute ¶

conversation_extraction_strategy class-attribute instance-attribute ¶

conversation_stats property ¶

agent_trajectory property ¶

has_errors property ¶

to_transcript ¶

extract_by_tag ¶

get ¶

keys ¶

values ¶

items ¶

subset ¶

Get item with only query and expected_output¶

Get item with query, actual_output, and preserve annotations¶

evaluation_fields ¶

update ¶

update_runtime ¶

merge_metadata ¶

from_dict classmethod ¶

axion.dataset.Dataset `dataclass` ¶

items `class-attribute` `instance-attribute` ¶

create `classmethod` ¶

read_json `classmethod` ¶

read_csv `classmethod` ¶

read_dataframe `classmethod` ¶

merge_response_into_dataset_items `staticmethod` ¶

id `class-attribute` `instance-attribute` ¶

query `property` `writable` ¶

actual_output `class-attribute` `instance-attribute` ¶

expected_output `property` `writable` ¶

conversation `property` ¶

retrieved_content `class-attribute` `instance-attribute` ¶

latency `class-attribute` `instance-attribute` ¶

judgment `class-attribute` `instance-attribute` ¶

critique `class-attribute` `instance-attribute` ¶

acceptance_criteria `class-attribute` `instance-attribute` ¶

additional_input `class-attribute` `instance-attribute` ¶

additional_output `class-attribute` `instance-attribute` ¶

metadata `class-attribute` `instance-attribute` ¶

trace `class-attribute` `instance-attribute` ¶

trace_id `class-attribute` `instance-attribute` ¶

observation_id `class-attribute` `instance-attribute` ¶

actual_ranking `class-attribute` `instance-attribute` ¶

expected_ranking `class-attribute` `instance-attribute` ¶

tools_called `class-attribute` `instance-attribute` ¶

expected_tools `class-attribute` `instance-attribute` ¶

user_tags `class-attribute` `instance-attribute` ¶

conversation_extraction_strategy `class-attribute` `instance-attribute` ¶

conversation_stats `property` ¶

agent_trajectory `property` ¶

has_errors `property` ¶

from_dict `classmethod` ¶