Skip to content

Dataset API Reference

Core data structures for building and managing evaluation datasets.

from axion import Dataset, DatasetItem
D

Dataset

Container for evaluation items. Supports JSON/CSV/DataFrame I/O, filtering, merging, and synthetic generation.

I

DatasetItem

Individual test case with query, expected/actual output, context, metadata, and conversation history.


Dataset

axion.dataset.Dataset dataclass

Dataset(name: Optional[str] = None, description: str = '', version: str = '1.0', created_at: str = (lambda: current_datetime())(), metadata: Optional[str] = None, items: List[DatasetItem] = list(), _default_catch_all: str = ADDITIONAL_INPUT, _item_map: Dict[str, DatasetItem] = dict(), _synthetic_data: Optional[List[Dict[str, Any]]] = None)

Bases: RichSerializer

Represents a structured dataset for evaluation purposes, supporting both single and multi-turn items.

This class manages a collection of DatasetItem objects and provides functionality for loading, saving, filtering, and transforming datasets.

Attributes:

  • name (Optional[str]) –

    Name of the dataset

  • description (str) –

    Description of the dataset's purpose or contents

  • version (str) –

    Version identifier

  • created_at (str) –

    ISO format timestamp of creation

  • metadata (Optional[str]) –

    Additional metadata (stored as JSON)

  • items (List[DatasetItem]) –

    List of DatasetItem objects

items class-attribute instance-attribute

items: List[DatasetItem] = field(default_factory=list)

create classmethod

create(name: Optional[str] = None, items: Optional[List[Union[Dict[str, Any], str]]] = None, ignore_extra_keys: bool = False, **kwargs) -> Dataset

Creates a new dataset with initial items.

Parameters:

  • name (Optional[str], default: None ) –

    Optional dataset name

  • items (Optional[List[Union[Dict[str, Any], str]]], default: None ) –

    Optional list of items (dicts or strings)

  • ignore_extra_keys (bool, default: False ) –

    If True, only use keys that match DatasetItem fields, ignoring any extra keys in dictionaries. Defaults to False.

  • **kwargs

    Additional parameters passed to the Dataset constructor.

add_item

add_item(item: Union[DatasetItem, Dict[str, Any]], ignore_extra_keys: bool = False) -> DatasetItem

Add an item to the dataset, handling both single-turn and multi-turn items.

Parameters:

  • item (Union[DatasetItem, Dict[str, Any]]) –

    Either a DatasetItem instance or a dictionary containing item data

  • ignore_extra_keys (bool, default: False ) –

    If True, only use keys that match DatasetItem fields, ignoring any extra keys in the dictionary. Defaults to False.

Returns:

  • DatasetItem

    The DatasetItem instance that was added to the dataset

add_items

add_items(items: List[Union[DatasetItem, Dict[str, Any]]], ignore_extra_keys: bool = False) -> List[DatasetItem]

Add multiple items to the dataset.

Parameters:

  • items (List[Union[DatasetItem, Dict[str, Any]]]) –

    List of DatasetItem instances or dictionaries

  • ignore_extra_keys (bool, default: False ) –

    If True, only use keys that match DatasetItem fields, ignoring any extra keys in dictionaries. Defaults to False.

Returns:

  • List[DatasetItem]

    List of added DatasetItem instances

get_item_by_id

get_item_by_id(item_id: str) -> Optional[DatasetItem]

Retrieve an item by its ID.

Parameters:

  • item_id (str) –

    ID of the item to find

Returns:

  • Optional[DatasetItem]

    DatasetItem if found, None otherwise

filter

filter(condition: Callable[[DatasetItem], bool], dataset_name: Optional[str] = None) -> Dataset

Filters the dataset based on a condition and returns a new Dataset.

read_json classmethod

read_json(file_path: Union[str, Path], name: Optional[str] = None, ignore_extra_keys: bool = False) -> Dataset

Creates a dataset from a JSON file, correctly parsing multi-turn conversations.

Parameters:

  • file_path (Union[str, Path]) –

    Path to the JSON file

  • name (Optional[str], default: None ) –

    Optional dataset name

  • ignore_extra_keys (bool, default: False ) –

    If True, only use keys that match DatasetItem fields, ignoring any extra keys in dictionaries. Defaults to False.

read_csv classmethod

read_csv(file_path: Union[str, Path], name: Optional[str] = None, column_mapping: Optional[Dict[str, str]] = None, ignore_extra_keys: bool = False, **kwargs) -> Dataset

Creates a dataset from a CSV file.

Parameters:

  • file_path (Union[str, Path]) –

    Path to the CSV file

  • name (Optional[str], default: None ) –

    Optional dataset name

  • column_mapping (Optional[Dict[str, str]], default: None ) –

    Optional mapping to rename columns

  • ignore_extra_keys (bool, default: False ) –

    If True, only use keys that match DatasetItem fields, ignoring any extra keys in dictionaries. Defaults to False.

  • **kwargs

    Additional parameters passed to read_dataframe.

read_dataframe classmethod

read_dataframe(dataframe: DataFrame, name: Optional[str] = None, ignore_extra_keys: bool = False, **kwargs) -> Dataset

Creates a Dataset from a pandas DataFrame, safely deserializing JSON and Python literals. All fields must be included in DataFrame rows to correctly map to DatasetItem.

Parameters:

  • dataframe (DataFrame) –

    Input DataFrame to read from.

  • name (Optional[str], default: None ) –

    Optional dataset name.

  • ignore_extra_keys (bool, default: False ) –

    If True, only use keys that match DatasetItem fields, ignoring any extra keys in dictionaries. Defaults to False.

  • **kwargs

    Additional parameters passed to the Dataset constructor.

Returns:

  • Dataset ( Dataset ) –

    A populated Dataset instance.

to_json

to_json(file_path: str) -> None

Save dataset to JSON file.

Parameters:

  • file_path (str) –

    Path where to save the JSON file

to_csv

to_csv(file_path: str, remove_aliased: bool = True) -> None

Save dataset to CSV file.

Parameters:

  • file_path (str) –

    Path where to save the CSV file.

  • remove_aliased (bool, default: True ) –

    If True remove aliased model field keys

to_dataframe

to_dataframe(flatten_nested_json: bool = False, sep: str = '.', remove_aliased: bool = True) -> DataFrame

Converts the dataset to a pandas DataFrame, serializing complex fields to JSON strings.

Parameters:

  • flatten_nested_json (bool, default: False ) –

    If True, nested objects will be flattened into separate columns. If False (default), they will be stored as JSON strings.

  • sep (str, default: '.' ) –

    Separator for flattening.

  • remove_aliased (bool, default: True ) –

    If True remove aliased model field keys.

Returns: A pandas DataFrame representing the dataset.

load_dataframe

load_dataframe(dataframe: DataFrame) -> None

Load dataset items from a DataFrame.

Parameters:

  • dataframe (DataFrame) –

    DataFrame containing dataset items.

get_summary

get_summary() -> Optional[Dict[str, Any]]

Return summary statistics about the dataset.

get_summary_table

get_summary_table(title: str = 'Dataset Summary', **kwargs) -> None

Return summary statistics about the dataset in rich table format.

Parameters:

  • title (str, default: 'Dataset Summary' ) –

    Title for the log output.

  • **kwargs

    Additional arguments passed to the logging method.

execute_dataset_items_from_api

execute_dataset_items_from_api(api_name: str, config: Union[str, Dict[str, Any], Path], max_concurrent: int = 5, show_progress: bool = True, retry_config: Optional[Union[Any, Dict[str, Any]]] = None, require_success: bool = False, additional_config: Optional[Dict[str, Any]] = None, **kwargs) -> None

Synchronously executes API calls using the specified API runner and attaches responses to the dataset items. Useful for batch-processing queries via a registered API.

Internally runs async code but exposes a sync interface to the user.

Parameters:

  • api_name (str) –

    The name of the registered API to use for execution.

  • config (str | dict | Path) –

    Config for authenticating with the API.

  • max_concurrent (int, default: 5 ) –

    Max number of concurrent API requests. Defaults to 5.

  • show_progress (bool, default: True ) –

    Whether to show progress bars using tqdm.

  • retry_config (RetryConfig | Dict, default: None ) –

    Configuration for retrying logic.

  • require_success (bool, default: False ) –

    (bool, optional): If True, remove items from dataset when response.status != 'success'.

  • additional_config (dict, default: None ) –

    Extra configuration options for the runner.

  • **kwargs

    Extra arguments passed to the executor's execute_batch method.

merge_response_into_dataset_items staticmethod

merge_response_into_dataset_items(items: List[DatasetItem], responses: List[RichBaseModel], require_success: bool = False) -> List[DatasetItem]

Updates DatasetItem instances with fields from corresponding APIResponseData.

Parameters:

  • items (List[DatasetItem]) –

    List of DatasetItem objects to update.

  • responses (List[RichBaseModel]) –

    List of APIResponseData objects with new runtime values.

  • require_success (bool, default: False ) –

    If True, only keep items when response.status == 'success'.

Returns:

  • List[DatasetItem]

    List of DatasetItem objects that were successfully processed (if require_success=True)

  • List[DatasetItem]

    or all items (if require_success=False).

synthetic_generate_from_directory

synthetic_generate_from_directory(directory_path: str, llm, params: GenerationParams, embed_model: None, max_concurrent: int = 3, show_progress: bool = True, **kwargs)

Generates synthetic QA data from a directory of documents.

This method uses the DocumentQAGenerator to process documents in the given directory, producing synthetic question-answer pairs using the provided language model (LLM) and generation parameters. The results are transformed into a format compatible with the dataset interface (query, expected_output) and added to the dataset.

Parameters:

  • directory_path (str) –

    Path to the directory containing documents to process.

  • llm

    A language model instance that implements method for generation.

  • params (GenerationParams) –

    A configuration object that defines generation settings such as number of QA pairs, difficulty, chunking behavior, etc.

  • embed_model (None) –

    An embedding model used for semantic parsing.

  • max_concurrent (int, default: 3 ) –

    The maximum number of documents to process concurrently. Defaults to 3.

  • show_progress (bool, default: True ) –

    Whether to show progress bars using tqdm


DatasetItem

axion.dataset.DatasetItem

DatasetItem(**data)

Bases: RichDatasetBaseModel

Represents a single evaluation data point, supporting both single-turn and multi-turn conversations.

This model is designed to store all relevant information required for evaluating LLM performance, including the input query, expected and actual outputs, retrieved context, evaluation criteria, and additional metadata. It supports both automated evaluation (binary judgments, critiques) and richer evaluation with tool usage tracking.

Attributes:

  • id (str) –

    Unique identifier for the item (auto-generated if not provided).

  • query (Optional[str]) –

    The input query or prompt for single-turn evaluation. Aliased as query for backward compatibility.

  • conversation (Optional[MultiTurnConversation]) –

    Multi-turn conversation structure containing a sequence of messages. Aliased to conversation.

  • expected_output (Optional[str]) –

    The reference/expected output for single-turn evaluation. Aliased to expected_output.

  • actual_output (Optional[str]) –

    The system's generated response for the given query.

  • retrieved_content (Optional[List[str]]) –

    A list of retrieved documents or contextual snippets used in generating the response.

  • latency (Optional[float]) –

    Response time in seconds for generating the actual_output.

  • judgment (Optional[Union[str, int]]) –

    A short, binary or categorical evaluation decision (e.g., 1/0, pass/fail, approve/decline).

  • critique (Optional[str]) –

    A detailed explanation or rationale supporting the judgment.

  • conversation_extraction_strategy (Literal['first', 'last']) –

    Defines whether to extract query and actual_output from the first or last messages in a multi-turn conversation. Defaults to 'last'.

  • acceptance_criteria (Optional[List[str]]) –

    User-defined definitions of what qualifies as an acceptable/correct response.

  • additional_input (Dict[str, Any]) –

    Arbitrary key-value pairs providing extra inputs for the evaluation context.

  • metadata (Optional[str]) –

    Additional metadata as a JSON string for storing structured information.

  • trace (Optional[str]) –

    Execution trace information, stored as a JSON string.

  • trace_id (Optional[str]) –

    Optional[str]: Trace ID for the original observation from tracing provider.

observation_id: Optional[str]: Observation ID for the original observation from tracing provider. This is the ID of the specific observation that was evaluated. additional_output (Dict[str, Any]): Extra outputs generated by the system, useful for debugging or extended evaluation. tools_called (Optional[List[ToolCall]]): A list of tools the system actually invoked during response generation. expected_tools (Optional[List[ToolCall]]): A list of tools that should have been invoked according to the evaluation criteria. user_tags (List[str]): A list of custom tags to apply to all tool calls in the conversation.

id class-attribute instance-attribute

id: str = Field(default_factory=lambda: str(uuid7()), alias='dataset_id')

query property writable

query: Optional[str]

Provides a unified way to access the user's query based on the extraction strategy.

If the strategy is 'last' (default), it returns the last user message. If the strategy is 'first', it returns the first user message.

actual_output class-attribute instance-attribute

actual_output: Optional[str] = None

expected_output property writable

expected_output: Optional[str]

Provides a unified way to access the expected output.

If the item is a multi-turn conversation, this returns the reference_text if set. If it's a single-turn item, it returns the stored expected output.

Returns:

  • Optional[str]

    The expected output as a string, or None if not applicable.

conversation property

conversation: Optional[MultiTurnConversation]

Provides direct access to the multi-turn conversation object.

retrieved_content class-attribute instance-attribute

retrieved_content: Optional[List[str]] = None

latency class-attribute instance-attribute

latency: Optional[float] = None

judgment class-attribute instance-attribute

judgment: Optional[Union[str, int]] = Field(default=None, description='A short, binary decision on the output (e.g., 1/0, pass/fail, approve/decline).')

critique class-attribute instance-attribute

critique: Optional[str] = Field(default=None, description='A detailed explanation or feedback supporting the judgment.')

acceptance_criteria class-attribute instance-attribute

acceptance_criteria: Optional[List[str]] = None

additional_input class-attribute instance-attribute

additional_input: Dict[str, Any] = Field(default_factory=dict)

additional_output class-attribute instance-attribute

additional_output: Dict[str, Any] = Field(default_factory=dict)

metadata class-attribute instance-attribute

metadata: Optional[str] = Field(None, alias='dataset_metadata')

trace class-attribute instance-attribute

trace: Optional[str] = None

trace_id class-attribute instance-attribute

trace_id: Optional[str] = None

observation_id class-attribute instance-attribute

observation_id: Optional[str] = None

actual_ranking class-attribute instance-attribute

actual_ranking: Optional[List[Dict[str, Any]]] = Field(default=None, description='Ordered list of retrieved items, e.g., [{"id": "doc1", "score": 0.9}, {"id": "doc2", "score": 0.8}]')

expected_ranking class-attribute instance-attribute

expected_ranking: Optional[List[Dict[str, Any]]] = Field(default=None, description='Ground truth reference. For IR, e.g., [{"id": "doc1", "relevance": 1.0}, {"id": "doc_abc", "relevance": 0.5}]')

tools_called class-attribute instance-attribute

tools_called: Optional[List[ToolCall]] = Field(default=None, description='Tools that were actually called by the system')

expected_tools class-attribute instance-attribute

expected_tools: Optional[List[ToolCall]] = Field(default=None, description='Tools that should have been called')

user_tags class-attribute instance-attribute

user_tags: List[str] = Field(default_factory=list, description='A list of custom tags to apply to all tool calls in the conversation.')

conversation_extraction_strategy class-attribute instance-attribute

conversation_extraction_strategy: Literal['first', 'last'] = Field(default='last', description="Determines whether to extract 'query' and 'actual_output' from the 'first' or 'last' messages in a conversation.")

conversation_stats property

conversation_stats: Optional[Dict[str, int]]

A dictionary of statistics about the conversation.

agent_trajectory property

agent_trajectory: Optional[List[str]]

An ordered list of tool names called, representing the agent's execution path.

has_errors property

has_errors: Union[bool, None]

Returns True if any tool message in the conversation is marked as an error.

to_transcript

to_transcript() -> str

Converts the conversation messages into a human-readable string transcript.

If the item is not a multi-turn conversation, it returns an empty string.

Returns:

  • str

    A formatted string representing the entire conversation.

extract_by_tag

extract_by_tag(tag: str) -> List[tuple[ToolCall, Optional[ToolMessage]]]

Extracts tool interactions from the conversation that match a specific tag.

Parameters:

  • tag (str) –

    The tag to filter by (e.g., 'RAG', 'GUARDRAIL').

Returns:

  • List[tuple[ToolCall, Optional[ToolMessage]]]

    A list of tuples, where each tuple contains the tagged ToolCall

  • List[tuple[ToolCall, Optional[ToolMessage]]]

    and its corresponding ToolMessage (or None if not found).

get

get(key: str, default: Any = None) -> Any

Get an attribute value by key, similar to dict.get(). This method correctly handles properties like 'query'.

Parameters:

  • key (str) –

    The attribute name to retrieve.

  • default (Any, default: None ) –

    Value to return if attribute doesn't exist.

Returns:

  • Any

    The attribute value or default if not found.

keys

keys() -> List[str]

Return all public attribute names, including properties and aliases.

Returns:

  • List[str]

    A sorted list of public-facing field and property names.

values

values() -> List[Any]

Return all public attribute values, corresponding to the .keys() method.

Returns:

  • List[Any]

    A list of values for the public-facing fields and properties.

items

items() -> List[tuple]

Return all (key, value) pairs for public attributes.

Returns:

  • List[tuple]

    A list of (key, value) tuples for public-facing fields and properties.

subset

subset(fields: List[str], keep_id: bool = True, copy_annotations: bool = False) -> DatasetItem

Create a new DatasetItem with only the specified fields, all others set to None/empty.

Parameters:

  • fields (List[str]) –

    List of field names to keep (e.g., ['query', 'expected_output'])

  • keep_id (bool, default: True ) –

    Whether to preserve the original ID (default: True)

  • copy_annotations (bool, default: False ) –

    Whether to copy annotations (judgment and critique) to the new item (default: False)

Returns:

  • DatasetItem

    New DatasetItem instance with only specified fields populated

Example

Get item with only query and expected_output

subset_item = item.subset(['query', 'expected_output'])

Get item with query, actual_output, and preserve annotations

subset_item = item.subset(['query', 'actual_output'], copy_annotations=True)

evaluation_fields

evaluation_fields() -> DatasetItem

Extract just the evaluation fields.

update

update(other: Union[DatasetItem, Dict[str, Any]], overwrite: bool = True) -> DatasetItem

Update this DatasetItem with values from another DatasetItem or dictionary. This method correctly handles aliases and special merge logic for lists/dicts.

Parameters:

  • other (Union[DatasetItem, Dict[str, Any]]) –

    Another DatasetItem instance or a dictionary to update from.

  • overwrite (bool, default: True ) –

    If True, overwrite existing values. If False, only fill empty fields.

Returns:

  • DatasetItem

    The updated DatasetItem instance (self).

update_runtime

update_runtime(**kwargs) -> DatasetItem

Update runtime-related fields such as actual_output or retrieved_content.

Parameters:

  • **kwargs

    Runtime fields to update.

Returns:

merge_metadata

merge_metadata(metadata: Union[str, Dict[str, Any]]) -> DatasetItem

Merge new metadata into the existing metadata field.

Parameters:

  • metadata (Union[str, Dict[str, Any]]) –

    A dictionary or JSON string to merge.

Returns:

from_dict classmethod

from_dict(data: Dict[str, Any]) -> DatasetItem

Create a DatasetItem from a dictionary.

Parameters:

  • data (Dict[str, Any]) –

    Dictionary containing item data

Returns: