Synthetic API Reference

Synthetic data generation for building evaluation datasets from documents.

from axion.synthetic import DocumentQAGenerator, GenerationParams

DocumentQAGenerator

axion.synthetic.DocumentQAGenerator

DocumentQAGenerator(llm: LLMRunnable, params: GenerationParams, embed_model: EmbeddingRunnable = None, max_concurrent: int = 5, show_progress: bool = True, tracer: Optional[BaseTraceHandler] = None, **kwargs)

Orchestrates QA pair generation from multiple documents concurrently.

Initialize DocumentQAGenerator

Parameters:

  • llm (LLMRunnable) –

    The language model to use for generation.

  • params (GenerationParams) –

    A GenerationParams object with all configuration.

  • embed_model (EmbeddingRunnable, default: None ) –

    An embedding model used for semantic parsing.

  • max_concurrent (int, default: 5 ) –

    Maximum number of retrievers to run concurrently.

  • show_progress (bool, default: True ) –

    Whether to show progress bars using tqdm.

generate_from_directory async

generate_from_directory(directory_path: str) -> List[Dict[str, Any]]

Main entry point. Loads documents from a directory and generates QA pairs.

Parameters:

  • directory_path (str) –

    The path to the directory containing documents.

Returns:

  • List[Dict[str, Any]]

    A list of all generated QA pairs.
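As a minimal sketch of the end-to-end call (assuming `my_llm` is an already-constructed LLMRunnable instance — the variable name and the directory path are hypothetical, and the GenerationParams values are illustrative only):

```python
import asyncio

from axion.synthetic import DocumentQAGenerator, GenerationParams

# `my_llm` stands in for any configured LLMRunnable implementation.
params = GenerationParams(num_pairs=10, difficulty="medium")
generator = DocumentQAGenerator(llm=my_llm, params=params, max_concurrent=5)

# generate_from_directory is async, so it must be awaited
# (or driven with asyncio.run from synchronous code).
qa_pairs = asyncio.run(generator.generate_from_directory("./docs"))
print(f"Generated {len(qa_pairs)} QA pairs")
```

Because generation runs up to max_concurrent documents at a time, larger values trade memory and rate-limit pressure for throughput.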

to_items

to_items(results: List[Any]) -> List

Converts a list of QA evaluation results into a List of DatasetItems.

Parameters:

  • results (List[Any]) –

    A list of result dictionaries, each containing a 'qa_pairs' list.

Returns:

  • List –

    A list of DatasetItem objects.

to_dataset

to_dataset(results: List[Any], dataset_name: str)

Converts a list of QA evaluation results into a structured Dataset object.

The function extracts the QA pairs from each result, renames the fields to match internal FieldNames standards, and wraps each pair in a DatasetItem. These are then collected into a Dataset for downstream evaluation or analysis.

Parameters:

  • results (List[Any]) –

    A list of result dictionaries, each containing a 'qa_pairs' list.

  • dataset_name (str) –

    The name to assign to the resulting Dataset.

Returns:

  • Dataset

    A structured Dataset containing DatasetItems with standardized field names.
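A hedged sketch of both conversion helpers (assuming `generator` is an already-constructed DocumentQAGenerator and `results` is the list returned by generate_from_directory; both names are placeholders):

```python
# Convert raw generation results into DatasetItem objects...
items = generator.to_items(results)

# ...or wrap them directly into a named Dataset for evaluation.
dataset = generator.to_dataset(results, dataset_name="my-eval-set")
```

to_items is useful when you want to merge items into an existing dataset; to_dataset is the one-step path to a standalone Dataset.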


GenerationParams

axion.synthetic.GenerationParams

GenerationParams(**data)

Bases: RichBaseModel

Configuration parameters for controlling the QA (Question–Answer) generation pipeline.

These settings define how QA pairs are generated from source documents, including the number of pairs, question style and complexity, chunking strategies, and validation thresholds. The configuration supports both factual and synthetic QA creation, enabling flexible generation for training, evaluation, and benchmarking.

Attributes:

  • num_pairs (int) –

    Total number of QA pairs to generate per document.

  • question_types (List[str]) –

    List of question types to generate. Common options include:

      - 'factual': Direct, fact-based questions.
      - 'conceptual': Understanding-based questions.
      - 'application': Scenario-based application questions.
      - 'analysis': Critical thinking and analysis questions.
      - 'synthetic': Artificially created questions for stress-testing.

  • difficulty (str) –

    Target difficulty of generated questions. Options include: 'easy', 'medium', and 'hard'.

  • splitter_type (Literal['semantic', 'sentence']) –

    Chunking strategy for breaking documents into sections:

      - 'semantic': Embedding-aware splits for context preservation.
      - 'sentence': Rule-based splits by sentence length.

  • chunk_size (int) –

    Maximum size (in characters or tokens) of each chunk when using splitter_type='sentence'.

  • statements_per_chunk (int) –

    Number of candidate statements generated per chunk before filtering and validation.

  • answer_length (str) –

    Desired length for generated answers. Options: 'short', 'medium', or 'long'.

  • dimensions (Optional[Dict[str, Any]]) –

    A dictionary guiding synthetic data generation. Possible keys:

      - 'features': Data attributes to reflect real-world structure.
      - 'persona': Profiles simulating different perspectives.
      - 'scenarios': Contextual situations to ensure realism.

  • custom_guidelines (Optional[str]) –

    Additional free-text instructions to condition the QA generation process beyond default behavior.

  • example_question (Optional[str]) –

    An example question to guide style, tone, and complexity.

  • example_answer (Optional[str]) –

    An example answer to align generated responses with the desired style and depth.

  • max_reflection_iterations (int) –

    Maximum self-reflection and retry loops for improving QA quality during validation.

  • validation_threshold (float) –

    Minimum confidence or faithfulness score (0.0–1.0) required to accept a QA pair.

  • breakpoint_percentile_threshold (int) –

    Percentile threshold for determining sentence breakpoints in semantic chunking. Higher values create fewer, larger chunks.
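A configuration sketch tying the attributes above together (all values are illustrative, not documented defaults):

```python
from axion.synthetic import GenerationParams

# Illustrative configuration; tune values for your corpus.
params = GenerationParams(
    num_pairs=20,                        # QA pairs per document
    question_types=["factual", "conceptual"],
    difficulty="medium",
    splitter_type="semantic",            # embedding-aware chunking
    statements_per_chunk=3,
    answer_length="short",
    validation_threshold=0.8,            # reject pairs scoring below 0.8
    breakpoint_percentile_threshold=95,  # fewer, larger semantic chunks
)
```

Note that chunk_size applies only when splitter_type='sentence', while breakpoint_percentile_threshold applies only to the 'semantic' strategy.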