Synthetic API Reference¶
Synthetic data generation for building evaluation datasets from documents.
DocumentQAGenerator¶
axion.synthetic.DocumentQAGenerator ¶
DocumentQAGenerator(llm: LLMRunnable, params: GenerationParams, embed_model: EmbeddingRunnable = None, max_concurrent: int = 5, show_progress: bool = True, tracer: Optional[BaseTraceHandler] = None, **kwargs)
Orchestrates QA pair generation from multiple documents concurrently.
Initialize DocumentQAGenerator
Parameters:
- llm (LLMRunnable) – The language model to use for generation.
- params (GenerationParams) – A GenerationParams object with all configuration.
- embed_model (EmbeddingRunnable, default: None) – An embedding model used for semantic parsing.
- max_concurrent (int, default: 5) – Maximum number of documents processed concurrently.
- show_progress (bool, default: True) – Whether to show progress bars using tqdm.
generate_from_directory async ¶
Main entry point. Loads documents from a directory and generates QA pairs.
Parameters:
- directory_path – The path to the directory containing documents.
Returns:
- A list of all generated QA pairs.
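Because generate_from_directory is async, it is typically awaited inside a coroutine or driven with asyncio.run. The following is a minimal, self-contained sketch of that call pattern; it uses a stub coroutine in place of a real DocumentQAGenerator, and the 'question'/'answer' keys on each pair are assumptions for illustration, not the library's guaranteed schema:

```python
import asyncio


# Stub standing in for DocumentQAGenerator.generate_from_directory; the real
# method loads documents from directory_path and generates QA pairs with the LLM.
async def generate_from_directory(directory_path: str) -> list:
    # The per-pair fields shown here are assumed for illustration only.
    return [{"question": "What does the document describe?", "answer": "QA generation."}]


# Drive the coroutine from synchronous code.
qa_pairs = asyncio.run(generate_from_directory("./docs"))
print(len(qa_pairs))
```

In application code you would replace the stub with an instantiated generator, e.g. `await generator.generate_from_directory("./docs")` inside your own coroutine.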
to_items ¶
Converts a list of QA evaluation results into a list of DatasetItem objects.
Parameters:
- results (List[Any]) – A list of result dictionaries, each containing a 'qa_pairs' list.
Returns:
- List[DatasetItem] – A list of DatasetItem objects.
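As a rough illustration of the conversion, here is a plain-Python analogue of to_items that flattens each result's 'qa_pairs' list into item dicts. The 'question'/'answer' keys on the pairs and the 'input'/'expected_output' output names are assumptions for this sketch, not the library's actual FieldNames:

```python
from typing import Any, Dict, List


def to_items_sketch(results: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Flatten the 'qa_pairs' list of each result into one list of item dicts."""
    items: List[Dict[str, str]] = []
    for result in results:
        for pair in result.get("qa_pairs", []):
            # Field names here are illustrative; the real DatasetItem schema
            # follows the library's internal FieldNames standards.
            items.append({"input": pair["question"], "expected_output": pair["answer"]})
    return items


results = [
    {"qa_pairs": [{"question": "Q1", "answer": "A1"}]},
    {"qa_pairs": [{"question": "Q2", "answer": "A2"}]},
]
items = to_items_sketch(results)
print(items)
```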
to_dataset ¶
Converts a list of QA evaluation results into a structured Dataset object.
The function extracts the QA pairs from each result, renames the fields to
match internal FieldNames standards, and wraps each into a DatasetItem. These
are then collected into a Dataset for downstream evaluation or analysis.
Parameters:
- results (List[Any]) – A list of result dictionaries, each containing a 'qa_pairs' list.
- dataset_name (str) – The name to assign to the resulting Dataset.
Returns:
- Dataset – A structured Dataset containing DatasetItems with standardized field names.
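The extract-rename-collect flow described above can be sketched with plain dicts in place of the Dataset and DatasetItem classes; the standardized field names used here are an assumption for illustration:

```python
from typing import Any, Dict, List


def to_dataset_sketch(results: List[Dict[str, Any]], dataset_name: str) -> Dict[str, Any]:
    """Collect renamed QA pairs into a named dataset-like dict."""
    items = []
    for result in results:
        for pair in result.get("qa_pairs", []):
            # Renaming step: 'question'/'answer' -> standardized field names.
            # The exact target names are assumed for this sketch.
            items.append({"input": pair["question"], "expected_output": pair["answer"]})
    return {"name": dataset_name, "items": items}


dataset = to_dataset_sketch(
    [{"qa_pairs": [{"question": "Q", "answer": "A"}]}], "docs-qa"
)
print(dataset["name"], len(dataset["items"]))
```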
GenerationParams¶
axion.synthetic.GenerationParams ¶
Bases: RichBaseModel
Configuration parameters for controlling the QA (Question–Answer) generation pipeline.
These settings define how QA pairs are generated from source documents, including the number of pairs, question style and complexity, chunking strategies, and validation thresholds. The configuration supports both factual and synthetic QA creation, enabling flexible generation for training, evaluation, and benchmarking.
Attributes:
- num_pairs (int) – Total number of QA pairs to generate per document.
- question_types (List[str]) – List of question types to generate. Common options include:
  - 'factual': Direct, fact-based questions.
  - 'conceptual': Understanding-based questions.
  - 'application': Scenario-based application questions.
  - 'analysis': Critical thinking and analysis questions.
  - 'synthetic': Artificially created questions for stress-testing.
- difficulty (str) – Target difficulty of generated questions. Options: 'easy', 'medium', or 'hard'.
- splitter_type (Literal['semantic', 'sentence']) – Chunking strategy for breaking documents into sections:
  - 'semantic': Embedding-aware splits for context preservation.
  - 'sentence': Rule-based splits by sentence length.
- chunk_size (int) – Maximum size (in characters or tokens) of each chunk when using splitter_type='sentence'.
- statements_per_chunk (int) – Number of candidate statements generated per chunk before filtering and validation.
- answer_length (str) – Desired length for generated answers. Options: 'short', 'medium', or 'long'.
- dimensions (Optional[Dict[str, Any]]) – A dictionary guiding synthetic data generation. Possible keys:
  - 'features': Data attributes to reflect real-world structure.
  - 'persona': Profiles simulating different perspectives.
  - 'scenarios': Contextual situations to ensure realism.
- custom_guidelines (Optional[str]) – Additional free-text instructions to condition the QA generation process beyond default behavior.
- example_question (Optional[str]) – An example question to guide style, tone, and complexity.
- example_answer (Optional[str]) – An example answer to align generated responses with the desired style and depth.
- max_reflection_iterations (int) – Maximum number of self-reflection and retry loops for improving QA quality during validation.
- validation_threshold (float) – Minimum confidence or faithfulness score (0.0–1.0) required to accept a QA pair.
- breakpoint_percentile_threshold (int) – Percentile threshold for determining sentence breakpoints in semantic chunking. Higher values create fewer, larger chunks.
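Putting the attributes together, a typical configuration might look like the following. This sketch uses a plain dict rather than constructing GenerationParams directly; the field names come from the attribute list above, while the specific values are illustrative:

```python
# Illustrative values for the documented GenerationParams fields; in real use
# these would be passed as keyword arguments to GenerationParams(...).
params = {
    "num_pairs": 10,
    "question_types": ["factual", "conceptual", "analysis"],
    "difficulty": "medium",
    "splitter_type": "sentence",   # or "semantic" with an embed_model
    "chunk_size": 1000,
    "statements_per_chunk": 3,
    "answer_length": "short",
    "dimensions": {"persona": ["new user", "domain expert"]},
    "max_reflection_iterations": 2,
    "validation_threshold": 0.8,
}

# Per the attribute descriptions, the threshold is a score in [0.0, 1.0]
# and splitter_type is limited to the two documented strategies.
assert 0.0 <= params["validation_threshold"] <= 1.0
assert params["splitter_type"] in ("semantic", "sentence")
print(params["difficulty"])
```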