Document Synthetic QA Generation¶

The Synthetic QA Generation System provides a flexible way to create question-answer pairs from unstructured documents. It enables scalable generation of evaluation datasets for RAG models and other conversational systems through a multi-stage workflow with built-in quality validation.

Note: This is built as an agentic workflow and triggers multiple LLM calls per QA pair, so be mindful of rate limits and associated costs.

Pipeline Stages¶

1. Document Ingestion: Load and prepare documents from directories.
2. Content Chunking: Intelligent splitting while preserving context.
3. Statement Extraction: Extract factual, standalone statements from chunks.
4. Question Generation: Create diverse questions across multiple types.
5. Answer Generation: Generate accurate, grounded answers.
6. Validation & Reflection: Quality assessment with iterative improvement.
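
As a rough mental model, the six stages can be read as a chain of functions. The sketch below is a toy, dependency-free illustration and not Axion's implementation: every function name is hypothetical, and the LLM-backed stages (statement extraction, question generation, answer generation, validation) are replaced with trivial string operations so the flow can run offline.

```python
import re
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str
    score: float = 0.0


def ingest(documents: dict[str, str]) -> list[str]:
    """Stage 1: stand-in for loading files from a directory."""
    return [text.strip() for text in documents.values() if text.strip()]


def chunk(text: str, chunk_size: int = 200) -> list[str]:
    """Stage 2: pack whole sentences into chunks of roughly chunk_size characters."""
    chunks, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if current and len(current) + len(sentence) + 1 > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def extract_statements(chunk_text: str, statements_per_chunk: int = 5) -> list[str]:
    """Stage 3: stand-in for LLM extraction; each sentence counts as a statement."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", chunk_text) if s]
    return sentences[:statements_per_chunk]


def generate_question(statement: str) -> str:
    """Stage 4: stand-in for LLM question generation."""
    return f"What does the document say about: {statement.rstrip('.')}?"


def generate_answer(statement: str) -> str:
    """Stage 5: stand-in for grounded answer generation."""
    return statement


def validate(pair: QAPair) -> QAPair:
    """Stage 6: stand-in scoring; a real validator would call an LLM judge."""
    pair.score = 1.0 if pair.answer else 0.0
    return pair


def run_pipeline(documents: dict[str, str], num_pairs: int) -> list[QAPair]:
    pairs: list[QAPair] = []
    for text in ingest(documents):
        for chunk_text in chunk(text):
            for statement in extract_statements(chunk_text):
                pair = QAPair(generate_question(statement), generate_answer(statement))
                pairs.append(validate(pair))
                if len(pairs) >= num_pairs:
                    return pairs
    return pairs


docs = {"notes.txt": "The cache stores results. Entries expire after one hour. Expired entries are evicted lazily."}
pairs = run_pipeline(docs, num_pairs=2)
```

In the real system each stand-in corresponds to one or more LLM calls, which is why the cost note above matters.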

LangGraph Workflow¶

[Figure: workflow diagram]

Key Components¶

✓ DocumentQAGenerator: Main orchestrator for the entire pipeline.
✓ GenerationParams: Configuration object for all generation parameters.
✓ QAWorkflowGraph: A LangGraph-based workflow execution engine.
✓ Quality Validators: Multi-dimensional assessment and feedback.

Important Note¶

Synthetic data generation with LLMs is not a one-size-fits-all process. The quality and usefulness of the generated data depend heavily on the prompts involved, both the prompt used during answer generation and the prompt your production model runs with.

Best Practice: Reverse-engineer your real application’s instructions when creating meta-prompts. Provide the generation LLM with a realistic example question and answer that mirrors your production setup.

This matters because you’re effectively working with two different prompt contexts:

1. Generation LLM’s prompt: used to produce synthetic answers.
2. Your real model’s prompt: used during inference in your application.

If these prompts differ significantly, the style, tone, and level of detail of the answers can diverge, and the synthetic data will not represent your production environment. Aligning them keeps evaluation and training data consistent with what your application actually produces.
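
One way to keep the two prompt contexts aligned is to derive the generation-side customization from the production prompt itself. The sketch below is an illustration under assumptions: `PRODUCTION_SYSTEM_PROMPT`, the sample question and answer, and the helper `build_customization` are all hypothetical. The only names taken from this page are the three `GenerationParams` fields it fills in (`custom_guidelines`, `example_question`, `example_answer`).

```python
# Hypothetical production prompt; substitute the one your application really uses.
PRODUCTION_SYSTEM_PROMPT = (
    "You are a support assistant for an internal platform. "
    "Answer in at most three sentences and cite the source document."
)


def build_customization(production_prompt: str, question: str, answer: str) -> dict:
    """Return keyword arguments for the customization fields of GenerationParams."""
    guidelines = (
        "Write answers the way the production assistant is instructed to:\n"
        f"{production_prompt.strip()}"
    )
    return {
        "custom_guidelines": guidelines,
        "example_question": question.strip(),
        "example_answer": answer.strip(),
    }


customization = build_customization(
    PRODUCTION_SYSTEM_PROMPT,
    question="How do I rotate an API key?",
    answer="Open Settings, choose API Keys, and select Rotate (source: security.md).",
)
# The result can be unpacked into the config, e.g. GenerationParams(num_pairs=10, **customization)
```

Keeping the production prompt in one constant means a change to the application's instructions flows into the synthetic data the next time it is generated.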

Configuration¶

GenerationParams¶

The GenerationParams class holds every setting for the generation process, from chunking to quality control:

from axion.synthetic import GenerationParams

params = GenerationParams(
    # Content Processing
    splitter_type="semantic",                    # or "sentence"
    chunk_size=2048,                            # For sentence splitter
    breakpoint_percentile_threshold=95,         # For semantic splitter

    # Generation Control
    num_pairs=1,                                # Number of QA pairs to generate
    statements_per_chunk=5,                     # Statements extracted per chunk

    # Question Configuration
    question_types=["factual", "analytical"],   # Types of questions
    difficulty="medium",                        # easy, medium, hard

    # Answer Configuration
    answer_length="medium",                     # short, medium, long

    # Quality Control
    validation_threshold=0.8,                   # Quality threshold (0-1)
    max_reflection_iterations=3,                # Max improvement iterations

    # Customization
    custom_guidelines="Focus on technical accuracy",
    example_question="What is the primary function of X?",
    example_answer="The primary function of X is..."
)

Usage Patterns¶

Configuration Options¶

from axion.synthetic import DocumentQAGenerator, GenerationParams
import pandas as pd

params = GenerationParams(
    # Content processing
    splitter_type="semantic",
    breakpoint_percentile_threshold=90,
    statements_per_chunk=8,

    # Generation scope
    num_pairs=1,
    question_types=["factual", "conceptual", "analytical", "application"],
    difficulty="easy",
    answer_length="long",

    # Quality requirements
    validation_threshold=0.85,
    max_reflection_iterations=5,

    # Domain customization
    custom_guidelines="""
    Focus on technical accuracy and real-world applications.
    Emphasize practical implementation details.
    Include architectural considerations where relevant.
    """,

    example_question="How would you implement X in a production environment?",
    example_answer="To implement X in production, you would need to consider..."
)

# `llm` and `embed_model` are assumed to be initialized earlier in your code.
qa_generator = DocumentQAGenerator(
    llm=llm,
    params=params,
    embed_model=embed_model,  # Required for semantic splitting
    max_concurrent=3,         # Conservative for complex processing
)

# Generate QA pairs from a directory (top-level await needs a notebook or async context)
results = await qa_generator.generate_from_directory('path/to/documents/')
pd.DataFrame(results)
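
The top-level `await` above works in a notebook or another environment with a running event loop. In a plain script, the call would typically be wrapped in a coroutine and started with `asyncio.run`. The sketch below shows only that pattern: `StubGenerator` is a hypothetical stand-in that mimics the `generate_from_directory` coroutine so the example runs without Axion, an LLM, or documents, and the shape of the returned records is an assumption.

```python
import asyncio


class StubGenerator:
    """Hypothetical stand-in for DocumentQAGenerator (no LLM calls are made)."""

    async def generate_from_directory(self, directory: str) -> list[dict]:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return [{"question": "What is X?", "answer": "X is a placeholder.", "source": directory}]


async def main() -> list[dict]:
    qa_generator = StubGenerator()  # replace with the real generator in practice
    return await qa_generator.generate_from_directory("path/to/documents/")


# asyncio.run creates an event loop, runs main() to completion, and closes the loop.
results = asyncio.run(main())
```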

Parameter Reference¶

Content Processing Parameters¶

Parameter	Type	Default	Description
`splitter_type`	`"semantic"` \| `"sentence"`	`"sentence"`	Text splitting strategy
`chunk_size`	`int`	`2048`	Target chunk size for sentence splitter
`breakpoint_percentile_threshold`	`int`	`95`	Semantic similarity threshold (80-100)
`statements_per_chunk`	`int`	`5`	Number of statements to extract per chunk
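
A percentile threshold of this kind is commonly applied to the distances between consecutive sentence embeddings: a chunk boundary is placed wherever the distance exceeds the chosen percentile of all distances, so a higher value yields fewer, larger chunks. The sketch below illustrates that general idea with hand-written distances. It assumes Axion's semantic splitter behaves along these lines, which this page does not confirm, and `find_breakpoints` is a hypothetical helper.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Linear-interpolation percentile for 0 <= pct <= 100."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower, upper = math.floor(rank), math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def find_breakpoints(distances: list[float], threshold_pct: float) -> list[int]:
    """Return indices i where a split falls between sentence i and sentence i + 1."""
    cutoff = percentile(distances, threshold_pct)
    return [i for i, d in enumerate(distances) if d > cutoff]


# distances[i] = dissimilarity between sentence i and sentence i + 1 (made-up values)
distances = [0.10, 0.12, 0.80, 0.11, 0.09, 0.65, 0.10, 0.13, 0.12, 0.11]

strict = find_breakpoints(distances, 95)  # keeps only the sharpest topic shift
looser = find_breakpoints(distances, 80)  # more boundaries, hence smaller chunks
```

With these values the stricter threshold splits only at the largest jump, while the looser one also splits at the second-largest.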

Generation Control Parameters¶

Parameter	Type	Default	Description
`num_pairs`	`int`	`10`	Total QA pairs to generate (1-100)
`question_types`	`List[str]`	`["factual", "analytical"]`	Types of questions to generate
`difficulty`	`"easy"` \| `"medium"` \| `"hard"`	`"medium"`	Question complexity level
`answer_length`	`"short"` \| `"medium"` \| `"long"`	`"medium"`	Target answer length

Quality Control Parameters¶

Parameter	Type	Default	Description
`validation_threshold`	`float`	`0.8`	Minimum quality score (0.0-1.0)
`max_reflection_iterations`	`int`	`3`	Maximum improvement iterations (1-10)

Customization Parameters¶

Parameter	Type	Default	Description
`custom_guidelines`	`str`	`None`	Additional generation instructions
`example_question`	`str`	`None`	Example question for style guidance
`example_answer`	`str`	`None`	Example answer for style guidance
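
The ranges listed in the tables above can be checked before any LLM calls are made. The sketch below is a standalone illustration: `check_ranges` is a hypothetical helper, not part of Axion, and `GenerationParams` may already enforce some or all of these constraints itself.

```python
# Inclusive bounds copied from the parameter tables on this page.
RANGES = {
    "breakpoint_percentile_threshold": (80, 100),
    "num_pairs": (1, 100),
    "validation_threshold": (0.0, 1.0),
    "max_reflection_iterations": (1, 10),
}


def check_ranges(config: dict) -> list[str]:
    """Return one message per out-of-range value; an empty list means all values fit."""
    problems = []
    for name, (low, high) in RANGES.items():
        if name in config and not (low <= config[name] <= high):
            problems.append(f"{name}={config[name]} is outside [{low}, {high}]")
    return problems


config = {"num_pairs": 250, "validation_threshold": 0.85, "max_reflection_iterations": 5}
problems = check_ranges(config)
```

Running a check like this first avoids spending LLM calls on a run that was misconfigured from the start.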

Quality Validation¶

Validation Dimensions¶

The system evaluates QA pairs across five dimensions:

1. Accuracy: Is the answer factually correct and well-grounded?
2. Completeness: Does it fully address the question?
3. Relevance: Is it directly aligned with the question?
4. Clarity: Is the language clear and understandable?
5. Factual Integrity: Does it avoid hallucination or extraneous information?
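
How the five dimension scores are combined into the single value compared against `validation_threshold` is not specified on this page. The sketch below assumes, purely for illustration, an unweighted mean of per-dimension scores in the range 0 to 1; the function and key names are hypothetical.

```python
from statistics import mean

DIMENSIONS = ("accuracy", "completeness", "relevance", "clarity", "factual_integrity")


def overall_score(scores: dict[str, float]) -> float:
    """Unweighted mean over the five dimensions (an assumed aggregation rule)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    return mean(scores[d] for d in DIMENSIONS)


def weakest_dimension(scores: dict[str, float]) -> str:
    """The lowest-scoring dimension is a natural target for reflection feedback."""
    return min(DIMENSIONS, key=lambda d: scores[d])


scores = {
    "accuracy": 0.9,
    "completeness": 0.5,
    "relevance": 0.9,
    "clarity": 0.8,
    "factual_integrity": 0.8,
}
passes = overall_score(scores) >= 0.8  # 0.8 is the documented default threshold
```

Under this assumed rule a single weak dimension can pull an otherwise strong pair below the threshold, which is what would trigger reflection.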

Reflection Process¶

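When a QA pair scores below the validation threshold, the system repeats the following loop until the pair passes or max_reflection_iterations is reached: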

graph LR
    I["Identify Issues"] --> F["Generate Feedback"]
    F --> P["Enhance Prompts"]
    P --> R["Regenerate QA"]
    R --> V["Re-validate"]
    V -->|"Below threshold"| I
    V -->|"Passes"| D["Done"]
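
The loop in the diagram can be sketched in a few lines. This is an illustrative outline, not Axion's implementation: `reflect`, `generate`, and `score` are hypothetical names, and the two callables stand in for the LLM-backed regeneration and validation steps.

```python
from typing import Callable


def reflect(
    generate: Callable[[list[str]], str],
    score: Callable[[str], tuple[float, str]],
    validation_threshold: float = 0.8,
    max_reflection_iterations: int = 3,
) -> tuple[str, float, int]:
    """Regenerate until the score passes or the iteration budget is spent."""
    feedback: list[str] = []
    answer = generate(feedback)
    quality, issue = score(answer)
    iterations = 0
    while quality < validation_threshold and iterations < max_reflection_iterations:
        feedback.append(issue)          # identify issues and turn them into feedback
        answer = generate(feedback)     # regenerate with the enhanced prompt
        quality, issue = score(answer)  # re-validate
        iterations += 1
    return answer, quality, iterations


# Toy stand-ins: each round of feedback adds detail, and more detail scores higher.
def toy_generate(feedback: list[str]) -> str:
    return "grounded answer" + " with detail" * len(feedback)


def toy_score(answer: str) -> tuple[float, str]:
    return min(1.0, 0.5 + 0.2 * answer.count("detail")), "add supporting detail"


answer, quality, iterations = reflect(toy_generate, toy_score)
```

Capping the loop matters because every pass costs additional LLM calls, and a pair that never reaches the threshold would otherwise loop indefinitely.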

Integration with Evaluation¶

Synthetic generation is also available through the Dataset class in Axion:

from axion.synthetic.schema import GenerationParams
from axion.dataset import Dataset

dataset = Dataset(name='Product-Growth-Knowledge')

params = GenerationParams(
    num_pairs=1,
    question_types=["factual", "conceptual", "application"],
    difficulty="medium",
    chunk_size=4000,  # Target chunk size for the sentence splitter
    statements_per_chunk=5,
    answer_length="medium",
    splitter_type="sentence",
    custom_guidelines="Focus on application scenarios in the questions.",
    max_reflection_iterations=3,
    validation_threshold=0.7
)

# `llm` is assumed to be initialized earlier in your code.
dataset.synthetic_generate_from_directory(
    directory_path='small_docs/',
    llm=llm,
    params=params
)
