Copilot Agent Architecture

Deep-dive into the OpenAI Agents SDK copilot — request lifecycle, tools, prompt composition, memory, and observability.

Architecture Overview

The AXIS Copilot is a conversational data analyst built on the OpenAI Agents SDK. It accepts natural language questions, translates them into DuckDB SQL queries, and returns formatted answers with optional charts. Every request flows through a layered pipeline of guardrails, context injection, tool execution, and output sanitization.

🛠 12 Tools

From SQL execution to pattern analysis, chart rendering, and dataset export — all callable by the agent.

📚 3 Skills Max

Up to three domain-specific skill files are injected per request, selected by trigger-word matching.

💬 6-Turn History

Conversation context is capped at the last 6 turns to stay within token limits while preserving continuity.

🧠 1000-Char Memory

Cross-session memory injection is capped at 1000 characters — compact SQL patterns and error fixes.

Copilot Request Flow
HTTP POST → Input Guardrails → Request Classifier → Agent Core (tool loop: LLM + 12 tools, bounded by max turns) → Output Sanitizer → SSE Stream → Client
End-to-end request flow from HTTP POST to SSE-streamed response

Request Lifecycle

Every copilot request passes through eight distinct stages. The pipeline is shared between the Pydantic-AI and OpenAI Agents SDK implementations via run_copilot_request() in request_router.py.

1

HTTP POST

The frontend sends a POST to /copilot/stream/oai with the user message, conversation history, dataset label, and optional agent filter.

2

Input Guardrails

sanitize_input() strips prompt-injection patterns, enforces the 2,000-character limit, and raises RequestBlocked if harmful content is detected.

3

Skill & Example Selection

prepare_request() runs trigger-word matching against SKILL.md files (cap: 3) and selects relevant SQL examples for the agent/message combination.

4

Context Build

The framework-specific context factory constructs OAIContext (or CopilotDeps), injecting the ThoughtStream, dataset label, data context, skills, and memory.

5

Schema Injection

_build_schema_context() generates DDL, computes column stats, injects schema hints, metric catalog, memory, and monitoring context into the system prompt.

6

Agent Execution

The OpenAI Agents SDK Runner.run() enters its tool loop — the LLM decides which tools to call, receives results, and iterates until it has a final answer.

7

Output Sanitization

sanitize_output() strips credentials, traceback lines, and sensitive patterns from the response before it reaches the client.

8

SSE Streaming

Thoughts (reasoning, tool usage, observations) stream in real-time via SSE events. The final response and optional chart/download specs are sent as the closing events.
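The eight stages can be condensed into a sketch of the orchestration. Only the names run_copilot_request, sanitize_input, and sanitize_output come from this page; the bodies below are stubs, not the real request_router.py or guardrails.py logic:

```python
# Illustrative sketch of the eight-stage lifecycle (stub logic throughout).

def sanitize_input(message: str) -> str:
    # Stage 2: input guardrails (real code also strips injection patterns)
    if len(message) > 2000:
        raise ValueError("RequestBlocked: message exceeds 2,000 characters")
    return message.strip()

def sanitize_output(text: str) -> str:
    # Stage 7: stand-in for credential/traceback stripping
    return text.replace("Traceback", "[redacted]")

def run_copilot_request(message: str, dataset_label: str) -> dict:
    clean = sanitize_input(message)               # stage 2
    context = {"dataset_label": dataset_label}    # stages 3-5: skills, context, schema
    answer = f"Queried {context['dataset_label']} for: {clean}"  # stage 6 (stubbed agent)
    return {"response": sanitize_output(answer)}  # stages 7-8: sanitize, then stream
```

In the real pipeline the response is streamed over SSE rather than returned, and stage 6 is the Runner.run() tool loop.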

Components

Agent Core (OAICopilotAgent)

The copilot is implemented as a module in app/copilot/oai_agent.py using the OpenAI Agents SDK. The agent is created once per process and reused across requests. Dynamic instructions are injected per-request via a callable.

OAIContext Fields

Both agent implementations share a base context (BaseCopilotContext) with these fields:

| Field | Type | Purpose |
| --- | --- | --- |
| thought_stream | ThoughtStream | Async pub/sub for real-time thought streaming to client |
| dataset_label | str | Which dataset to query (e.g. "evaluation", "monitoring") |
| data_context | dict | Optional schema hints passed from the frontend |
| _cache | dict | Per-request tool result cache (keyed by tool+params+row count) |
| chart_spec | dict \| None | Plotly chart specification set by plot_data tool |
| download_spec | dict \| None | Download metadata set by download_data tool |
| skills_injection | str | Rendered skill bodies injected into system prompt |
| last_sql | str | Most recent SQL query (for error recovery context) |
| sql_examples_injection | str | Matched Q→SQL examples for in-context learning |
| agent_name | str \| None | Agent/source filter applied to all queries |
| sql_executed_this_turn | bool | Tracks whether SQL ran this turn (for memory extraction) |
| error_fixes | list[dict] | Error-to-fix pairs collected during SQL retries |

Agent Creation Pattern

agent = Agent(
    name="axis-copilot",
    instructions=lambda ctx, agent: compose_system_prompt(ctx.context),
    tools=[summarize_data, query_data, run_sql, analyze_data,
           compare_data, plot_data, analyze_patterns, ...],
    hooks=CopilotRunHooks(),
    model=LLMProvider.get_model(),
)
ℹ️ Info
The agent is created once at module level and cached for the process lifetime. The instructions parameter accepts a callable, so context is injected fresh for every request without recreating the agent.

Tool System

The copilot has 12 tools registered via the @function_tool decorator. Each tool follows a consistent pattern: cache check, guardrail validation, DuckDB execution, thought emission, and result truncation.

| Tool | Category | Description | Key Behavior |
| --- | --- | --- | --- |
| summarize_data | Analysis | Dataset overview: schema, row count, filter values, numeric stats | Cached per row count |
| query_data | SQL | Filter, search, min/max lookups with fuzzy column matching | Cap: 50 records |
| analyze_data | Analysis | Statistical aggregations: group-by, averages, distributions | Auto numeric detection |
| compare_data | Analysis | Compare two groups on numeric metrics with delta computation | Paired column stats |
| query_kpi_data | SQL | Query the KPI table for business metrics | Separate table access |
| run_sql | SQL | Execute arbitrary SELECT on DuckDB with 3-attempt auto-fix | Guardrail + retry loop |
| plot_data | Output | Generate Plotly chart specs (bar, line, scatter, pie, histogram) | Sets chart_spec on context |
| analyze_patterns | Analysis | Discover patterns, correlations, outliers in the dataset | Multi-query analysis |
| save_as_dataset | Output | Persist a SQL result as a new named dataset in DuckDB | Creates new table |
| download_data | Output | Prepare CSV/JSON download of query results | Sets download_spec |
| describe_metric_signals | Meta | Describe available metrics and their signal schemas | Schema introspection |
| recall_memory | Meta | Retrieve full cross-session memory for the current schema | Returns all stored patterns |

Tool Execution Pattern

Every tool uses the tool_span context manager for consistent caching, tracing, and thought emission:

@function_tool
async def run_sql(ctx: RunContextWrapper[OAIContext], sql: str) -> str:
    deps = ctx.context
    async with tool_span(
        deps, "run_sql", sql,
        "Executing SQL query...",
        {"sql": sql_fingerprint(sql)},
    ) as (_tracer, _span, cached):
        if cached:
            return cached

        # Guardrail check
        blocked = check_sql_safety(sql)
        if blocked:
            return json.dumps({"error": blocked})

        # Execute with retry loop (up to 3 attempts)
        result = await _execute_with_retry(deps, sql)
        deps.set_cached("run_sql", sql, result)
        return result

SQL Retry Loop

The run_sql tool implements a 3-attempt auto-fix loop for SQL errors:

run_sql retry loop: Execute SQL (attempt 1) → on error, parse a hint → auto-fix via _attempt_sql_fix → retry (attempts 2-3) → success, or surface the error after 3 failed attempts
SQL errors are parsed, auto-fixed, and retried up to 3 times before surfacing the error
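A plausible shape for this loop, sketched below. The real helpers (_execute_with_retry, _attempt_sql_fix) are only named in this doc; the signatures and control flow here are assumptions:

```python
# Hypothetical sketch of the 3-attempt auto-fix loop.
MAX_ATTEMPTS = 3

def execute_with_retry(execute, attempt_fix, sql: str) -> str:
    """Run execute(sql); on failure, ask attempt_fix(sql, error) for a
    corrected query and retry, up to MAX_ATTEMPTS attempts in total."""
    last_error = ""
    for _attempt in range(MAX_ATTEMPTS):
        try:
            return execute(sql)
        except Exception as exc:
            last_error = str(exc)              # parse a hint from the error
            fixed = attempt_fix(sql, last_error)
            if not fixed or fixed == sql:      # no fix available: give up early
                break
            sql = fixed                        # retry with the corrected query
    return '{"error": "%s"}' % last_error      # surface the error to the LLM
```

With an executor that fails once on a typo'd column and a fixer that corrects it, the loop succeeds on the second attempt.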

Prompt Composition

The system prompt is assembled dynamically per request by compose_system_prompt(). It stacks multiple layers, each contributing domain knowledge, rules, and context:

Layer 1: BASE_PROMPT Universal rules, persona, output format
Layer 2: TOOL_INSTRUCTIONS Per-tool when/how guidance
Layer 3: DUCKDB_SQL_NOTES DuckDB SQL dialect quirks
Layer 4: CAPABILITIES + SAFETY_RULES What the agent can/cannot do
Layer 5: Dynamic Schema Context DDL + hints + catalog + memory
Layer 6: Skills Injection Up to 3 matched SKILL.md bodies
💡 Tip
Layers 1-4 are static per deployment. Layers 5-6 are rebuilt for every request, incorporating the current schema state, matched skills, SQL examples, and cross-session memory.
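The layering amounts to ordered string concatenation. A sketch (the constants and context shape below are illustrative, not the real compose_system_prompt()):

```python
# Illustrative prompt layering; layer contents here are placeholders.
BASE_PROMPT = "You are AXIS Copilot, a careful data analyst..."
TOOL_INSTRUCTIONS = "Prefer structured tools; fall back to run_sql for complex queries..."
DUCKDB_SQL_NOTES = "DuckDB dialect: use TRY_CAST, double-quote identifiers..."
SAFETY_RULES = "SELECT-only; never reveal credentials or raw tracebacks..."

def compose_system_prompt(ctx: dict) -> str:
    # Layers 1-4 are static; layers 5-6 come from the per-request context.
    layers = [
        BASE_PROMPT,
        TOOL_INSTRUCTIONS,
        DUCKDB_SQL_NOTES,
        SAFETY_RULES,
        ctx.get("schema_context", ""),    # layer 5: DDL + hints + catalog + memory
        ctx.get("skills_injection", ""),  # layer 6: up to 3 SKILL.md bodies
    ]
    return "\n\n".join(part for part in layers if part)
```

Empty dynamic layers are simply dropped, so a request with no matched skills pays no token cost for layer 6.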

Prompt Size Budget

  • Static layers (1-4): ~2,500 tokens — constant overhead
  • Schema DDL: scales with column count (~10 tokens/column)
  • Memory injection: hard-capped at 1,000 characters
  • Skills: typically 200-500 tokens per skill, max 3
  • SQL examples: 2-5 matched examples, ~100 tokens each

Schema Context Building

The schema context is the richest dynamic section of the prompt. It gives the LLM everything it needs to write correct SQL queries against the current dataset.

1

Get Metadata

store.get_metadata(table) returns column names, types, row count, time range, and filter values from the DuckDB catalog.

2

Compute Column Stats

For numeric columns, compute AVG, MIN, MAX (up to 8 columns). For categorical columns, sample distinct values (up to 15).

3

Build DDL

Generates a DuckDB CREATE TABLE DDL string with column names and types, plus inline comments for ranges and sample values.

4

Inject Extras

Appends schema hints (YAML), metric catalog, cross-session memory, monitoring context, agent filter, last SQL query, and SQL examples.

DDL Format Example

-- DuckDB table: eval_data (12,847 rows)
CREATE TABLE eval_data (
  dataset_id VARCHAR,        -- unique record identifier
  query VARCHAR,             -- input prompt
  actual_output VARCHAR,     -- LLM response text
  faithfulness_score DOUBLE, -- range: [0.12 .. 0.99], avg: 0.74
  relevance_score DOUBLE,    -- range: [0.05 .. 1.00], avg: 0.81
  environment VARCHAR,       -- values: production, staging, dev
  source_name VARCHAR,       -- values: alpha_bot, beta_bot
  timestamp TIMESTAMP
);

Injection Sources

| Source | Origin | When Injected |
| --- | --- | --- |
| Schema DDL | DuckDB catalog metadata | Always |
| Schema Hints | custom/config/schema_hints.yaml | If file exists |
| Metric Catalog | custom/config/metric_catalog.yaml | If file exists |
| Cross-Session Memory | DuckDB KV store | If copilot_memory_enabled |
| Monitoring Context | Monitoring metadata cache | If monitoring dataset active |
| Agent Filter | Request parameter | If agent_name provided |
| Last SQL | Previous turn context | If available |
| SQL Examples | custom/config/sql_examples/ | If matched by trigger words |

Intelligence

Skill System

Skills are domain-specific instruction files that augment the copilot's capabilities for particular types of questions. They are written as Markdown files with YAML frontmatter.

SKILL.md Format

---
name: trend-analysis
description: Analyze time-series trends and detect anomalies
version: "1.0"
priority: 10
triggers:
  - trend
  - over time
  - time series
  - anomaly
  - spike
  - regression
---

## Trend Analysis Skill

When the user asks about trends over time, follow these steps:
1. Use `run_sql` to query with GROUP BY time bucket
2. Look for sustained direction changes (3+ consecutive periods)
3. Flag anomalies where values exceed 2 standard deviations
...

Trigger Matching

Skills are selected by word-boundary matching against the user message and recent conversation context. The matching is scored and capped at 3 skills per request:

  • Word boundary: triggers match as whole words (not substrings)
  • Scored: more trigger matches = higher priority
  • Cap: maximum 3 skills injected per request
  • Rendered: selected skills are wrapped in XML blocks for clear delineation
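A word-boundary matcher with scoring and the 3-skill cap might look like the following sketch (the real registry's scoring, including the frontmatter priority weighting, is not specified here):

```python
import re

def select_skills(message: str, skills: list[dict], cap: int = 3) -> list[dict]:
    """Score skills by whole-word trigger hits in the message; keep top `cap`.
    Hypothetical sketch -- ignores the priority field from SKILL.md frontmatter."""
    scored = []
    for skill in skills:
        hits = sum(
            1 for trigger in skill["triggers"]
            if re.search(rf"\b{re.escape(trigger)}\b", message, re.IGNORECASE)
        )
        if hits:
            scored.append((hits, skill))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # more hits = higher priority
    return [skill for _, skill in scored[:cap]]
```

`re.escape` keeps multi-word triggers like "over time" matching literally, while the `\b` anchors prevent "trend" from matching inside "trendy".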

Skill Discovery

The skill registry loads from two locations, with custom overriding builtin:

  1. Builtin: app/copilot/skills/builtin/ — shipped with AXIS
  2. Custom: custom/config/skills/ — site-specific overrides (by name)
📝 Note
If a custom skill has the same name as a builtin skill, the custom version takes precedence. This allows deployers to override default behavior without modifying the codebase.
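Custom-over-builtin loading can be sketched as an ordered merge. For brevity this keys skills by file stem; the real registry keys by the frontmatter name field:

```python
from pathlib import Path

def load_skill_registry(builtin_dir: str, custom_dir: str) -> dict:
    """Map skill name -> file path; custom files replace builtins by name.
    Sketch only: keys by filename stem rather than frontmatter `name`."""
    registry: dict = {}
    for directory in (builtin_dir, custom_dir):  # custom loaded last, so it wins
        root = Path(directory)
        if not root.is_dir():
            continue
        for path in sorted(root.glob("*.md")):
            registry[path.stem] = path           # same name -> override
    return registry
```

Loading builtin first and custom second means the override needs no special-case logic: a later assignment to the same key replaces the earlier one.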

Cross-Session Memory

The copilot learns from past interactions. After each successful SQL-executing request, patterns are extracted and stored in DuckDB for future injection. Memory is keyed by schema fingerprint so different datasets maintain separate memory stores.

Cross-Session Memory Architecture
Request (with SQL) → Extract (fire-and-forget) → DuckDB KV store (keyed by schema fingerprint) → served two ways: compact injection into the system prompt (1,000 chars) and the always-available recall_memory tool (full, on demand)
Memory is extracted asynchronously and served via two channels: compact injection and the recall_memory tool

Data Model

| Category | What's Stored | Max Items | Example |
| --- | --- | --- | --- |
| SQL Patterns | Successful query templates (sanitized) | 10 | SELECT AVG("?") FROM ? WHERE "?" = '?' |
| Column Frequency | Which columns are queried most often | 15 | faithfulness_score: 12 hits |
| Error Fixes | Error-message → fix-description pairs | 5 | "Column not found" → "Used fuzzy match to find similar column" |

Decay & Eviction

| Mechanism | Value | Effect |
| --- | --- | --- |
| Decay factor | 0.9 | Existing hit counts multiplied by 0.9 before adding new hits |
| Minimum hits | 0.5 | Entries below this threshold are evicted during decay |
| Max injection chars | 1000 | Compact injection is truncated at this limit |
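The decay pass is simple arithmetic, sketched here assuming the store keeps a flat name-to-hit-count mapping:

```python
# Sketch of the decay-and-evict pass (storage shape is assumed).
DECAY = 0.9       # multiply existing hit counts by this before adding new hits
MIN_HITS = 0.5    # entries that decay below this are evicted

def decay_and_evict(hit_counts: dict) -> dict:
    """Apply the decay factor to every entry, then drop entries whose
    count has fallen below the eviction threshold."""
    decayed = {key: count * DECAY for key, count in hit_counts.items()}
    return {key: count for key, count in decayed.items() if count >= MIN_HITS}
```

An entry sitting exactly at the threshold (0.5 hits) decays to 0.45 and is evicted, so stale patterns disappear after a few requests without new hits.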

Schema Fingerprinting

Memory is keyed by a SHA-256 hash (first 12 chars) of the sorted column names and types. When the schema changes (columns added/removed/renamed), the fingerprint changes and the copilot starts fresh memory for the new schema:

def schema_fingerprint(store, table_name: str) -> str:
    meta = store.get_metadata(table_name)
    cols = sorted(
        (c["column_name"], c.get("column_type", ""))
        for c in meta.get("columns", [])
    )
    return hashlib.sha256(str(cols).encode()).hexdigest()[:12]

SQL Error Recovery

When a SQL query fails, the copilot attempts to automatically fix it using heuristic error parsing. The _parse_sql_error_hint() function extracts structured hints from DuckDB error messages, and _attempt_sql_fix() applies corrections.

Error Types Handled

| Error Type | Detection | Auto-Fix Strategy |
| --- | --- | --- |
| Column Not Found | DuckDB "column not found" error | Fuzzy-match against available columns, substitute closest match |
| Type Mismatch | CAST errors, type conversion failures | Wrap with TRY_CAST() to gracefully handle NULLs |
| Table Not Found | "table not found" in error | Substitute with the correct table name from context |
| Syntax Error | DuckDB parser errors | Common fixes: missing quotes, wrong function names, dialect quirks |
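The column-not-found strategy can be approximated with the standard library's difflib. This is a sketch only; the matching algorithm and thresholds of the real _attempt_sql_fix are not documented here:

```python
import difflib

def fix_unknown_column(sql: str, bad_column: str, available: list) -> str:
    """Fuzzy-match a missing column name against the schema and substitute
    the closest candidate; return None if nothing is close enough.
    Hypothetical helper illustrating the column-not-found auto-fix."""
    matches = difflib.get_close_matches(bad_column, available, n=1, cutoff=0.6)
    if not matches:
        return None          # no plausible fix: let the retry loop give up
    return sql.replace(bad_column, matches[0])
```

A typo like `faitfulness_score` scores well above the 0.6 cutoff against `faithfulness_score`, while a completely unrelated name returns None so the error surfaces instead of being "fixed" wrongly.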

Fix Description

After a successful auto-fix, _describe_sql_fix() generates a human-readable description that is both emitted as a thought and stored in cross-session memory for future reference:

# Example output from _describe_sql_fix():
"Column 'faitfulness_score' not found → replaced with 'faithfulness_score' (fuzzy match)"
"CAST error on 'latency' → wrapped with TRY_CAST to handle NULL values"
⚠️ Warning
If all 3 retry attempts fail, the error is returned to the LLM as a tool result. The agent may then try a different approach, ask the user for clarification, or report the error in its response.

Data Flow

Request Pipeline (Detailed)

Here is the full data flow through the copilot's request pipeline, showing exactly which function handles each stage and what data flows between them.

Request Pipeline Sequence
POST /stream/oai → Router → prepare_request() (Classifier) → PreparedRequest → build_context() → Runner.run() + tools (Agent) → sanitize_output() (Sanitizer) → SSE; thought events (reasoning / tool_use / observation / planning) stream in parallel throughout
Sequence showing data flow between router, classifier, agent, sanitizer, and SSE stream

Key Functions

| Function | Module | Responsibility |
| --- | --- | --- |
| run_copilot_request() | request_router.py | Orchestrates full lifecycle: prep → context → execute → sanitize |
| prepare_request() | request_classifier.py | Input sanitization, skill selection, SQL example matching |
| compose_system_prompt() | tool_instructions.py | Assembles all prompt layers into final system instructions |
| _build_schema_context() | oai_agent.py | Generates DDL + stats + hints for the current schema |
| sanitize_output() | guardrails.py | Strips credentials, tracebacks, and sensitive patterns |

Streaming & SSE

The copilot streams real-time feedback to the frontend via Server-Sent Events. The ThoughtStream class provides an async pub/sub system that agent tools and hooks publish into, while the SSE endpoint subscribes and forwards events to the client.

SSE Event Types

| Event | Data | When Emitted |
| --- | --- | --- |
| thought | Thought JSON (type, content, tool, color) | Throughout execution — reasoning, tool use, observations |
| response | Final answer + optional chart/download specs | After agent completes and output is sanitized |
| error | Error message string | On unrecoverable failure |
| done | Empty | Signals end of stream |

ThoughtType Enum

Each thought has a type that determines its display color in the frontend:

REASONING TOOL_USE OBSERVATION PLANNING REFLECTION DECISION ERROR SUCCESS

ThoughtStream Architecture

The stream uses an asyncio.Queue internally. Tools and hooks call emit_*() methods (e.g. emit_reasoning(), emit_tool_use()) which enqueue Thought dataclass instances. The SSE endpoint reads from the queue and yields formatted events:

class ThoughtStream:
    def __init__(self):
        self._queue: asyncio.Queue[Thought | None] = asyncio.Queue()

    async def emit_reasoning(self, content: str, node_name: str | None = None):
        await self._queue.put(Thought(
            type=ThoughtType.REASONING,
            content=content,
            node_name=node_name,
        ))

    async def subscribe(self):
        """Yield thoughts until None sentinel is received."""
        while True:
            thought = await self._queue.get()
            if thought is None:
                break
            yield thought
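On the other side of the queue, each thought is rendered as a text/event-stream frame. A minimal formatter, with field names assumed rather than taken from the real endpoint:

```python
import json

def format_sse(event: str, data: dict = None) -> str:
    """Render one Server-Sent Events frame: an `event:` line, a `data:` line
    with a JSON payload, and the blank-line terminator SSE requires."""
    payload = json.dumps(data or {})
    return f"event: {event}\ndata: {payload}\n\n"
```

The SSE endpoint would yield one such frame per subscribed thought, then a `response` frame with the sanitized answer, and finally a `done` frame.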

Tracing

The copilot integrates with Langfuse for observability. Every request creates a trace span hierarchy that captures tool calls, SQL queries, timings, and errors.

Span Hierarchy

Trace Span Hierarchy
copilot.request                  --- root span
├─ prepare_request               --- guardrails + classification
├─ agent.run                     --- LLM execution
│  ├─ tool:summarize_data        --- tool span
│  ├─ tool:run_sql               --- tool span
│  │  ├─ sql.execute             --- DuckDB query
│  │  ├─ sql.error               --- error + retry
│  │  └─ sql.retry               --- fixed query
│  └─ tool:plot_data             --- tool span
├─ sanitize_output               --- output guardrails
└─ memory.extract                --- fire-and-forget
Trace hierarchy showing nested spans for a typical copilot request with SQL retry

SQL Fingerprinting

For privacy-safe tracing, SQL queries are fingerprinted before being sent to Langfuse. The sql_fingerprint() function replaces literal values with placeholders:

# Input:  SELECT * FROM eval_data WHERE source_name = 'alpha_bot' AND score > 0.8
# Output: SELECT * FROM eval_data WHERE source_name = ? AND score > ?
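A regex-based sketch of this fingerprinting follows; the real sql_fingerprint() may handle additional literal forms (escaped quotes, IN lists, and so on):

```python
import re

def sql_fingerprint(sql: str) -> str:
    """Replace string and numeric literals with `?` placeholders so traces
    never carry raw data values. Sketch of the documented behavior."""
    no_strings = re.sub(r"'[^']*'", "?", sql)             # 'alpha_bot' -> ?
    return re.sub(r"\b\d+(?:\.\d+)?\b", "?", no_strings)  # 0.8 -> ?
```

Stripping string literals first matters: otherwise digits inside quoted values would be replaced piecemeal instead of the whole literal becoming a single `?`.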
ℹ️ Info
Tracing auto-configures on first use. Set TRACING_MODE=langfuse plus LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_BASE_URL environment variables. If unset, tracing resolves to a noop.

Reference

Tool Reference

Complete reference for all 12 copilot tools with inputs, outputs, and behavior details.

| Tool | When to Use | Key Inputs | Output | Cache? |
| --- | --- | --- | --- | --- |
| summarize_data | First interaction or "tell me about the data" | include_numeric_stats: bool | Schema, row count, filter values, numeric stats | Yes |
| query_data | Lookup specific records, find min/max, search by ID | filter_column, filter_value, find_min_column, find_max_column, search_text, limit | Matching records + total count | Yes |
| analyze_data | Aggregations, group-by, distributions | group_by, metric, aggregation, filters | Aggregated statistics | Yes |
| compare_data | Compare two slices (A vs B) on metrics | group_column, group_a, group_b, metrics | Side-by-side stats with deltas | Yes |
| query_kpi_data | Access business KPI table | filters, limit | KPI records | Yes |
| run_sql | Complex queries that structured tools cannot handle | sql: str | Query results (up to 6000 chars) | Yes |
| plot_data | User asks for a chart or visualization | chart_type, sql, x_column, y_column, title | Plotly chart spec (set on context) | No |
| analyze_patterns | Look for correlations, outliers, trends | analysis_type, columns, filters | Pattern analysis results | Yes |
| save_as_dataset | User wants to persist a filtered view | name, sql, description | Confirmation with table name | No |
| download_data | User wants to export data | sql, format, filename | Download spec (set on context) | No |
| describe_metric_signals | User asks about available metrics | (none) | Metric schema descriptions | Yes |
| recall_memory | Agent wants full memory context | (none) | All stored patterns, column freq, error fixes | No |

Configuration

The copilot is controlled by environment variables and YAML config files in the custom/config/ directory.

Environment Settings

| Setting | Default | Description |
| --- | --- | --- |
| copilot_enabled | true | Enable/disable the copilot feature entirely |
| copilot_memory_enabled | true | Enable/disable cross-session memory extraction and injection |
| LLM_MODEL_NAME | gpt-4o-mini | OpenAI model used for the agent |
| OPENAI_API_KEY | | API key for OpenAI (required) |
| TRACING_MODE | | Set to langfuse to enable tracing |
| LANGFUSE_PUBLIC_KEY | | Langfuse public key (when tracing enabled) |
| LANGFUSE_SECRET_KEY | | Langfuse secret key (when tracing enabled) |
| LANGFUSE_BASE_URL | | Langfuse API URL (when tracing enabled) |

YAML Config Files

| File | Location | Purpose |
| --- | --- | --- |
| schema_hints.yaml | custom/config/ | Column descriptions and usage hints injected into schema context |
| metric_catalog.yaml | custom/config/ | Metric definitions, thresholds, and interpretations |
| sql_examples/*.yaml | custom/config/ | Example Q&A pairs for in-context learning |
| skills/*.md | custom/config/ | Domain-specific skill files (SKILL.md format) |
💡 Tip
Use make setup to create the custom/ directory structure and copy .example templates. All config files are optional — the copilot works with auto-discovered schema alone, but adding hints and examples significantly improves answer quality.

Next Steps

AXIS Documentation · Built with MkDocs Material