Copilot Agent Architecture
Deep-dive into the OpenAI Agents SDK copilot — request lifecycle, tools, prompt composition, memory, and observability.
Architecture Overview
The AXIS Copilot is a conversational data analyst built on the OpenAI Agents SDK. It accepts natural language questions, translates them into DuckDB SQL queries, and returns formatted answers with optional charts. Every request flows through a layered pipeline of guardrails, context injection, tool execution, and output sanitization.
12 Tools
From SQL execution to pattern analysis, chart rendering, and dataset export — all callable by the agent.
3 Skills Max
Up to three domain-specific skill files are injected per request, selected by trigger-word matching.
6-Turn History
Conversation context is capped at the last 6 turns to stay within token limits while preserving continuity.
1000-Char Memory
Cross-session memory injection is capped at 1000 characters — compact SQL patterns and error fixes.
Request Lifecycle
Every copilot request passes through eight distinct stages. The pipeline is shared between the Pydantic-AI and OpenAI Agents SDK implementations via run_copilot_request() in request_router.py.
HTTP POST
The frontend sends a POST to /copilot/stream/oai with the user message, conversation history, dataset label, and optional agent filter.
Input Guardrails
sanitize_input() strips prompt-injection patterns, enforces the 2,000-character limit, and raises RequestBlocked if harmful content is detected.
Skill & Example Selection
prepare_request() runs trigger-word matching against SKILL.md files (cap: 3) and selects relevant SQL examples for the agent/message combination.
Context Build
The framework-specific context factory constructs OAIContext (or CopilotDeps), injecting the ThoughtStream, dataset label, data context, skills, and memory.
Schema Injection
_build_schema_context() generates DDL, computes column stats, injects schema hints, metric catalog, memory, and monitoring context into the system prompt.
Agent Execution
The OpenAI Agents SDK Runner.run() enters its tool loop — the LLM decides which tools to call, receives results, and iterates until it has a final answer.
Output Sanitization
sanitize_output() strips credentials, traceback lines, and sensitive patterns from the response before it reaches the client.
SSE Streaming
Thoughts (reasoning, tool usage, observations) stream in real-time via SSE events. The final response and optional chart/download specs are sent as the closing events.
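The stages above can be condensed into a small sketch. Everything below is illustrative rather than the actual implementation: `run_pipeline` and the injection regex are assumptions; only the 2,000-character limit and the `RequestBlocked` exception come from the pipeline described above.

```python
import re

MAX_INPUT_CHARS = 2000

class RequestBlocked(Exception):
    """Raised when the input guardrail rejects a message."""

def sanitize_input(message: str) -> str:
    # Enforce the 2,000-character limit and strip one simple injection pattern
    if len(message) > MAX_INPUT_CHARS:
        raise RequestBlocked("message exceeds 2,000 characters")
    return re.sub(r"(?i)ignore (all )?previous instructions", "", message).strip()

def run_pipeline(message: str) -> dict:
    # Stages condensed: guardrail -> (prepare/context/agent execution elided)
    clean = sanitize_input(message)
    return {"response": clean}
```

The real pipeline threads the sanitized message through skill selection and context construction before the agent runs; this sketch only shows the guardrail boundary.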
Agent Core (OAICopilotAgent)
The copilot is implemented as a module in app/copilot/oai_agent.py using the OpenAI Agents SDK. The agent is created once per process and reused across requests. Dynamic instructions are injected per-request via a callable.
OAIContext Fields
Both agent implementations share a base context (BaseCopilotContext) with these fields:
| Field | Type | Purpose |
|---|---|---|
| thought_stream | ThoughtStream | Async pub/sub for real-time thought streaming to client |
| dataset_label | str | Which dataset to query (e.g. "evaluation", "monitoring") |
| data_context | dict | Optional schema hints passed from the frontend |
| _cache | dict | Per-request tool result cache (keyed by tool+params+row count) |
| chart_spec | dict \| None | Plotly chart specification set by plot_data tool |
| download_spec | dict \| None | Download metadata set by download_data tool |
| skills_injection | str | Rendered skill bodies injected into system prompt |
| last_sql | str | Most recent SQL query (for error recovery context) |
| sql_examples_injection | str | Matched Q→SQL examples for in-context learning |
| agent_name | str \| None | Agent/source filter applied to all queries |
| sql_executed_this_turn | bool | Tracks whether SQL ran this turn (for memory extraction) |
| error_fixes | list[dict] | Error-to-fix pairs collected during SQL retries |
Agent Creation Pattern
agent = Agent(
    name="axis-copilot",
    instructions=lambda ctx, _agent: compose_system_prompt(ctx.context),
    tools=[summarize_data, query_data, run_sql, analyze_data,
           compare_data, plot_data, analyze_patterns, ...],
    hooks=CopilotRunHooks(),
    model=LLMProvider.get_model(),
)
The `instructions` parameter accepts a callable, so context is injected fresh for every request without recreating the agent.
Tool System
The copilot has 12 tools registered via the @function_tool decorator. Each tool follows a consistent pattern: cache check, guardrail validation, DuckDB execution, thought emission, and result truncation.
| Tool | Category | Description | Key Behavior |
|---|---|---|---|
| summarize_data | Analysis | Dataset overview: schema, row count, filter values, numeric stats | Cached per row-count |
| query_data | SQL | Filter, search, min/max lookups with fuzzy column matching | Cap: 50 records |
| analyze_data | Analysis | Statistical aggregations: group-by, averages, distributions | Auto numeric detection |
| compare_data | Analysis | Compare two groups on numeric metrics with delta computation | Paired column stats |
| query_kpi_data | SQL | Query the KPI table for business metrics | Separate table access |
| run_sql | SQL | Execute arbitrary SELECT on DuckDB with 3-attempt auto-fix | Guardrail + retry loop |
| plot_data | Output | Generate Plotly chart specs (bar, line, scatter, pie, histogram) | Sets chart_spec on context |
| analyze_patterns | Analysis | Discover patterns, correlations, outliers in the dataset | Multi-query analysis |
| save_as_dataset | Output | Persist a SQL result as a new named dataset in DuckDB | Creates new table |
| download_data | Output | Prepare CSV/JSON download of query results | Sets download_spec |
| describe_metric_signals | Meta | Describe available metrics and their signal schemas | Schema introspection |
| recall_memory | Meta | Retrieve full cross-session memory for the current schema | Returns all stored patterns |
Tool Execution Pattern
Every tool uses the tool_span context manager for consistent caching, tracing, and thought emission:
@function_tool
async def run_sql(ctx: RunContextWrapper[OAIContext], sql: str) -> str:
    deps = ctx.context
    async with tool_span(
        deps, "run_sql", sql,
        "Executing SQL query...",
        {"sql": sql_fingerprint(sql)},
    ) as (_tracer, _span, cached):
        if cached:
            return cached
        # Guardrail check
        blocked = check_sql_safety(sql)
        if blocked:
            return json.dumps({"error": blocked})
        # Execute with retry loop (up to 3 attempts)
        result = await _execute_with_retry(deps, sql)
        deps.set_cached("run_sql", sql, result)
        return result
SQL Retry Loop
The run_sql tool implements a 3-attempt auto-fix loop for SQL errors: each failure is parsed into a structured hint, a heuristic correction is applied, and the corrected query is re-executed.
Prompt Composition
The system prompt is assembled dynamically per request by compose_system_prompt(). It stacks multiple layers, each contributing domain knowledge, rules, and context.
Prompt Size Budget
- Static layers (1-4): ~2,500 tokens — constant overhead
- Schema DDL: scales with column count (~10 tokens/column)
- Memory injection: hard-capped at 1,000 characters
- Skills: typically 200-500 tokens per skill, max 3
- SQL examples: 2-5 matched examples, ~100 tokens each
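Under those budget assumptions, the layering can be sketched as follows. The `compose_prompt` name, the static strings, and the exact layer list are illustrative; the real `compose_system_prompt()` includes more sections, but the pattern of "static layers first, empty dynamic layers skipped, memory hard-capped at 1,000 characters" is the same.

```python
from types import SimpleNamespace

STATIC_ROLE = "You are AXIS Copilot, a conversational data analyst."  # layers 1-4 condensed
STATIC_RULES = "Only emit read-only SELECT statements against DuckDB."

def compose_prompt(ctx) -> str:
    """Sketch: static layers first, then the dynamic per-request sections."""
    layers = [
        STATIC_ROLE,
        STATIC_RULES,
        ctx.schema_context,                 # DDL + column stats
        ctx.memory_injection[:1000],        # hard cap: 1,000 characters
        ctx.skills_injection,               # up to 3 matched skills
        ctx.sql_examples_injection,         # matched Q->SQL examples
    ]
    return "\n\n".join(layer for layer in layers if layer)
```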
Schema Context Building
The schema context is the richest dynamic section of the prompt. It gives the LLM everything it needs to write correct SQL queries against the current dataset.
Get Metadata
store.get_metadata(table) returns column names, types, row count, time range, and filter values from the DuckDB catalog.
Compute Column Stats
For numeric columns, compute AVG, MIN, MAX (up to 8 columns). For categorical columns, sample distinct values (up to 15).
Build DDL
Generates a DuckDB CREATE TABLE DDL string with column names and types, plus inline comments for ranges and sample values.
Inject Extras
Appends schema hints (YAML), metric catalog, cross-session memory, monitoring context, agent filter, last SQL query, and SQL examples.
DDL Format Example
-- DuckDB table: eval_data (12,847 rows)
CREATE TABLE eval_data (
dataset_id VARCHAR, -- unique record identifier
query VARCHAR, -- input prompt
actual_output VARCHAR, -- LLM response text
faithfulness_score DOUBLE, -- range: [0.12 .. 0.99], avg: 0.74
relevance_score DOUBLE, -- range: [0.05 .. 1.00], avg: 0.81
environment VARCHAR, -- values: production, staging, dev
source_name VARCHAR, -- values: alpha_bot, beta_bot
timestamp TIMESTAMP
);
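A helper that renders this annotated DDL from catalog metadata might look like the sketch below. The metadata-dict field names (`name`, `type`, `min`, `max`, `avg`, `values`) are assumptions, not the actual `store.get_metadata()` shape.

```python
def build_ddl(table: str, row_count: int, columns: list[dict]) -> str:
    """Sketch: render an annotated CREATE TABLE block from catalog metadata."""
    lines = [
        f"-- DuckDB table: {table} ({row_count:,} rows)",
        f"CREATE TABLE {table} (",
    ]
    for i, col in enumerate(columns):
        sep = "," if i < len(columns) - 1 else ""  # no comma after last column
        comment = ""
        if "min" in col:
            comment = f'  -- range: [{col["min"]} .. {col["max"]}], avg: {col["avg"]}'
        elif "values" in col:
            comment = f'  -- values: {", ".join(col["values"])}'
        lines.append(f'    {col["name"]} {col["type"]}{sep}{comment}')
    lines.append(");")
    return "\n".join(lines)
```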
Injection Sources
| Source | Origin | When Injected |
|---|---|---|
| Schema DDL | DuckDB catalog metadata | Always |
| Schema Hints | custom/config/schema_hints.yaml | If file exists |
| Metric Catalog | custom/config/metric_catalog.yaml | If file exists |
| Cross-Session Memory | DuckDB KV store | If copilot_memory_enabled |
| Monitoring Context | Monitoring metadata cache | If monitoring dataset active |
| Agent Filter | Request parameter | If agent_name provided |
| Last SQL | Previous turn context | If available |
| SQL Examples | custom/config/sql_examples/ | If matched by trigger words |
Skill System
Skills are domain-specific instruction files that augment the copilot's capabilities for particular types of questions. They are written as Markdown files with YAML frontmatter.
SKILL.md Format
---
name: trend-analysis
description: Analyze time-series trends and detect anomalies
version: "1.0"
priority: 10
triggers:
- trend
- over time
- time series
- anomaly
- spike
- regression
---
## Trend Analysis Skill
When the user asks about trends over time, follow these steps:
1. Use `run_sql` to query with GROUP BY time bucket
2. Look for sustained direction changes (3+ consecutive periods)
3. Flag anomalies where values exceed 2 standard deviations
...
Trigger Matching
Skills are selected by word-boundary matching against the user message and recent conversation context. The matching is scored and capped at 3 skills per request:
- Word boundary: triggers match as whole words (not substrings)
- Scored: more trigger matches = higher priority
- Cap: maximum 3 skills injected per request
- Rendered: selected skills are wrapped in XML blocks for clear delineation
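The four rules above can be sketched in a few lines; the skill-dict shape is an assumption, and the real registry also weighs recent conversation context.

```python
import re

def select_skills(message: str, skills: list[dict], cap: int = 3) -> list[dict]:
    """Sketch: score each skill by whole-word trigger hits, keep the top `cap`."""
    text = message.lower()
    scored = []
    for skill in skills:
        hits = sum(
            1 for trig in skill["triggers"]
            if re.search(rf"\b{re.escape(trig.lower())}\b", text)  # word boundary
        )
        if hits:
            scored.append((hits, skill))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # more hits = higher priority
    return [skill for _, skill in scored[:cap]]
```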
Skill Discovery
The skill registry loads from two locations, with custom overriding builtin:
- Builtin: app/copilot/skills/builtin/ — shipped with AXIS
- Custom: custom/config/skills/ — site-specific overrides (by name)

If a custom skill shares a name with a builtin skill, the custom version takes precedence. This allows deployers to override default behavior without modifying the codebase.
Cross-Session Memory
The copilot learns from past interactions. After each successful SQL-executing request, patterns are extracted and stored in DuckDB for future injection. Memory is keyed by schema fingerprint so different datasets maintain separate memory stores.
Data Model
| Category | What's Stored | Max Items | Example |
|---|---|---|---|
| SQL Patterns | Successful query templates (sanitized) | 10 | SELECT AVG("?") FROM ? WHERE "?" = '?' |
| Column Frequency | Which columns are queried most often | 15 | faithfulness_score: 12 hits |
| Error Fixes | Error-message → fix-description pairs | 5 | "Column not found" → "Used fuzzy match to find similar column" |
Decay & Eviction
| Mechanism | Value | Effect |
|---|---|---|
| Decay factor | 0.9 | Existing hit counts multiplied by 0.9 before adding new hits |
| Minimum hits | 0.5 | Entries below this threshold are evicted during decay |
| Max injection chars | 1000 | Compact injection is truncated at this limit |
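The decay pass from the table above, as a sketch:

```python
DECAY = 0.9       # existing hit counts shrink by 10% each pass
MIN_HITS = 0.5    # entries below this threshold are evicted

def decay_and_evict(hit_counts: dict[str, float]) -> dict[str, float]:
    """Multiply every hit count by the decay factor, dropping faded entries."""
    return {
        key: count * DECAY
        for key, count in hit_counts.items()
        if count * DECAY >= MIN_HITS
    }
```

An entry that is never reinforced falls below 0.5 after a few passes and disappears, so memory stays biased toward patterns that keep getting used.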
Schema Fingerprinting
Memory is keyed by a SHA-256 hash (first 12 chars) of the sorted column names and types. When the schema changes (columns added/removed/renamed), the fingerprint changes and the copilot starts fresh memory for the new schema:
import hashlib

def schema_fingerprint(store, table_name: str) -> str:
    meta = store.get_metadata(table_name)
    cols = sorted(
        (c["column_name"], c.get("column_type", ""))
        for c in meta.get("columns", [])
    )
    return hashlib.sha256(str(cols).encode()).hexdigest()[:12]
SQL Error Recovery
When a SQL query fails, the copilot attempts to automatically fix it using heuristic error parsing. The _parse_sql_error_hint() function extracts structured hints from DuckDB error messages, and _attempt_sql_fix() applies corrections.
Error Types Handled
| Error Type | Detection | Auto-Fix Strategy |
|---|---|---|
| Column Not Found | DuckDB "column not found" error | Fuzzy-match against available columns, substitute closest match |
| Type Mismatch | CAST errors, type conversion failures | Wrap with TRY_CAST() to gracefully handle NULLs |
| Table Not Found | "table not found" in error | Substitute with the correct table name from context |
| Syntax Error | DuckDB parser errors | Common fixes: missing quotes, wrong function names, dialect quirks |
Fix Description
After a successful auto-fix, _describe_sql_fix() generates a human-readable description that is both emitted as a thought and stored in cross-session memory for future reference:
# Example output from _describe_sql_fix():
"Column 'faitfulness_score' not found → replaced with 'faithfulness_score' (fuzzy match)"
"CAST error on 'latency' → wrapped with TRY_CAST to handle NULL values"
Request Pipeline (Detailed)
Here is the full data flow through the copilot's request pipeline, showing exactly which function handles each stage and what data flows between them.
Key Functions
| Function | Module | Responsibility |
|---|---|---|
| run_copilot_request() | request_router.py | Orchestrates full lifecycle: prep → context → execute → sanitize |
| prepare_request() | request_classifier.py | Input sanitization, skill selection, SQL example matching |
| compose_system_prompt() | tool_instructions.py | Assembles all prompt layers into final system instructions |
| _build_schema_context() | oai_agent.py | Generates DDL + stats + hints for the current schema |
| sanitize_output() | guardrails.py | Strips credentials, tracebacks, and sensitive patterns |
Streaming & SSE
The copilot streams real-time feedback to the frontend via Server-Sent Events. The ThoughtStream class provides an async pub/sub system that agent tools and hooks publish into, while the SSE endpoint subscribes and forwards events to the client.
SSE Event Types
| Event | Data | When Emitted |
|---|---|---|
| thought | Thought JSON (type, content, tool, color) | Throughout execution — reasoning, tool use, observations |
| response | Final answer + optional chart/download specs | After agent completes and output is sanitized |
| error | Error message string | On unrecoverable failure |
| done | Empty | Signals end of stream |
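Each event is serialized as a standard SSE frame: an `event:` line, a `data:` line, and a blank line. A sketch of the formatter the endpoint might use (the function name is illustrative):

```python
import json

def format_sse(event: str, data) -> str:
    """Render one Server-Sent Events frame: event name, data payload, blank line."""
    payload = data if isinstance(data, str) else json.dumps(data)
    return f"event: {event}\ndata: {payload}\n\n"
```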
ThoughtType Enum
Each thought has a type that determines its display color in the frontend.
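A sketch of such an enum. Only REASONING appears in the ThoughtStream excerpt below; the other member names are assumptions.

```python
from enum import Enum

class ThoughtType(Enum):
    # Illustrative members; only REASONING is confirmed by the excerpt
    REASONING = "reasoning"
    TOOL_USE = "tool_use"
    OBSERVATION = "observation"
    ERROR = "error"
```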
ThoughtStream Architecture
The stream uses an asyncio.Queue internally. Tools and hooks call emit_*() methods (e.g. emit_reasoning(), emit_tool_use()) which enqueue Thought dataclass instances. The SSE endpoint reads from the queue and yields formatted events:
class ThoughtStream:
    def __init__(self):
        self._queue: asyncio.Queue[Thought | None] = asyncio.Queue()

    async def emit_reasoning(self, content: str, node_name: str | None = None):
        await self._queue.put(Thought(
            type=ThoughtType.REASONING,
            content=content,
            node_name=node_name,
        ))

    async def subscribe(self):
        """Yield thoughts until the None sentinel is received."""
        while True:
            thought = await self._queue.get()
            if thought is None:
                break
            yield thought
Tracing
The copilot integrates with Langfuse for observability. Every request creates a trace span hierarchy that captures tool calls, SQL queries, timings, and errors.
Span Hierarchy
Each request opens a root trace span; tool calls and SQL executions are recorded as nested child spans carrying timings and errors.
SQL Fingerprinting
For privacy-safe tracing, SQL queries are fingerprinted before being sent to Langfuse. The sql_fingerprint() function replaces literal values with placeholders:
# Input: SELECT * FROM eval_data WHERE source_name = 'alpha_bot' AND score > 0.8
# Output: SELECT * FROM eval_data WHERE source_name = ? AND score > ?
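A minimal version of such literal masking, as a sketch; the real function may handle more cases (escaped quotes, IN lists, date literals).

```python
import re

def sql_fingerprint(sql: str) -> str:
    """Sketch: mask string and numeric literals so data values never leave the host."""
    masked = re.sub(r"'[^']*'", "?", sql)                # string literals -> ?
    masked = re.sub(r"\b\d+(?:\.\d+)?\b", "?", masked)   # numeric literals -> ?
    return masked
```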
Tracing is enabled by setting TRACING_MODE=langfuse plus the LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_BASE_URL environment variables. If unset, tracing resolves to a no-op.
Tool Reference
Complete reference for all 12 copilot tools with inputs, outputs, and behavior details.
| Tool | When to Use | Key Inputs | Output | Cache? |
|---|---|---|---|---|
| summarize_data | First interaction or "tell me about the data" | include_numeric_stats: bool | Schema, row count, filter values, numeric stats | Yes |
| query_data | Lookup specific records, find min/max, search by ID | filter_column, filter_value, find_min_column, find_max_column, search_text, limit | Matching records + total count | Yes |
| analyze_data | Aggregations, group-by, distributions | group_by, metric, aggregation, filters | Aggregated statistics | Yes |
| compare_data | Compare two slices (A vs B) on metrics | group_column, group_a, group_b, metrics | Side-by-side stats with deltas | Yes |
| query_kpi_data | Access business KPI table | filters, limit | KPI records | Yes |
| run_sql | Complex queries that structured tools cannot handle | sql: str | Query results (up to 6000 chars) | Yes |
| plot_data | User asks for a chart or visualization | chart_type, sql, x_column, y_column, title | Plotly chart spec (set on context) | No |
| analyze_patterns | Look for correlations, outliers, trends | analysis_type, columns, filters | Pattern analysis results | Yes |
| save_as_dataset | User wants to persist a filtered view | name, sql, description | Confirmation with table name | No |
| download_data | User wants to export data | sql, format, filename | Download spec (set on context) | No |
| describe_metric_signals | User asks about available metrics | (none) | Metric schema descriptions | Yes |
| recall_memory | Agent wants full memory context | (none) | All stored patterns, column freq, error fixes | No |
Configuration
The copilot is controlled by environment variables and YAML config files in the custom/config/ directory.
Environment Settings
| Setting | Default | Description |
|---|---|---|
| copilot_enabled | true | Enable/disable the copilot feature entirely |
| copilot_memory_enabled | true | Enable/disable cross-session memory extraction and injection |
| LLM_MODEL_NAME | gpt-4o-mini | OpenAI model used for the agent |
| OPENAI_API_KEY | — | API key for OpenAI (required) |
| TRACING_MODE | — | Set to langfuse to enable tracing |
| LANGFUSE_PUBLIC_KEY | — | Langfuse public key (when tracing enabled) |
| LANGFUSE_SECRET_KEY | — | Langfuse secret key (when tracing enabled) |
| LANGFUSE_BASE_URL | — | Langfuse API URL (when tracing enabled) |
YAML Config Files
| File | Location | Purpose |
|---|---|---|
schema_hints.yaml | custom/config/ | Column descriptions and usage hints injected into schema context |
metric_catalog.yaml | custom/config/ | Metric definitions, thresholds, and interpretations |
sql_examples/*.yaml | custom/config/ | Example Q&A pairs for in-context learning |
skills/*.md | custom/config/ | Domain-specific skill files (SKILL.md format) |
Run make setup to create the custom/ directory structure and copy .example templates. All config files are optional — the copilot works with auto-discovered schema alone, but adding hints and examples significantly improves answer quality.