Copilot Agent Architecture
Deep-dive into the OpenAI Agents SDK copilot — request lifecycle, tools, prompt composition, memory, and observability.
Architecture Overview
The AXIS Copilot is a conversational data analyst built on the OpenAI Agents SDK. It accepts natural language questions, translates them into DuckDB SQL queries, and returns formatted answers with optional charts. Every request flows through a layered pipeline of guardrails, context injection, tool execution, and output sanitization.
12 Tools
From SQL execution to pattern analysis, chart rendering, and dataset export — all callable by the agent.
3 Skills Max
Up to three domain-specific skill files are injected per request, selected by trigger-word matching.
6-Turn History
Conversation context is capped at the last 6 turns to stay within token limits while preserving continuity.
1000-Char Memory
Cross-session memory injection is capped at 1000 characters — compact SQL patterns and error fixes.
Request Lifecycle
Every copilot request passes through eight distinct stages. The pipeline is shared between the Pydantic-AI and OpenAI Agents SDK implementations via run_copilot_request() in request_router.py.
HTTP POST
The frontend sends a POST to /copilot/stream/oai with the user message, conversation history, dataset label, and optional agent filter.
Input Guardrails
sanitize_input() strips prompt-injection patterns, enforces the 2,000-character limit, and raises RequestBlocked if harmful content is detected.
Skill & Example Selection
prepare_request() runs trigger-word matching against SKILL.md files (cap: 3) and selects relevant SQL examples for the agent/message combination.
Context Build
The framework-specific context factory constructs OAIContext (or CopilotDeps), injecting the ThoughtStream, dataset label, data context, skills, and memory.
Schema Injection
_build_schema_context() generates DDL, computes column stats, injects schema hints, metric catalog, memory, and monitoring context into the system prompt.
Agent Execution
The OpenAI Agents SDK Runner.run() enters its tool loop — the LLM decides which tools to call, receives results, and iterates until it has a final answer.
Output Sanitization
sanitize_output() strips credentials, traceback lines, and sensitive patterns from the response before it reaches the client.
SSE Streaming
Thoughts (reasoning, tool usage, observations) stream in real-time via SSE events. The final response and optional chart/download specs are sent as the closing events.
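The stages above can be condensed into a small sketch. Everything below is illustrative rather than the actual implementation: `run_pipeline` and the injection regex are assumptions; only the 2,000-character limit and the `RequestBlocked` exception come from the pipeline described above.

```python
import re

MAX_INPUT_CHARS = 2000

class RequestBlocked(Exception):
    """Raised when the input guardrail rejects a message."""

def sanitize_input(message: str) -> str:
    # Enforce the 2,000-character limit and strip one simple injection pattern
    if len(message) > MAX_INPUT_CHARS:
        raise RequestBlocked("message exceeds 2,000 characters")
    return re.sub(r"(?i)ignore (all )?previous instructions", "", message).strip()

def run_pipeline(message: str) -> dict:
    # Stages condensed: guardrail -> (prepare/context/agent execution elided)
    clean = sanitize_input(message)
    return {"response": clean}
```

The real pipeline threads the sanitized message through skill selection and context construction before the agent runs; this sketch only shows the guardrail boundary.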
Agent Core (OAICopilotAgent)
The copilot is implemented as a module in app/copilot/oai_agent.py using the OpenAI Agents SDK. The agent is created once per process and reused across requests. Dynamic instructions are injected per-request via a callable.
OAIContext Fields
Both agent implementations share a base context (BaseCopilotContext) with these fields:
| Field | Type | Purpose |
|---|---|---|
| thought_stream | ThoughtStream | Async pub/sub for real-time thought streaming to client |
| dataset_label | str | Which dataset to query (e.g. "evaluation", "monitoring") |
| data_context | dict | Optional schema hints passed from the frontend |
| _cache | dict | Per-request tool result cache (keyed by tool+params+row count) |
| chart_spec | dict \| None | Plotly chart specification set by plot_data tool |
| download_spec | dict \| None | Download metadata set by download_data tool |
| skills_injection | str | Rendered skill bodies injected into system prompt |
| last_sql | str | Most recent SQL query (for error recovery context) |
| sql_examples_injection | str | Matched Q→SQL examples for in-context learning |
| agent_name | str \| None | Agent/source filter applied to all queries |
| sql_executed_this_turn | bool | Tracks whether SQL ran this turn (for memory extraction) |
| error_fixes | list[dict] | Error-to-fix pairs collected during SQL retries |
Agent Creation Pattern
agent = Agent(
    name="axis-copilot",
    instructions=lambda ctx, _agent: compose_system_prompt(ctx.context),
    tools=[summarize_data, query_data, run_sql, analyze_data,
           compare_data, plot_data, analyze_patterns, ...],
    hooks=CopilotRunHooks(),
    model=LLMProvider.get_model(),
)
The `instructions` parameter accepts a callable, so context is injected fresh for every request without recreating the agent.
Tool System
The copilot has 12 tools registered via the @function_tool decorator. Each tool follows a consistent pattern: cache check, guardrail validation, DuckDB execution, thought emission, and result truncation.
| Tool | Category | Description | Key Behavior |
|---|---|---|---|
| summarize_data | Analysis | Dataset overview: schema, row count, filter values, numeric stats | Cached per row-count |
| query_data | SQL | Filter, search, min/max lookups with fuzzy column matching | Cap: 50 records |
| analyze_data | Analysis | Statistical aggregations: group-by, averages, distributions | Auto numeric detection |
| compare_data | Analysis | Compare two groups on numeric metrics with delta computation | Paired column stats |
| query_kpi_data | SQL | Query the KPI table for business metrics | Separate table access |
| run_sql | SQL | Execute arbitrary SELECT on DuckDB with 3-attempt auto-fix | Guardrail + retry loop |
| plot_data | Output | Generate Plotly chart specs (bar, line, scatter, pie, histogram) | Sets chart_spec on context |
| analyze_patterns | Analysis | Discover patterns, correlations, outliers in the dataset | Multi-query analysis |
| save_as_dataset | Output | Persist a SQL result as a new named dataset in DuckDB | Creates new table |
| download_data | Output | Prepare CSV/JSON download of query results | Sets download_spec |
| describe_metric_signals | Meta | Describe available metrics and their signal schemas | Schema introspection |
| recall_memory | Meta | Retrieve full cross-session memory for the current schema | Returns all stored patterns |
Tool Execution Pattern
Every tool uses the tool_span context manager for consistent caching, tracing, and thought emission:
@function_tool
async def run_sql(ctx: RunContextWrapper[OAIContext], sql: str) -> str:
    deps = ctx.context
    async with tool_span(
        deps, "run_sql", sql,
        "Executing SQL query...",
        {"sql": sql_fingerprint(sql)},
    ) as (_tracer, _span, cached):
        if cached:
            return cached
        # Guardrail check
        blocked = check_sql_safety(sql)
        if blocked:
            return json.dumps({"error": blocked})
        # Execute with retry loop (up to 3 attempts)
        result = await _execute_with_retry(deps, sql)
        deps.set_cached("run_sql", sql, result)
        return result
SQL Retry Loop
The run_sql tool implements a 3-attempt auto-fix loop for SQL errors: each failure is parsed into a structured hint, a heuristic correction is applied, and the corrected query is re-executed.
Prompt Composition
The system prompt is assembled dynamically per request by compose_system_prompt(). It stacks multiple layers, each contributing domain knowledge, rules, and context.
Prompt Size Budget
- Static layers (1-4): ~2,500 tokens — constant overhead
- Schema DDL: scales with column count (~10 tokens/column)
- Memory injection: hard-capped at 1,000 characters
- Skills: typically 200-500 tokens per skill, max 3
- SQL examples: 2-5 matched examples, ~100 tokens each
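Under those budget assumptions, the layering can be sketched as follows. The `compose_prompt` name, the static strings, and the exact layer list are illustrative; the real `compose_system_prompt()` includes more sections, but the pattern of "static layers first, empty dynamic layers skipped, memory hard-capped at 1,000 characters" is the same.

```python
from types import SimpleNamespace

STATIC_ROLE = "You are AXIS Copilot, a conversational data analyst."  # layers 1-4 condensed
STATIC_RULES = "Only emit read-only SELECT statements against DuckDB."

def compose_prompt(ctx) -> str:
    """Sketch: static layers first, then the dynamic per-request sections."""
    layers = [
        STATIC_ROLE,
        STATIC_RULES,
        ctx.schema_context,                 # DDL + column stats
        ctx.memory_injection[:1000],        # hard cap: 1,000 characters
        ctx.skills_injection,               # up to 3 matched skills
        ctx.sql_examples_injection,         # matched Q->SQL examples
    ]
    return "\n\n".join(layer for layer in layers if layer)
```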
Schema Context Building
The schema context is the richest dynamic section of the prompt. It gives the LLM everything it needs to write correct SQL queries against the current dataset.
Get Metadata
store.get_metadata(table) returns column names, types, row count, time range, and filter values from the DuckDB catalog.
Compute Column Stats
For numeric columns, compute AVG, MIN, MAX (up to 8 columns). For categorical columns, sample distinct values (up to 15).
Build DDL
Generates a DuckDB CREATE TABLE DDL string with column names and types, plus inline comments for ranges and sample values.
Inject Extras
Appends schema hints (YAML), metric catalog, cross-session memory, monitoring context, agent filter, last SQL query, and SQL examples.
DDL Format Example
-- DuckDB table: eval_data (12,847 rows)
CREATE TABLE eval_data (
dataset_id VARCHAR, -- unique record identifier
query VARCHAR, -- input prompt
actual_output VARCHAR, -- LLM response text
faithfulness_score DOUBLE, -- range: [0.12 .. 0.99], avg: 0.74
relevance_score DOUBLE, -- range: [0.05 .. 1.00], avg: 0.81
environment VARCHAR, -- values: production, staging, dev
source_name VARCHAR, -- values: alpha_bot, beta_bot
timestamp TIMESTAMP
);
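A helper that renders this annotated DDL from catalog metadata might look like the sketch below. The metadata-dict field names (`name`, `type`, `min`, `max`, `avg`, `values`) are assumptions, not the actual `store.get_metadata()` shape.

```python
def build_ddl(table: str, row_count: int, columns: list[dict]) -> str:
    """Sketch: render an annotated CREATE TABLE block from catalog metadata."""
    lines = [
        f"-- DuckDB table: {table} ({row_count:,} rows)",
        f"CREATE TABLE {table} (",
    ]
    for i, col in enumerate(columns):
        sep = "," if i < len(columns) - 1 else ""  # no comma after last column
        comment = ""
        if "min" in col:
            comment = f'  -- range: [{col["min"]} .. {col["max"]}], avg: {col["avg"]}'
        elif "values" in col:
            comment = f'  -- values: {", ".join(col["values"])}'
        lines.append(f'    {col["name"]} {col["type"]}{sep}{comment}')
    lines.append(");")
    return "\n".join(lines)
```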
Injection Sources
| Source | Origin | When Injected |
|---|---|---|
| Schema DDL | DuckDB catalog metadata | Always |
| Schema Hints | custom/config/schema_hints.yaml | If file exists |
| Metric Catalog | custom/config/metric_catalog.yaml | If file exists |
| Cross-Session Memory | DuckDB KV store | If copilot_memory_enabled |
| Monitoring Context | Monitoring metadata cache | If monitoring dataset active |
| Agent Filter | Request parameter | If agent_name provided |
| Last SQL | Previous turn context | If available |
| SQL Examples | custom/config/sql_examples/ | If matched by trigger words |
Skill System
Skills are domain-specific instruction files that augment the copilot's capabilities for particular types of questions. They are written as Markdown files with YAML frontmatter.
SKILL.md Format
---
name: trend-analysis
description: Analyze time-series trends and detect anomalies
version: "1.0"
priority: 10
triggers:
- trend
- over time
- time series
- anomaly
- spike
- regression
---
## Trend Analysis Skill
When the user asks about trends over time, follow these steps:
1. Use `run_sql` to query with GROUP BY time bucket
2. Look for sustained direction changes (3+ consecutive periods)
3. Flag anomalies where values exceed 2 standard deviations
...
Trigger Matching
Skills are selected by word-boundary matching against the user message and recent conversation context. The matching is scored and capped at 3 skills per request:
- Word boundary: triggers match as whole words (not substrings)
- Scored: more trigger matches = higher priority
- Cap: maximum 3 skills injected per request
- Rendered: selected skills are wrapped in XML blocks for clear delineation
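The four rules above can be sketched in a few lines; the skill-dict shape is an assumption, and the real registry also weighs recent conversation context.

```python
import re

def select_skills(message: str, skills: list[dict], cap: int = 3) -> list[dict]:
    """Sketch: score each skill by whole-word trigger hits, keep the top `cap`."""
    text = message.lower()
    scored = []
    for skill in skills:
        hits = sum(
            1 for trig in skill["triggers"]
            if re.search(rf"\b{re.escape(trig.lower())}\b", text)  # word boundary
        )
        if hits:
            scored.append((hits, skill))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # more hits = higher priority
    return [skill for _, skill in scored[:cap]]
```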
Skill Discovery
The skill registry loads from two locations, with custom overriding builtin:
- Builtin: app/copilot/skills/builtin/ — shipped with AXIS
- Custom: custom/config/skills/ — site-specific overrides (by name)

If a custom skill shares a name with a builtin skill, the custom version takes precedence. This allows deployers to override default behavior without modifying the codebase.
Cross-Session Memory
The copilot learns from past interactions. After each successful SQL-executing request, patterns are extracted and stored in DuckDB for future injection. Memory is keyed by schema fingerprint so different datasets maintain separate memory stores.
Data Model
| Category | What's Stored | Max Items | Example |
|---|---|---|---|
| SQL Patterns | Successful query templates (sanitized) | 10 | SELECT AVG("?") FROM ? WHERE "?" = '?' |
| Column Frequency | Which columns are queried most often | 15 | faithfulness_score: 12 hits |
| Error Fixes | Error-message → fix-description pairs | 5 | "Column not found" → "Used fuzzy match to find similar column" |
Decay & Eviction
| Mechanism | Value | Effect |
|---|---|---|
| Decay factor | 0.9 | Existing hit counts multiplied by 0.9 before adding new hits |
| Minimum hits | 0.5 | Entries below this threshold are evicted during decay |
| Max injection chars | 1000 | Compact injection is truncated at this limit |
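The decay pass from the table above, as a sketch:

```python
DECAY = 0.9       # existing hit counts shrink by 10% each pass
MIN_HITS = 0.5    # entries below this threshold are evicted

def decay_and_evict(hit_counts: dict[str, float]) -> dict[str, float]:
    """Multiply every hit count by the decay factor, dropping faded entries."""
    return {
        key: count * DECAY
        for key, count in hit_counts.items()
        if count * DECAY >= MIN_HITS
    }
```

An entry that is never reinforced falls below 0.5 after a few passes and disappears, so memory stays biased toward patterns that keep getting used.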
Schema Fingerprinting
Memory is keyed by a SHA-256 hash (first 12 chars) of the sorted column names and types. When the schema changes (columns added/removed/renamed), the fingerprint changes and the copilot starts fresh memory for the new schema:
import hashlib

def schema_fingerprint(store, table_name: str) -> str:
    meta = store.get_metadata(table_name)
    cols = sorted(
        (c["column_name"], c.get("column_type", ""))
        for c in meta.get("columns", [])
    )
    return hashlib.sha256(str(cols).encode()).hexdigest()[:12]
SQL Error Recovery
When a SQL query fails, the copilot attempts to automatically fix it using heuristic error parsing. The _parse_sql_error_hint() function extracts structured hints from DuckDB error messages, and _attempt_sql_fix() applies corrections.
Error Types Handled
| Error Type | Detection | Auto-Fix Strategy |
|---|---|---|
| Column Not Found | DuckDB "column not found" error | Fuzzy-match against available columns, substitute closest match |
| Type Mismatch | CAST errors, type conversion failures | Wrap with TRY_CAST() to gracefully handle NULLs |
| Table Not Found | "table not found" in error | Substitute with the correct table name from context |
| Syntax Error | DuckDB parser errors | Common fixes: missing quotes, wrong function names, dialect quirks |
Fix Description
After a successful auto-fix, _describe_sql_fix() generates a human-readable description that is both emitted as a thought and stored in cross-session memory for future reference:
# Example output from _describe_sql_fix():
"Column 'faitfulness_score' not found → replaced with 'faithfulness_score' (fuzzy match)"
"CAST error on 'latency' → wrapped with TRY_CAST to handle NULL values"
Request Pipeline (Detailed)
Here is the full data flow through the copilot's request pipeline, showing exactly which function handles each stage and what data flows between them.
Key Functions
| Function | Module | Responsibility |
|---|---|---|
| run_copilot_request() | request_router.py | Orchestrates full lifecycle: prep → context → execute → sanitize |
| prepare_request() | request_classifier.py | Input sanitization, skill selection, SQL example matching |
| compose_system_prompt() | tool_instructions.py | Assembles all prompt layers into final system instructions |
| _build_schema_context() | oai_agent.py | Generates DDL + stats + hints for the current schema |
| sanitize_output() | guardrails.py | Strips credentials, tracebacks, and sensitive patterns |
Streaming & SSE
The copilot streams real-time feedback to the frontend via Server-Sent Events. The ThoughtStream class provides an async pub/sub system that agent tools and hooks publish into, while the SSE endpoint subscribes and forwards events to the client.
SSE Event Types
| Event | Data | When Emitted |
|---|---|---|
| thought | Thought JSON (type, content, tool, color) | Throughout execution — reasoning, tool use, observations |
| response | Final answer + optional chart/download specs | After agent completes and output is sanitized |
| error | Error message string | On unrecoverable failure |
| done | Empty | Signals end of stream |
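Each event is serialized as a standard SSE frame: an `event:` line, a `data:` line, and a blank line. A sketch of the formatter the endpoint might use (the function name is illustrative):

```python
import json

def format_sse(event: str, data) -> str:
    """Render one Server-Sent Events frame: event name, data payload, blank line."""
    payload = data if isinstance(data, str) else json.dumps(data)
    return f"event: {event}\ndata: {payload}\n\n"
```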
ThoughtType Enum
Each thought has a type that determines its display color in the frontend.
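A sketch of such an enum. Only REASONING appears in the ThoughtStream excerpt below; the other member names are assumptions.

```python
from enum import Enum

class ThoughtType(Enum):
    # Illustrative members; only REASONING is confirmed by the excerpt
    REASONING = "reasoning"
    TOOL_USE = "tool_use"
    OBSERVATION = "observation"
    ERROR = "error"
```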
ThoughtStream Architecture
The stream uses an asyncio.Queue internally. Tools and hooks call emit_*() methods (e.g. emit_reasoning(), emit_tool_use()) which enqueue Thought dataclass instances. The SSE endpoint reads from the queue and yields formatted events:
class ThoughtStream:
    def __init__(self):
        self._queue: asyncio.Queue[Thought | None] = asyncio.Queue()

    async def emit_reasoning(self, content: str, node_name: str | None = None):
        await self._queue.put(Thought(
            type=ThoughtType.REASONING,
            content=content,
            node_name=node_name,
        ))

    async def subscribe(self):
        """Yield thoughts until the None sentinel is received."""
        while True:
            thought = await self._queue.get()
            if thought is None:
                break
            yield thought
Tracing
The copilot integrates with Langfuse for observability. Every request creates a trace span hierarchy that captures tool calls, SQL queries, timings, and errors.
Span Hierarchy
Each request opens a root trace span; tool calls and SQL executions are recorded as nested child spans carrying timings and errors.
SQL Fingerprinting
For privacy-safe tracing, SQL queries are fingerprinted before being sent to Langfuse. The sql_fingerprint() function replaces literal values with placeholders:
# Input: SELECT * FROM eval_data WHERE source_name = 'alpha_bot' AND score > 0.8
# Output: SELECT * FROM eval_data WHERE source_name = ? AND score > ?
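A minimal version of such literal masking, as a sketch; the real function may handle more cases (escaped quotes, IN lists, date literals).

```python
import re

def sql_fingerprint(sql: str) -> str:
    """Sketch: mask string and numeric literals so data values never leave the host."""
    masked = re.sub(r"'[^']*'", "?", sql)                # string literals -> ?
    masked = re.sub(r"\b\d+(?:\.\d+)?\b", "?", masked)   # numeric literals -> ?
    return masked
```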
Tracing is enabled by setting TRACING_MODE=langfuse plus the LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_BASE_URL environment variables. If unset, tracing resolves to a no-op.
Tool Reference
Complete reference for all 12 copilot tools with inputs, outputs, and behavior details.
| Tool | When to Use | Key Inputs | Output | Cache? |
|---|---|---|---|---|
| summarize_data | First interaction or "tell me about the data" | include_numeric_stats: bool | Schema, row count, filter values, numeric stats | Yes |
| query_data | Lookup specific records, find min/max, search by ID | filter_column, filter_value, find_min_column, find_max_column, search_text, limit | Matching records + total count | Yes |
| analyze_data | Aggregations, group-by, distributions | group_by, metric, aggregation, filters | Aggregated statistics | Yes |
| compare_data | Compare two slices (A vs B) on metrics | group_column, group_a, group_b, metrics | Side-by-side stats with deltas | Yes |
| query_kpi_data | Access business KPI table | filters, limit | KPI records | Yes |
| run_sql | Complex queries that structured tools cannot handle | sql: str | Query results (up to 6000 chars) | Yes |
| plot_data | User asks for a chart or visualization | chart_type, sql, x_column, y_column, title | Plotly chart spec (set on context) | No |
| analyze_patterns | Look for correlations, outliers, trends | analysis_type, columns, filters | Pattern analysis results | Yes |
| save_as_dataset | User wants to persist a filtered view | name, sql, description | Confirmation with table name | No |
| download_data | User wants to export data | sql, format, filename | Download spec (set on context) | No |
| describe_metric_signals | User asks about available metrics | (none) | Metric schema descriptions | Yes |
| recall_memory | Agent wants full memory context | (none) | All stored patterns, column freq, error fixes | No |
Configuration
The copilot is controlled by environment variables and YAML config files in the custom/config/ directory.
Environment Settings
| Setting | Default | Description |
|---|---|---|
| copilot_enabled | true | Enable/disable the copilot feature entirely |
| copilot_memory_enabled | true | Enable/disable cross-session memory extraction and injection |
| LLM_MODEL_NAME | gpt-4o-mini | OpenAI model used for the agent |
| OPENAI_API_KEY | — | API key for OpenAI (required) |
| TRACING_MODE | — | Set to langfuse to enable tracing |
| LANGFUSE_PUBLIC_KEY | — | Langfuse public key (when tracing enabled) |
| LANGFUSE_SECRET_KEY | — | Langfuse secret key (when tracing enabled) |
| LANGFUSE_BASE_URL | — | Langfuse API URL (when tracing enabled) |
YAML Config Files
| File | Location | Purpose |
|---|---|---|
schema_hints.yaml | custom/config/ | Column descriptions and usage hints injected into schema context |
metric_catalog.yaml | custom/config/ | Metric definitions, thresholds, and interpretations |
sql_examples/*.yaml | custom/config/ | Example Q&A pairs for in-context learning |
skills/*.md | custom/config/ | Domain-specific skill files (SKILL.md format) |
Run make setup to create the custom/ directory structure and copy .example templates. All config files are optional — the copilot works with auto-discovered schema alone, but adding hints and examples significantly improves answer quality.