DuckDB Configuration¶
The DuckDB embedded analytics store is configured via custom/config/duckdb.yaml. This file controls sync behavior, storage location, query concurrency, and parallel read settings.
Setup¶
Or use the project-level setup command which copies all .example templates:
Config File¶
duckdb:
# Enable/disable DuckDB analytics store
enabled: true
# Path to the DuckDB database file (relative to backend directory)
path: "data/local_store.duckdb"
# Single global startup sync switch:
# - startup: run startup sync
# - manual: no startup sync (use /api/store/sync)
sync_mode: "startup"
# Number of rows per chunk during Postgres -> DuckDB streaming
sync_chunk_size: 10000
# Safety cap: stop sync if this many rows are read (logs warning)
# Primary volume control should be via YAML query WHERE clauses
max_sync_rows: 2000000
# Max concurrent DuckDB read queries (thread pool limit)
query_concurrency: 8
# Parallel readers per dataset sync (for COPY-based reads)
# Requires partition_column in the database config
sync_workers: 4
Field Reference¶
| Field | Type | Default | Description |
|---|---|---|---|
enabled |
bool |
true |
Master switch. Set to false to disable DuckDB entirely |
path |
str |
data/local_store.duckdb |
Path to the DuckDB file, relative to the backend/ directory |
sync_mode |
str |
"startup" |
Global startup behavior: "startup" or "manual" |
sync_chunk_size |
int |
10000 |
Rows per chunk during sync. Larger values use more memory but sync faster |
max_sync_rows |
int |
2000000 |
Safety cap. If hit, sync completes with available rows and sets truncated: true |
query_concurrency |
int |
8 |
Maximum concurrent DuckDB read queries via anyio.CapacityLimiter |
sync_workers |
int |
1 |
Number of parallel readers per dataset sync. Requires partition_column in the database config to take effect |
How Sync Works¶
DuckDB acts as a local cache of your PostgreSQL data. All datasets use a split sync pattern:
- Each database config provides two SQL queries:
dataset_query(records) andresults_query(metrics) - Both queries are read concurrently from Postgres using the configured read strategy
- Each half is written to its own DuckDB internal table (e.g.,
monitoring_dataset,monitoring_results) - A DuckDB JOIN view is created:
CREATE VIEW monitoring_data AS SELECT ... FROM results JOIN dataset ON dataset_id - Metadata (columns, filter values, time range, summary stats) is computed and cached
Read Strategy (tiered fallback)¶
The sync engine selects the fastest available read method:
- Parallel COPY —
sync_workersconcurrentCOPY TOcommands, each reading a partition range viapartition_column. Writes CSV to temp files, loads into DuckDB viaread_csv_auto. Best for large tables. - Single COPY — One
COPY TOfor the whole query. Used whenpartition_columnorsync_workersis not set. - Sequential chunked — asyncpg cursor reading
sync_chunk_sizerows per fetch. Fallback when the database backend does not support COPY.
Incremental Sync¶
When incremental_column is configured in the database YAML (e.g., created_at), the sync engine can skip unchanged data:
- Reads stored watermarks (MAX value of
incremental_columnfrom previous sync) - Wraps each query with
WHERE {column} > '{watermark}' - Appends new rows to existing DuckDB tables (INSERT INTO, no staging swap)
- Updates watermarks to the new MAX value
Incremental mode is used automatically when watermarks are available. On failure, watermarks are cleared and the next sync does a full rebuild.
Periodic Scheduler¶
Datasets with refresh_interval_minutes > 0 are automatically synced on a timer:
monitoring_db:
refresh_interval_minutes: 15 # sync every 15 minutes
incremental_column: created_at # use incremental mode
The scheduler starts after the startup sync completes and runs until shutdown.
Startup Sync Eligibility¶
On startup (if sync_mode: "startup"), each dataset is synced if:
enabled: truein the database configdataset_queryandresults_queryare both configuredauto_load/auto_connectis enabled in its YAML config
Data volume control
The max_sync_rows field is a safety net, not the primary volume control. Use the query fields in your database YAML configs to control data volume:
monitoring_db:
results_query: |
SELECT ...
FROM metric_results
WHERE created_at > NOW() - INTERVAL '7 days'
The WHERE clause is the recommended way to limit how much data is synced.
Manual Sync¶
You can trigger a sync manually via the API:
# Sync all configured datasets
curl -X POST http://localhost:8500/api/store/sync
# Sync a single dataset
curl -X POST http://localhost:8500/api/store/sync/monitoring
# Force full rebuild (ignore watermarks)
curl -X POST "http://localhost:8500/api/store/sync/monitoring?full=true"
# Reset watermarks (next sync will do full rebuild)
curl -X POST http://localhost:8500/api/store/sync/monitoring/reset-watermark
# Check sync status (includes watermarks and refresh intervals)
curl http://localhost:8500/api/store/status
Storage Location¶
The DuckDB file is created at the configured path (default: backend/data/local_store.duckdb). The parent directory is created automatically if it doesn't exist.
Docker deployments
Mount a persistent volume at backend/data/ to preserve the DuckDB file across container restarts. This avoids re-syncing on every deployment.
Disabling DuckDB¶
Set enabled: false to disable the DuckDB store entirely. The backend will still function -- monitoring and analytics endpoints will fall back to processing data from POST request bodies (the pre-DuckDB behavior).
Database Config: Split Queries¶
Each database YAML now uses split queries instead of a single query field:
monitoring_db:
enabled: true
auto_load: true
url: "postgresql://user:pass@host:5432/db"
# Split queries (required for sync)
dataset_query: |
SELECT
id AS dataset_id,
input AS query,
output AS actual_output,
created_at AS timestamp,
environment,
source_name,
trace_id,
latency
FROM traces
results_query: |
SELECT
trace_id AS dataset_id,
metric_name,
score AS metric_score,
metric_category,
explanation
FROM metric_results
# Performance tuning (optional)
partition_column: "id" # Column to split parallel COPY reads
incremental_column: "created_at" # Column for watermark-based incremental sync
refresh_interval_minutes: 15 # Periodic sync interval (0 = disabled)
query_timeout: 120 # Max seconds per query
New Database Config Fields¶
| Field | Type | Default | Description |
|---|---|---|---|
dataset_query |
str |
-- | SQL query for the records/dataset table. Must include dataset_id |
results_query |
str |
-- | SQL query for the metrics/results table. Must include dataset_id |
partition_column |
str |
null |
Column to partition parallel COPY reads on (integer or timestamp) |
incremental_column |
str |
null |
Column used as watermark for incremental sync (e.g., created_at) |
refresh_interval_minutes |
int |
0 |
Periodic sync interval in minutes. 0 = disabled |
Both dataset_query and results_query must be configured for sync to work. They must share a dataset_id column for the JOIN view.
Related¶
- YAML Configs -- Database connection configs that define what data is synced
- DuckDB Architecture -- Technical deep-dive into the sync engine and store
- Data Sources -- CSV upload vs. Postgres auto-load setup guide
- DuckDB Sync Runbook -- Production sync patterns