DuckDB Sync Runbook¶
This runbook covers production-friendly patterns for loading data from PostgreSQL into DuckDB:
- Configure split queries (dataset + results)
- Run an initial one-time backfill
- Enable incremental sync with periodic scheduling
- Manage watermarks and force rebuilds when needed
Recommended Pattern¶
- Keep PostgreSQL as source of truth
- Treat DuckDB as an analytics cache
- Use split queries (
dataset_query+results_query) for each data source - Enable incremental sync via
incremental_columnfor low-latency refreshes - Use the periodic scheduler (
refresh_interval_minutes) for automatic sync - Reserve
?full=truefor schema changes or data corrections
1) Configure Split Queries¶
Each database config now uses two queries instead of one. The sync engine reads both concurrently and joins them as a DuckDB view.
monitoring_db:
enabled: true
auto_load: true
url: "postgresql://axis_reader:${DB_PASSWORD}@prod-db.internal:5432/observability"
dataset_query: |
SELECT
trace_id AS dataset_id,
timestamp,
prompt AS query,
completion AS actual_output,
deployment_env AS environment,
service_name AS source_name,
component_name AS source_component,
response_time_ms AS latency
FROM llm_traces
ORDER BY timestamp DESC
results_query: |
SELECT
trace_id AS dataset_id,
metric_name,
metric_value AS metric_score,
metric_category,
explanation
FROM llm_metric_results
ORDER BY created_at DESC
# Performance tuning
partition_column: "id" # Enables parallel COPY reads
incremental_column: "created_at" # Enables watermark-based incremental sync
refresh_interval_minutes: 15 # Auto-sync every 15 minutes
query_timeout: 120
Both queries must include dataset_id
The dataset_query and results_query are joined on dataset_id in DuckDB. Make sure both queries alias or include a column named dataset_id.
2) Run Initial One-Time Backfill¶
Trigger the first sync to populate DuckDB:
Poll status:
Wait for monitoring_data.state to become ready. The first sync always does a full rebuild (staging + atomic swap) since no watermarks exist yet.
3) Enable Incremental Sync¶
After the initial backfill, subsequent syncs use incremental mode automatically when incremental_column is configured:
- The sync engine reads the stored watermark (MAX value of
incremental_columnfrom the previous sync) - Each query is wrapped with
WHERE {column} > '{watermark}' - New rows are appended to existing DuckDB tables (INSERT INTO, no staging swap)
- Watermarks are updated to the new MAX value
Verify Incremental Mode¶
Check the status endpoint to see watermarks and sync type:
Expected output:
{
"state": "ready",
"rows": 150000,
"sync_type": "incremental",
"last_incremental": "2026-02-13T10:15:00+00:00",
"incremental_rows": 1200,
"watermarks": {
"monitoring_dataset": "2026-02-13T10:00:00",
"monitoring_results": "2026-02-13T10:00:00"
},
"refresh_interval_minutes": 15,
"incremental_column": "created_at"
}
4) Periodic Sync (Built-in Scheduler)¶
If refresh_interval_minutes > 0 is set in the database config, the backend automatically schedules periodic syncs:
- The scheduler starts after the startup sync completes
- It sleeps until the next dataset is due
- Syncs all due datasets concurrently using incremental mode
- Repeats until the application shuts down
No external scheduler (Cloud Scheduler, cron) is needed for this pattern.
Recommended Production Config¶
monitoring_db:
refresh_interval_minutes: 15
incremental_column: created_at
duckdb:
sync_mode: "startup" # Initial full sync on boot
sync_workers: 4 # Parallel COPY readers
5) External Scheduling (Alternative)¶
If you prefer external scheduling over the built-in scheduler (e.g., for centralized job management), set refresh_interval_minutes: 0 and trigger syncs via API.
Cloud Scheduler (GCP)¶
gcloud scheduler jobs create http axis-monitoring-sync-daily \
--location=us-central1 \
--schedule="*/15 * * * *" \
--uri="https://YOUR_CLOUD_RUN_URL/api/store/sync/monitoring" \
--http-method=POST \
--oidc-service-account-email="scheduler-invoker@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
--oidc-token-audience="https://YOUR_CLOUD_RUN_URL" \
--time-zone="Etc/UTC"
Grant invoker permission:
gcloud run services add-iam-policy-binding YOUR_SERVICE_NAME \
--region=us-central1 \
--member="serviceAccount:scheduler-invoker@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/run.invoker"
6) Watermark Management¶
View Watermarks¶
Each dataset shows per-sub-table watermarks (e.g., monitoring_dataset, monitoring_results).
Reset Watermarks¶
Force the next sync to do a full rebuild by clearing watermarks:
Force Full Rebuild¶
Skip incremental mode for a single sync run:
When to Force a Full Rebuild¶
- After schema changes in the source database
- After data corrections or backfills in Postgres
- If incremental sync produces unexpected results
- When watermarks become stale (e.g., after a long outage)
Auto-recovery on failure
If an incremental sync fails, watermarks are automatically cleared. The next sync will do a full rebuild without manual intervention.
7) Operational Checks¶
After each sync run:
GET /api/store/statusto verifystate=ready- Check
sync_type— should be"incremental"for regular refreshes - Check
incremental_rows— should be > 0 if source data is changing - Alert on repeated
errorstate or timeouts
Troubleshooting¶
| Symptom | Cause | Fix |
|---|---|---|
sync_type: "full" every time |
No incremental_column or watermarks cleared |
Add incremental_column to DB config |
incremental_rows: 0 |
No new data since last sync | Expected if source is quiet |
state: "error" |
Query timeout or connection failure | Check query_timeout, source DB health |
truncated: true |
Hit max_sync_rows cap |
Use WHERE clauses to limit volume, or increase cap |
| Slow sync | Large tables without COPY | Add partition_column and increase sync_workers |
If the query still times out at 120s:
- Reduce the time window in your WHERE clause (e.g., 7 days instead of 30)
- Add indexes on source Postgres (
timestamp, join/filter columns) - Add
partition_columnto enable parallel COPY reads - Simplify joins or precompute a materialized view