Merged
2 changes: 2 additions & 0 deletions .gitignore
@@ -15,6 +15,8 @@ Cargo.lock.bak
!/sql/pg_ask--0.5.4--0.5.5.sql
!/sql/pg_ask--0.5.5--0.5.6.sql
!/sql/pg_ask--0.5.6--0.5.7.sql
!/sql/pg_ask--0.5.7--0.5.8.sql
!/sql/pg_ask--0.5.8--0.5.9.sql
/target-*

# Packaging build outputs (deb/rpm/apk staging + artefacts)
195 changes: 195 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,201 @@ treats internal Rust modules as private regardless of `pub` visibility.
Upgrade scripts ship as `sql/pg_ask--<from>--<to>.sql` and run
automatically under `ALTER EXTENSION pg_ask UPDATE`.

## [0.5.9] — 2026-06-08 — Async job queue (`ask.ask_async`)

Adds an asynchronous job queue (ADR-0018) so a question can be submitted
without blocking the calling backend on the LLM round-trip. `ask.ask_async()`
enqueues a row to `ask._jobs` and returns a job id immediately; a background
worker runs the agent loop in its own backend and writes the answer back.
Upgrade with `ALTER EXTENSION pg_ask UPDATE TO '0.5.9';`.

This is the only correct shape for async work in PostgreSQL: a backend is
single-threaded and SPI is not thread-safe, so "async" means handing the
work to a separate process (a background worker), never running it off-thread
in the caller. The synchronous `ask.ask()` path is unchanged.

### Added

- **`ask.ask_async(question[, kind])` → uuid**: enqueue a job and return its
id immediately (NULL when `pg_ask.jobs_enabled = off`, the default).
`kind` is `'ask'` (full agent loop) or `'sql'` (generate-only).
- **Result accessors** (owner-scoped; a missing job and another role's job
look the same to the caller, so "not found" and "unauthorized" collapse):
`ask.job_status(id)`, `ask.job_result(id)`, `ask.job_error(id)`,
`ask.cancel_job(id)`.
- **`ask.run_pending_jobs()`**: synchronous drain of up to
`pg_ask.jobs_batch` jobs in the current database, for installs without the
background worker (e.g. driven by `pg_cron`). Recovers orphans first.
- **`ask.prune_jobs('<interval>'[, batch])`**: batched retention of terminal
jobs; operator-only.
- **Background workers** (when loaded via `shared_preload_libraries`): a
launcher process discovers every pg_ask-enabled database and spawns one
dynamic per-database worker, which drains `ask._jobs` on a poll interval.
The launcher re-reconciles so that a `CREATE EXTENSION pg_ask` in a new
database gets a worker without a restart.
- **Durable state machine** (`ask._jobs.status`): pending → running →
done/failed, with retry (`pg_ask.jobs_max_attempts`) on transient failure
and orphan recovery (`pg_ask.jobs_orphan_timeout_ms`) so a crashed
worker's in-flight job is re-queued, never lost. Claims use
`FOR UPDATE SKIP LOCKED` so concurrent workers never collide.
- **GUCs**: `jobs_enabled` (off), `jobs_max_attempts` (3),
`jobs_orphan_timeout_ms` (300000 as first added; raised to 3600000 in the
review pass below), `jobs_batch` (10),
`jobs_poll_interval_ms` (5000).
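
As an illustration only (not the extension's code), the status transitions
above can be modelled in a few lines of Rust. The names `Status`, `Outcome`,
`after_attempt` and `recover_orphan` are hypothetical, and two details are
assumptions: that `attempts` counts the attempt that just ended, and that an
orphan is detected by how long its worker has been silent.

```rust
// Illustrative model only; every name here is hypothetical, not pg_ask's API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Pending,
    Running,
    Done,
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Outcome {
    Success,
    TransientError,
    PermanentError,
}

/// Status a `running` job moves to once an attempt ends.
/// Assumption: `attempts` includes the attempt that just ended.
fn after_attempt(outcome: Outcome, attempts: u32, max_attempts: u32) -> Status {
    match outcome {
        Outcome::Success => Status::Done,
        // A transient failure is retried until the attempt budget is spent.
        Outcome::TransientError if attempts < max_attempts => Status::Pending,
        _ => Status::Failed,
    }
}

/// Orphan recovery: a `running` job whose worker has been silent for longer
/// than the timeout goes back to `pending`; any other job is left alone.
fn recover_orphan(status: Status, silent_ms: u64, orphan_timeout_ms: u64) -> Status {
    if status == Status::Running && silent_ms > orphan_timeout_ms {
        Status::Pending
    } else {
        status
    }
}

fn main() {
    let max_attempts = 3; // same value as the documented jobs_max_attempts default
    println!("{:?}", after_attempt(Outcome::Success, 1, max_attempts));
    println!("{:?}", after_attempt(Outcome::TransientError, 1, max_attempts));
    println!("{:?}", after_attempt(Outcome::PermanentError, 1, max_attempts));
    println!("{:?}", recover_orphan(Status::Running, 4_000_000, 3_600_000));
}
```

The point of the model is that `pending` is reachable from `running` in two
ways (retry and orphan recovery), while `done` and `failed` are terminal.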

### Hardening (post-review)

- **Real per-job durability**: the worker now commits the `claim` (running)
transition in its OWN transaction before the agent loop, then commits
`complete`/`fail` in another. A crash mid-loop leaves a committed `running`
row that orphan recovery reclaims — previously claim+execute+complete shared
one transaction, so `running` was never visible and orphan recovery was
effectively dead code.
- **Worker respawn**: the launcher keeps each worker's handle and checks
liveness (`pid()`) every reconcile, respawning a worker that died. The
previous grow-only "spawned" set meant a single worker crash stalled that
database's queue until a full instance restart.
- **Poison-pill fairness**: the synchronous `run_pending_jobs` drain tracks
ids attempted in the pass, so a permanently-failing job that re-queues
itself can't monopolise the batch and starve other pending jobs.
- **Privilege tiers**: the worker-path helpers (`_job_claim`, `_job_complete`,
`_job_fail`, `_job_recover_orphans`) and `ask.run_pending_jobs()` are now
operator-only (revoked from PUBLIC) since they act on jobs regardless of
owner; only the owner-scoped `ask.ask_async` / `ask.job_*` / `ask.cancel_job`
stay public. The background worker connects as superuser and is unaffected.
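
The poison-pill fix can be pictured with a small in-memory sketch. This is
not the extension's code: the queue is a plain `VecDeque`, and `drain_pass`
and its parameters are hypothetical names. It only shows why remembering the
ids attempted in a pass stops a job that always re-queues itself from
filling the batch.

```rust
use std::collections::{HashSet, VecDeque};

/// One synchronous drain pass over an in-memory stand-in for the queue.
/// `run` returns true on success; a failed job is put back on the queue,
/// standing in for a retry that returns to `pending`.
/// Returns the ids attempted, in order.
fn drain_pass(queue: &mut VecDeque<u32>, batch: usize, run: impl Fn(u32) -> bool) -> Vec<u32> {
    let mut attempted: HashSet<u32> = HashSet::new();
    let mut order = Vec::new();
    while order.len() < batch {
        // Claim the oldest job that has not been attempted in this pass.
        let Some(pos) = queue.iter().position(|id| !attempted.contains(id)) else {
            break;
        };
        let id = queue.remove(pos).expect("position is in range");
        attempted.insert(id);
        order.push(id);
        if !run(id) {
            // Worst case: the failing job is again the oldest pending row.
            queue.push_front(id);
        }
    }
    order
}

fn main() {
    let mut queue: VecDeque<u32> = VecDeque::from(vec![1, 2, 3]);
    // Job 1 always fails; without the `attempted` set it would be claimed
    // three times and jobs 2 and 3 would never run in this pass.
    let order = drain_pass(&mut queue, 3, |id| id != 1);
    println!("attempted {:?}, still queued {:?}", order, queue);
}
```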

### Hardening (six-model review pass)

Findings from six parallel reviewer models, each empirically verified:

- **Privilege isolation**: the worker now runs each job's agent loop under
the role that ENQUEUED it (`SET LOCAL ROLE` via `set_config`), so async
SQL/tool execution has exactly the caller's privileges — not the worker's
superuser rights. `_job_claim` returns the owner; the role is reset before
the trusted state-machine writes.
- **Double-execution race closed**: `_job_complete` / `_job_fail` /
`_job_release` now require `worker_pid = pg_backend_pid()`, so a
slow-but-alive worker whose job was orphan-recovered and re-claimed by
another worker can no longer clobber the new attempt with a stale result.
- **dblink conninfo bug**: the launcher's DB-discovery probe used
`quote_ident` (produces `dbname="MyDb"`, which libpq reads as a DB named
`"MyDb"` with the quotes and fails) — switched to `quote_literal`
(`dbname='MyDb'`). Without this, databases with uppercase/special names
were silently never given a worker.
- **Launcher restart leak**: on restart the launcher's old dynamic workers
keep running; it now checks `pg_stat_activity` for an existing
`pg_ask worker: {db}` before spawning (no duplicate per restart) and
terminates its workers on clean shutdown.
- **`run_pending_jobs` honours `jobs_enabled`**: returns 0 immediately and
skips orphan recovery when the queue is disabled (matching the worker and
its own docstring).
- **Orphan timeout default raised 5 min → 1 hour** so a slow-but-legitimate
job (up to `max_iterations * http_total_timeout_ms`) isn't falsely
reclaimed.
- **Indexes**: `_jobs_pending_idx` / `_jobs_running_idx` now lead with `db`
(multi-database claim/recovery scans), and a new partial
`_jobs_terminal_idx (finished_at) WHERE status IN (done,failed,cancelled)`
serves `prune_jobs`.
- **RLS on `ask._jobs`**: a direct `SELECT` is scoped to the caller's own
rows (`_jobs_owner_select`), matching `_traces`; the bgworker (superuser)
bypasses it.
- **Poison-pill drain**: a re-claimed retry is now `_job_release`d back to
`pending` instead of being left stuck in `running` until orphan recovery.
- **FIFO tie-break**: `_job_claim` orders by `(ts, id)` for deterministic
ordering of same-timestamp jobs.
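
The `worker_pid` check behind the double-execution fix is a fencing rule: a
completion is accepted only from the worker that currently holds the claim.
A minimal sketch, assuming a hypothetical in-memory `Job` struct and
`complete` function rather than the real `_job_complete`:

```rust
/// In-memory stand-in for one `running` row; field names are illustrative.
#[derive(Debug)]
struct Job {
    worker_pid: u32,
    result: Option<String>,
}

/// Accept a completion only from the worker that currently holds the claim.
/// Returns true if the result was written.
fn complete(job: &mut Job, caller_pid: u32, answer: &str) -> bool {
    if job.worker_pid != caller_pid {
        return false; // stale worker: the job was recovered and re-claimed
    }
    job.result = Some(answer.to_string());
    true
}

fn main() {
    // Worker 100 claimed the job and stalled; orphan recovery re-queued it
    // and worker 200 now holds the claim.
    let mut job = Job { worker_pid: 200, result: None };
    let stale = complete(&mut job, 100, "old answer");
    let fresh = complete(&mut job, 200, "new answer");
    println!("stale={stale} fresh={fresh} result={:?}", job.result);
}
```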

### Notes

- A background worker cannot `LISTEN` (PostgreSQL restricts it to regular
client backends), so workers are poll-driven; the enqueue path still fires
`pg_notify('pg_ask_jobs', id)` for any external listener.
- SIGTERM is checked between jobs, so shutdown takes effect within at most
one job's runtime; a shutdown arriving mid-agent-loop still waits for that
job's LLM call to return or hit `http_total_timeout_ms` (a bgworker's
SIGTERM doesn't trip the agent loop's interrupt check). Keep
`http_total_timeout_ms` modest for a tight shutdown bound.
- Enable the worker by adding `pg_ask` to `shared_preload_libraries` and
setting `pg_ask.jobs_enabled = on`. Without the preload, async still works
via `ask.run_pending_jobs()`.
- The launcher's per-database worker is the clean-architecture seam:
claim/run/complete lives once in `src/jobs`, reused by both the worker and
the synchronous drain.

## [0.5.8] — 2026-06-08 — Event outbox production hardening (`ask.emit`)

Production-hardens the ADR-0017 event outbox without touching its
consumer-facing contract: the `pg_ask_events` channel, the `ask._outbox`
columns, the pending-row query (`WHERE processed_at IS NULL ORDER BY ts`),
and the `ask._outbox_emit(text,jsonb,text)` / `ask._outbox_mark_processed`
/ `ask.emit` signatures are all unchanged, so an existing
`LISTEN pg_ask_events` consumer keeps working across the upgrade. Upgrade
with `ALTER EXTENSION pg_ask UPDATE TO '0.5.8';`.

### Security

- **Bypass closed (`ask._outbox_emit`)**: validation, the `events_enabled`
switch, and the `pg_notify` wake-up now ALL live inside the SECURITY
DEFINER `ask._outbox_emit`, which is the single authority for writing an
event. Previously these lived only in the Rust `ask.emit` wrapper, so a
role with `EXECUTE` on the (PUBLIC-granted) helper could call it directly
to write a newline-laced event name, a multi-megabyte payload, or a row
while events were globally disabled — and without firing a NOTIFY. The
Rust layer is now pure defense-in-depth.

### Added

- **Input validation on `ask.emit`**, owned solely by `ask._outbox_emit`
(the single authority — the Rust layer does no size/charset checks, so
there is nothing to drift): event names must be non-empty, `<= 127` chars,
and match `[A-Za-z0-9][A-Za-z0-9._:-]*` (rejects whitespace / control
chars that could corrupt the durable log or a listener's routing).
`summary` is capped at 8192 **bytes** (`octet_length`, exact for
multi-byte text). Payload size is capped by the new
`pg_ask.events_max_payload_bytes` GUC (default 64 KiB; `0` disables),
measured as serialized-JSON bytes. Violations raise
`invalid_parameter_value` *before* any row is written.
- **Flood control** via two opt-in GUCs, both default `0` (off):
`pg_ask.events_max_per_minute` (per-`(emitter, event)` rate cap) and
`pg_ask.events_dedup_window_ms` (collapse identical
`(emitter, event, payload)` repeats, compared via `md5(payload::text)`).
Both are plain integers (e.g. `60000`), not unit-suffixed — the SQL writer
reads them via `current_setting()::int`.
A suppressed emit is a no-op that returns `NULL`: it never raises an error,
so a trigger's transaction is never rolled back. It does log a
`RAISE DEBUG` line, so operators can observe suppression under
`log_min_messages = debug1`. Checks run atomically inside
`ask._outbox_emit` under a transaction-scoped advisory lock to avoid a
check-then-insert race. The lock uses the two-argument
`pg_advisory_xact_lock(domain, key)` form (domain = `'pg_ask._outbox'`),
so it shares no key space with the int8 session lock and can never
collide across lock domains.
- **Indexes** for the new hot paths: `_outbox_rate_idx (emitter, event, ts)`
so the rate-limit / dedup checks never full-scan the outbox on emit, and
the partial `_outbox_processed_idx (processed_at) WHERE processed_at IS
NOT NULL` so retention prunes hit an index instead of a seq scan.
- **Retention**: `ask.prune_events('<interval>'[, batch_size])` (e.g.
`'30 days'`) plus the SECURITY DEFINER `ask._outbox_prune(interval, int)`
helper. Deletes only already-delivered rows (`processed_at IS NOT NULL`)
older than the interval, **in batches** (default 10000; `0` = single
DELETE) to bound each DELETE's per-statement memory, lock acquisition,
and dead-tuple churn. Note: as a plpgsql function the whole loop runs in
the caller's single transaction, so batching does NOT shorten lock
lifetime or split WAL across commits — for that, call it repeatedly from
separate transactions. Pending rows are never removed. Locked to
operators (not granted to PUBLIC).
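
To make the validation rules concrete, here is a Rust restatement of the
event-name pattern and the summary byte cap. It is only a sketch of the
rules as worded above: per this changelog the real checks live in SQL inside
`ask._outbox_emit`, and the function and constant names below are
hypothetical. It also assumes UTF-8 text, where `str::len` gives the same
count as `octet_length`.

```rust
const MAX_EVENT_NAME_CHARS: usize = 127;
const MAX_SUMMARY_BYTES: usize = 8192;

/// True if `name` is non-empty, at most 127 characters, starts with an ASCII
/// letter or digit, and continues with ASCII letters, digits, '.', '_', ':'
/// or '-' (the pattern `[A-Za-z0-9][A-Za-z0-9._:-]*`).
fn valid_event_name(name: &str) -> bool {
    let mut chars = name.chars();
    let Some(first) = chars.next() else {
        return false; // empty name
    };
    name.chars().count() <= MAX_EVENT_NAME_CHARS
        && first.is_ascii_alphanumeric()
        && chars.all(|c| c.is_ascii_alphanumeric() || matches!(c, '.' | '_' | ':' | '-'))
}

/// The summary cap is in bytes, so multi-byte text reaches it sooner than a
/// character count would suggest. `str::len` is the UTF-8 byte length.
fn valid_summary(summary: &str) -> bool {
    summary.len() <= MAX_SUMMARY_BYTES
}

fn main() {
    println!("{}", valid_event_name("order.created:v1"));
    println!("{}", valid_event_name("bad name\n"));
    // 4097 two-byte characters are 8194 bytes, which is over the cap.
    println!("{}", valid_summary(&"é".repeat(4097)));
}
```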

### Changed

- `pg_notify` is fired inside `ask._outbox_emit` (was a separate Rust SPI
call) and is `pg_catalog`-qualified, so a hostile `search_path` cannot
shadow it and a direct helper call can't write a row without waking
listeners.
- Dedup now compares `md5(payload::text)` instead of the `jsonb =` operator
— stable for logically-equal payloads (jsonb normalizes key order /
whitespace before the text cast) and far cheaper than an equality scan
over large jsonb values.
- `ask.emit` doc-comments and SQL comments are now consumer-agnostic
("an external `LISTEN pg_ask_events` consumer") rather than naming a
specific downstream; the orchestrator coupling lives only in ADR-0017.

## [0.5.7] — 2026-06-06 — Security hardening pass

A security / code-review pass closing several ways a low-privilege role
2 changes: 1 addition & 1 deletion Cargo.lock

Generated file; diff not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "pg_ask"
-version = "0.5.7"
+version = "0.5.9"
edition = "2021"
rust-version = "1.82"
# ^ Minimum Supported Rust Version. pgrx 0.18 requires 1.82+.
11 changes: 6 additions & 5 deletions Dockerfile
@@ -90,11 +90,12 @@ RUN mkdir -p /staging/lib /staging/extension \
&& cp /usr/share/postgresql/${PG_MAJOR}/extension/pg_ask.control \
/usr/share/postgresql/${PG_MAJOR}/extension/pg_ask*.sql \
/staging/extension/ \
-&& (cp sql/pg_ask--0.5.5--0.5.6.sql /staging/extension/ 2>/dev/null || true) \
-&& (cp sql/pg_ask--0.5.4--0.5.5.sql /staging/extension/ 2>/dev/null || true) \
-&& (cp sql/pg_ask--0.5.3--0.5.4.sql /staging/extension/ 2>/dev/null || true) \
-&& (cp sql/pg_ask--0.5.2--0.5.3.sql /staging/extension/ 2>/dev/null || true) \
-&& (cp sql/pg_ask--0.5.1--0.5.2.sql /staging/extension/ 2>/dev/null || true)
+&& (cp sql/pg_ask--*--*.sql /staging/extension/ 2>/dev/null || true)
+# Note: the cp above bundles EVERY hand-written upgrade script via glob
+# (not a hardcoded list) so any older install can step through with
+# ALTER EXTENSION UPDATE. The previous explicit list silently omitted new
+# paths (e.g. 0.5.6->0.5.7, 0.5.7->0.5.8); globbing keeps this correct as
+# new migrations land, matching the deb/apk/rpm packaging scripts.

# ──────────────────────────────────────────────────────────────────────────────
# Stage 2: runtime
13 changes: 13 additions & 0 deletions docker/initdb/01-configure.sh
@@ -24,6 +24,19 @@ run_sql() {

configured=0

# Async job launcher support: the background-worker launcher runs in the
# 'postgres' maintenance DB and uses dblink to discover which databases have
# pg_ask installed (so it only spawns workers where there is work). Install
# dblink there if available; harmless when the async queue is unused. Without
# it the launcher falls back to probing every database via a short-lived
# worker (noisier, but still correct).
if psql --username "$POSTGRES_USER" --dbname postgres -tAc \
"SELECT 1 FROM pg_available_extensions WHERE name='dblink'" | grep -q 1; then
echo "[pg_ask] installing dblink in 'postgres' DB for the async job launcher"
psql --username "$POSTGRES_USER" --dbname postgres \
-c "CREATE EXTENSION IF NOT EXISTS dblink;" || true
fi

if [ -n "${PG_ASK_PROVIDER:-}" ]; then
echo "[pg_ask] setting provider = $PG_ASK_PROVIDER"
run_sql "SELECT ask.config('provider', '$PG_ASK_PROVIDER');"
2 changes: 1 addition & 1 deletion pg_ask.control
@@ -1,5 +1,5 @@
comment = 'Agent-driven natural language interface for PostgreSQL'
-default_version = '0.5.7'
+default_version = '0.5.9'
module_pathname = '$libdir/pg_ask'
# Do NOT set `schema = 'ask'` here. pgrx already emits an explicit
# `CREATE SCHEMA IF NOT EXISTS ask` plus fully-qualified `ask.*`