Summary
backend/src/analytics_agent/db/base.py creates the async engine with only echo and SQLite's check_same_thread — no pool parameters at all. For MySQL / PostgreSQL
(asyncpg), this causes two problems:
- Critical bug:
pool_pre_ping defaults to False. Under concurrent load, when a request is cancelled (client disconnect, upstream timeout, asyncio.CancelledError,
etc.), asyncpg leaves that connection in a broken state and returns it to the pool. The next checkout that grabs it hangs forever — no exception, no log line, just a
silent stall. Symptom: after a burst of load the backend stops responding to any DB-backed request, but the process is still alive and the logs look clean.
- Secondary problem:
pool_size / max_overflow / pool_recycle / pool_timeout all fall back to SQLAlchemy defaults (5 / 10 / -1 / 30). That's too small for
production deployments and there's no way to tune it without code changes.
Reproduction
Requires Postgres + asyncpg:
- Point
DATABASE_URL at Postgres (postgresql+asyncpg://...).
- Start the backend and hit any DB-backed endpoint with concurrency ≥
pool_size. Cancel a portion of those requests mid-flight (curl Ctrl-C, or just close the browser
tab).
- After a few rounds, subsequent requests block forever on session checkout. Nothing short of restarting the process recovers.
SQLite is unaffected, so this is invisible in the default local-dev configuration.
Proposed fix
Expose the standard SQLAlchemy pool knobs as environment variables with sane defaults, applied only to non-SQLite engines. pool_pre_ping=True is the key bug fix.
| Env |
Default |
DB_POOL_SIZE |
10 |
DB_MAX_OVERFLOW |
20 |
DB_POOL_RECYCLE |
1800 |
DB_POOL_PRE_PING |
true |
DB_POOL_TIMEOUT |
10 |
Touches three files: .env.example, backend/src/analytics_agent/config.py, backend/src/analytics_agent/db/base.py — +33 / -5 lines total.
Reference
I implemented this fix on a fork and have been running it for a while without issues:
Happy to clean this up into a PR if maintainers are open to the approach.
Summary
backend/src/analytics_agent/db/base.pycreates the async engine with onlyechoand SQLite'scheck_same_thread— no pool parameters at all. For MySQL / PostgreSQL(asyncpg), this causes two problems:
pool_pre_pingdefaults toFalse. Under concurrent load, when a request is cancelled (client disconnect, upstream timeout,asyncio.CancelledError,etc.), asyncpg leaves that connection in a broken state and returns it to the pool. The next checkout that grabs it hangs forever — no exception, no log line, just a
silent stall. Symptom: after a burst of load the backend stops responding to any DB-backed request, but the process is still alive and the logs look clean.
pool_size/max_overflow/pool_recycle/pool_timeoutall fall back to SQLAlchemy defaults (5 / 10 / -1 / 30). That's too small forproduction deployments and there's no way to tune it without code changes.
Reproduction
Requires Postgres + asyncpg:
DATABASE_URLat Postgres (postgresql+asyncpg://...).pool_size. Cancel a portion of those requests mid-flight (curl Ctrl-C, or just close the browsertab).
SQLite is unaffected, so this is invisible in the default local-dev configuration.
Proposed fix
Expose the standard SQLAlchemy pool knobs as environment variables with sane defaults, applied only to non-SQLite engines.
pool_pre_ping=Trueis the key bug fix.DB_POOL_SIZE10DB_MAX_OVERFLOW20DB_POOL_RECYCLE1800DB_POOL_PRE_PINGtrueDB_POOL_TIMEOUT10Touches three files:
.env.example,backend/src/analytics_agent/config.py,backend/src/analytics_agent/db/base.py— +33 / -5 lines total.Reference
I implemented this fix on a fork and have been running it for a while without issues:
feat(db): configurable connection pool tuning for MySQL / PostgreSQLfeat/db-pool-configHappy to clean this up into a PR if maintainers are open to the approach.