Performance improvements#615
Conversation
The Datastore handle is a cheap crossbeam channel sender to the DB worker thread, which already serializes all database access. Wrapping it in a Mutex serialized every HTTP request on top of that, so a slow /query blocked all watcher heartbeats. Drop the mutex and the endpoints_get_lock! macro so Rocket workers can run requests in parallel up to the DB worker. The mutex's poisoning check was also the only path that turned a dead DB worker into an HTTP error, so the Datastore channel requester now maps send/receive failures (worker thread panicked or exited) to DatastoreError::InternalError instead of unwrapping; endpoints degrade to a 5xx response instead of panicking per request. Note: /import dedup (fetch existing -> filter -> insert) is no longer atomic w.r.t. concurrent inserts into the same bucket; acceptable since imports target buckets from other hosts and dedup is best-effort. Includes clippy autofixes applied by the pre-commit hook.
Greptile SummaryThis PR applies a batch of performance and correctness improvements identified with AI assistance. The most impactful change removes the
Confidence Score: 5/5Safe to merge — all changes are additive optimisations with no behavioural regressions; the mutex removal is sound because Datastore is already a thread-safe channel handle. The mutex removal is architecturally correct: Datastore wraps an mpsc_requests RequestSender whose channel already serialises DB access on the worker thread. The migration runs inside a proper EXCLUSIVE transaction. The by-value TryFrom conversions preserve all existing by-ref delegates. The prepare_cached calls operate on a single-threaded connection. No logic changes that could silently alter data. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant H as HTTP Handler
participant S as ServerState
participant D as Datastore
participant W as DB Worker Thread
note over S: datastore field is now a plain Datastore handle, no Mutex
H->>S: borrow state.datastore
H->>D: datastore.get_events(...)
D->>W: send Command via channel
W-->>D: Result of Response
D-->>H: Ok(events) or Err(DatastoreError)
note over W: Single SQLite connection serialised by channel. prepare_cached reuses statements across calls.
Reviews (3): Last reviewed commit: "perf(datastore): cache prepared statemen..." | Re-trigger Greptile |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #615 +/- ##
==========================================
+ Coverage 70.81% 77.10% +6.28%
==========================================
Files 51 62 +11
Lines 2916 4940 +2024
==========================================
+ Hits 2065 3809 +1744
- Misses 851 1131 +280 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
ed8c66f to
d16304a
Compare
|
@greptile-apps review |
d16304a to
64a9ceb
Compare
|
Want your agent to iterate on Greptile's feedback? Try greploops. |
Query functions received their arguments by value but converted them through TryFrom<&DataType> impls, which deep-copied the contained events twice: once cloning the list out of the DataType, and once more cloning each event out of an already-owned value. Add by-value TryFrom<DataType> conversions that move the data out, and make query functions consume their owned argument vector. The by-ref impls remain as delegating wrappers for callers that cannot give up ownership (e.g. tests). The remaining clone at variable lookup in the interpreter is the one necessary copy per reference: the env must retain the value since it may be referenced again, while transforms need owned events. Event-list allocation per function call drops roughly 3x. Includes clippy autofixes applied by the pre-commit hook.
…index (db v5) Every event query filters on bucketrow plus a starttime/endtime range, ordered by starttime. The single-column indexes could only cover one predicate, leaving the rest as scan + sort. The composite index (bucketrow, starttime DESC, endtime) serves the seek, the range scan and the ORDER BY in one pass, with endtime checked from the index without fetching the row. Dropping the three single-column indexes also makes inserts cheaper. starttime is DESC in the index so a forward scan yields newest-first order with equal-timestamp events in insertion order, preserving the ordering callers observed before. Adds a v4->v5 migration test that builds an old-schema database and verifies the upgrade path, with rusqlite as a dev-dependency for it. Includes clippy/lint cleanups across crates.
Every datastore method re-prepared its SQL statement on each call; the heartbeat path alone prepares up to three statements several times per second per watcher. Use rusqlite's prepare_cached so statements are compiled once per connection and reused. The cache lives on the Connection, so statements prepared through a Transaction persist across transactions.
64a9ceb to
f3705be
Compare
|
Nice! Any attempt at measuring the improvement? |
|
Not yet, will do and report back. |
Performance test report:
|
| Benchmark | master | midpoint | branch | speedup (master→branch) |
|---|---|---|---|---|
GET /buckets/ |
0.24 ms | 0.24 ms | 0.24 ms | 1.0× |
events ?limit=100 |
137.5 ms | 138.5 ms | 0.33 ms | 417× |
events ?limit=1000 |
201.5 ms | 202.1 ms | 0.97 ms | 208× |
events ?limit=10000 |
289.3 ms | 291.3 ms | 8.0 ms | 36× |
| events, 1-day range | 31.1 ms | 30.9 ms | 9.4 ms | 3.3× |
| events, 1-week range | 37.4 ms | 37.0 ms | 14.1 ms | 2.7× |
| events, 1-month range | 83.9 ms | 83.4 ms | 36.9 ms | 2.3× |
| events, mid-history week | 36.8 ms | 36.6 ms | 9.8 ms | 3.8× |
| event count (full bucket) | 30.7 ms | 30.7 ms | 6.9 ms | 4.5× |
/query endpoint (realistic aw-webui desktop query: flood + AFK filter + merge by app)
| Time range | master | midpoint | branch | speedup |
|---|---|---|---|---|
| 1 day | 42.7 ms | 40.1 ms | 12.3 ms | 3.5× |
| 1 week | 56.2 ms | 49.4 ms | 19.8 ms | 2.8× |
| 1 month | 145.6 ms | 116.3 ms | 62.5 ms | 2.3× |
| 1 year | 1074 ms | 783 ms | 497 ms | 2.2× |
| browser query, 1 week | 13.0 ms | 11.3 ms | 3.9 ms | 3.3× |
| 7 × 1-day periods | 276.7 ms | 269.3 ms | 71.6 ms | 3.9× |
Concurrency (8 parallel clients)
| Benchmark | master | midpoint | branch |
|---|---|---|---|
| day-queries, throughput | 23.8 req/s | 26.2 req/s | 100.1 req/s (4.2×) |
| day-queries, p50 latency | 336 ms | 304 ms | 79 ms (4.3×) |
| event-range fetches, throughput | 33.4 req/s | 33.5 req/s | 121.4 req/s (3.6×) |
Heartbeat latency while a heavy (year-range) query runs — the watcher-starvation scenario
| master | branch | |
|---|---|---|
| heartbeat p50 | 994 ms | 0.64 ms (~1500×) |
| heartbeat p95 | 1012 ms | 72 ms |
| heartbeats served in 12 s | 12 | 165 |
| concurrent queries completed | 13 | 28 |
On master every heartbeat waits for the entire in-flight query to release the global mutex. On the branch, heartbeats interleave with the query's datastore calls almost immediately.
Write throughput (sequential heartbeats)
| master | branch | |
|---|---|---|
| distinct-data heartbeats | 3988/s | 4085/s (+2.4%) |
| merging heartbeats | 4330/s | 4438/s (+2.5%) |
Startup / migration
- master startup: ~0.2 s.
- branch first startup (v4→v5 migration of 544k events): ~1.2 s, one-time; subsequent startups: 0.07 s.
- Index footprint: three single-column indexes (29.9 MB) → one composite index (15.3 MB), a 49% reduction in index space. The file grows by ~15 MB during migration because the new index is created before the old ones are dropped; the freed 29.9 MB stays on the freelist and is reused by future writes (or reclaimed by
VACUUM).
Attribution by commit
c58421fcomposite index (db v5) — the dominant win for reads. The headline 417×/208× on?limit=Nfetches comes from the query plan: v4 usedevents_bucketrow_indexand then sorted all 309k bucket rows in a temp B-tree on every request (USE TEMP B-TREE FOR ORDER BY); v5 walks(bucketrow, starttime DESC, endtime)already in output order and stops at the LIMIT. Range fetches improve 2–4× by covering all three predicates in one index. This is what watchers' "get last event" polling and aw-webui bucket views hit constantly.de5a12cdeep-copy elimination — visible as master→midpoint on/query: 6% (day) growing to 27% (year), since the redundant copies scaled with the amount of event data flowing through the query.9377bdamutex removal — barely moves identical-request throughput (the DB worker thread still serializes actual DB access), but it is the entire ~1500× heartbeat-under-load result: cheap requests no longer queue behind slow ones. This directly mitigates watcher data-loss/lag while dashboards run heavy queries.f3705beprepared-statement cache — marginal on the write path (+2.4%); per-request cost there is dominated by the channel round-trip and JSON handling, and statement re-prepare was a small fraction.
Caveats
- Single machine (macOS, Apple Silicon), local loopback, one run per configuration; numbers are stable within runs (tight p50/p95) but treat small deltas (<10%) as noise.
- Heartbeat benchmarks used a scratch bucket; real watcher payloads are slightly larger.
- The v4→v5 migration is one-way; after running the branch against a DB, master can still open it but would recreate its old indexes only on a fresh DB (the migration does not downgrade).
Artifacts
- Raw results and harness:
/tmp/aw-perf-test/(results-{master,midpoint,branch}.json,hb-under-load-*.json,bench.py,latency_under_load.py, server logs, golden DB snapshot) - Previous
sqlite-testing.dbpreserved assqlite-testing.db.bak-pre-perf-testnext to the live DB.
…ration Creating the composite index before dropping the old ones grew the database file by the new index's size (~15 MB on a year of data), since the old indexes' pages were only freed afterwards. Dropping first lets the same transaction reuse those pages: verified on a 98 MB production snapshot, file size is now byte-identical before and after the migration.
keep_alive = 0 was added in 2020 (7741a8a) to work around Rocket issue #1254, a keep-alive bug in Rocket 0.4's synchronous hyper 0.10 core. Rocket 0.5's async rewrite does not have that bug, so the workaround now only forces a TCP connection setup/teardown on every single request and surprises clients that expect RFC-compliant keep-alive (the connection is closed without a Connection: close header, which python http.client reports as RemoteDisconnected). Measured on a 98 MB production snapshot: +36% heartbeat throughput (4085 -> 5572/s) and ~30% lower latency on small requests; graceful shutdown remains immediate with idle keep-alive connections open (verified, 0.05s).
|
Damn, nice! Ready to merge? @greptileai review |
|
Not yet, revisiting a WAL change that I had dropped earlier. Still more improvements on the table |
|
I still think it could be worth merging this as is, given 5/5 Greptile and significant improvements already. Further improvements can be against the new baseline. Unless a second db migration is needed or something. |
Follow-up: three more improvements since the report aboveSame harness, same 98 MB / 544k-event production snapshot as the report above. Three commits added since f3705be: c0bdc9d — v5 migration drops the old indexes before creating the composite. As reported, the migration grew the db file by ~15 MB because the new index was built before the old indexes' pages were freed. Dropping first lets the same transaction reuse those pages — re-verified on the snapshot: file size is byte-identical before and after migration, ddd68d4 — re-enabled HTTP keep-alive. The benchmark client kept seeing connections dropped without a dbe2313 — WAL journal mode with explicit Measured impact
Read-side numbers (range fetches, One correction to the original report: the "branch startup 1.2 s (migration)" figure measured Rocket readiness, not the migration — the server starts accepting HTTP before the DB worker finishes migrating ( |
|
I'll drop the new commits and make a new PR. |
dbe2313 to
ddd68d4
Compare
| // WAL turns each commit into a single sequential WAL append+fsync where | ||
| // delete mode paid two fsyncs plus journal-file churn, and lets future | ||
| // reader connections proceed while a commit is in flight. | ||
| // synchronous=FULL is set explicitly (rather than relying on the | ||
| // default) so a commit remains durable on disk the moment it returns; | ||
| // with NORMAL the WAL is only synced at checkpoints, which would | ||
| // silently widen the loss window on power failure. | ||
| // In-memory databases ignore the request (journal_mode stays "memory"). | ||
| let journal_mode: String = conn | ||
| .pragma_update_and_check(None, "journal_mode", "WAL", |row| row.get(0)) | ||
| .expect("Failed to query journal_mode"); | ||
| if !matches!(&method, DatastoreMethod::Memory()) && journal_mode != "wal" { | ||
| warn!("Failed to enable WAL (journal_mode={journal_mode}), continuing without it"); | ||
| } | ||
| conn.pragma_update(None, "synchronous", "FULL") | ||
| .expect("Failed to set synchronous=FULL"); |
There was a problem hiding this comment.
Is this really a good idea? I though WAL with synchronous=NORMAL was best-practice for many small writes like the ones from watchers? Doesn't it lock during flush? Also accelerates SSD wear (might just be a small effect though)
There was a problem hiding this comment.
We can discuss in the new PR :)
There was a problem hiding this comment.
The datastore batches the writes, done every 15 seconds, WAL synchronous=FULL only makes it a single fsync.
Claude wrote: Erik's premise is right in general — WAL+NORMAL is the standard advice when every small write is its own
transaction. But that advice doesn't transfer to this codebase, because of something we confirmed while
making the change: aw-server already does group commit at the application level. The DB worker wraps all
writes in one long-lived transaction and commits every 15 s / 100 events / on bucket changes (worker.rs work
loop). The per-commit fsync that NORMAL exists to eliminate happens here about once per 15 seconds, not once
per heartbeat.
Taking his three concerns one at a time:
"Best practice for many small writes" — the many small watcher writes never individually hit the WAL or an
fsync; they accumulate in the open transaction. synchronous only matters at commit time, and commits are on a
15s cadence. The benchmark backs this up: going from delete+FULL to WAL+FULL increased throughput (5,572 →
5,815 heartbeats/s) while hammering writes thousands of times faster than real watchers do. The fsync simply
isn't in the per-write path. NORMAL would save ~4 fsyncs per minute — unmeasurable.
"Doesn't it lock during flush?" — no, that's the part WAL specifically fixes. The commit fsync is performed
by the writer while holding only the WAL write lock; readers keep reading their snapshot and never block (in
delete mode they did block — that's one of the reasons we switched). The only thing waiting on the flush is
the single writer thread itself, which is inside its own commit regardless of sync mode, and there are no
other writers to contend with.
SSD wear — compared to what shipped before (delete mode at SQLite's default FULL: two fsyncs per commit plus
rollback-journal file create/delete churn every 15s), WAL+FULL is strictly less sync and file churn, not
more. NORMAL would reduce it further, but from "a few fsyncs per minute" to "slightly fewer" — genuinely
negligible either way, as he suspected.
And the cost side of NORMAL, which the general advice tends to gloss over: with NORMAL the WAL is only synced
at checkpoints, and autocheckpoint triggers on page count (1000 pages), not time. At AW's trickle write rate
that can be a very long time — a power loss (the relevant failure for laptops; app crashes lose nothing
under either mode) could drop not just the current 15s batch but potentially hours of activity since the last
checkpoint. To bound that you'd add a time-based wal_checkpoint on a timer — at which point you're fsyncing
on a schedule again and have rebuilt FULL with extra steps, since commits here are already timer-driven.
So my recommendation stands: FULL preserves the exact durability the server has always had, with strictly
fewer syncs than before, and NORMAL's upside is ~zero because of the existing batching. NORMAL wouldn't be
wrong — losing some window-tracking data on a power cut isn't catastrophic — it just buys nothing here.
Want me to post this as a reply on the PR (trimmed to comment length)? I'd lead with "good instinct in
general, but the worker already group-commits every 15s, so the fsync isn't per-write" since that's the fact
that resolves all three concerns.
There was a problem hiding this comment.
That's fair, FULL it is then.
|
Ready to merge |
Asked claude fable to identify some performance tweaks, I think these are worthwhile.