Skip to content

Performance improvements#615

Merged
ErikBjare merged 7 commits into
ActivityWatch:masterfrom
0xbrayo:performance-improvements
Jun 12, 2026
Merged

Performance improvements#615
ErikBjare merged 7 commits into
ActivityWatch:masterfrom
0xbrayo:performance-improvements

Conversation

@0xbrayo

@0xbrayo 0xbrayo commented Jun 11, 2026

Copy link
Copy Markdown
Member

Asked claude fable to identify some performance tweaks, I think these are worthwhile.

The Datastore handle is a cheap crossbeam channel sender to the DB worker
thread, which already serializes all database access. Wrapping it in a
Mutex serialized every HTTP request on top of that, so a slow /query
blocked all watcher heartbeats. Drop the mutex and the
endpoints_get_lock! macro so Rocket workers can run requests in parallel
up to the DB worker.

The mutex's poisoning check was also the only path that turned a dead DB
worker into an HTTP error, so the Datastore channel requester now maps
send/receive failures (worker thread panicked or exited) to
DatastoreError::InternalError instead of unwrapping; endpoints degrade
to a 5xx response instead of panicking per request.

Note: /import dedup (fetch existing -> filter -> insert) is no longer
atomic w.r.t. concurrent inserts into the same bucket; acceptable since
imports target buckets from other hosts and dedup is best-effort.

Includes clippy autofixes applied by the pre-commit hook.
@greptile-apps

greptile-apps Bot commented Jun 11, 2026

Copy link
Copy Markdown

Greptile Summary

This PR applies a batch of performance and correctness improvements identified with AI assistance. The most impactful change removes the Mutex<Datastore> wrapper from ServerState — since Datastore is already a channel-sender handle to a single worker thread, the outer mutex was redundant and serialized all concurrent HTTP requests unnecessarily.

  • Mutex removal: ServerState.datastore is now a plain Datastore, eliminating lock contention across all endpoint handlers; a new Datastore::request() helper replaces repetitive .unwrap() chains with proper error propagation.
  • DB schema migration v4→v5: Replaces three single-column event indexes with one composite (bucketrow, starttime DESC, endtime) index inside a BEGIN EXCLUSIVE TRANSACTION, serving seek+range+sort in a single pass and reducing insert overhead; a migration test is included.
  • Code quality: prepare_cached replaces prepare for repeated statements, TryFrom<DataType> by-value impls avoid unnecessary event-vector clones in query functions, and style nits (sort_by_key, +=, is_empty(), vec![] literals) are cleaned up throughout.

Confidence Score: 5/5

Safe to merge — all changes are additive optimisations with no behavioural regressions; the mutex removal is sound because Datastore is already a thread-safe channel handle.

The mutex removal is architecturally correct: Datastore wraps an mpsc_requests RequestSender whose channel already serialises DB access on the worker thread. The migration runs inside a proper EXCLUSIVE transaction. The by-value TryFrom conversions preserve all existing by-ref delegates. The prepare_cached calls operate on a single-threaded connection. No logic changes that could silently alter data.

No files require special attention.

Important Files Changed

Filename Overview
aw-server/src/endpoints/mod.rs Removes Mutex wrapper from ServerState.datastore; adds explanatory comment clarifying why no mutex is needed (Datastore is already a channel handle)
aw-datastore/src/worker.rs Adds Datastore::request() helper replacing unwrap-heavy boilerplate; all public methods simplified; close() now warns instead of panicking on worker-gone error
aw-datastore/src/datastore.rs Adds v4->v5 migration inside BEGIN EXCLUSIVE TRANSACTION replacing three single-column indexes with one composite index; switches all prepare() calls to prepare_cached()
aw-datastore/tests/datastore.rs Adds test_migration_v4_to_v5 that constructs a v4 database, opens it via Datastore, then verifies version bump and index replacement via raw rusqlite queries
aw-query/src/datatype.rs Adds TryFrom (by-value) for all conversion targets; by-ref impls now delegate via clone; Vec::with_capacity pre-allocations added throughout
aw-query/src/functions.rs All query functions consume args via into_iter() instead of borrowing, eliminating per-call event-vector clones
aw-server/src/endpoints/util.rs Removes the endpoints_get_lock! macro (no longer needed after Mutex removal)
aw-client-rust/src/single_instance.rs Adds .truncate(true) to lockfile OpenOptions on Unix; switches to log::error! qualified path to avoid unused-import warning on non-Unix targets

Sequence Diagram

sequenceDiagram
    participant H as HTTP Handler
    participant S as ServerState
    participant D as Datastore
    participant W as DB Worker Thread

    note over S: datastore field is now a plain Datastore handle, no Mutex

    H->>S: borrow state.datastore
    H->>D: datastore.get_events(...)
    D->>W: send Command via channel
    W-->>D: Result of Response
    D-->>H: Ok(events) or Err(DatastoreError)

    note over W: Single SQLite connection serialised by channel. prepare_cached reuses statements across calls.
Loading

Reviews (3): Last reviewed commit: "perf(datastore): cache prepared statemen..." | Re-trigger Greptile

Comment thread aw-query/src/datatype.rs
Comment thread aw-query/src/datatype.rs
Comment thread aw-query/src/datatype.rs
Comment thread aw-datastore/src/worker.rs Outdated
@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 81.71206% with 47 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.10%. Comparing base (656f3c9) to head (f3705be).
⚠️ Report is 59 commits behind head on master.

Files with missing lines Patch % Lines
aw-datastore/src/worker.rs 67.34% 16 Missing ⚠️
aw-query/src/datatype.rs 74.54% 14 Missing ⚠️
aw-query/src/functions.rs 78.00% 11 Missing ⚠️
aw-datastore/src/datastore.rs 89.47% 2 Missing ⚠️
aw-datastore/tests/datastore.rs 94.59% 2 Missing ⚠️
aw-server/src/endpoints/bucket.rs 92.30% 1 Missing ⚠️
aw-server/src/main.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #615      +/-   ##
==========================================
+ Coverage   70.81%   77.10%   +6.28%     
==========================================
  Files          51       62      +11     
  Lines        2916     4940    +2024     
==========================================
+ Hits         2065     3809    +1744     
- Misses        851     1131     +280     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@0xbrayo 0xbrayo force-pushed the performance-improvements branch from ed8c66f to d16304a Compare June 11, 2026 11:35
@0xbrayo

0xbrayo commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

@greptile-apps review

@0xbrayo 0xbrayo force-pushed the performance-improvements branch from d16304a to 64a9ceb Compare June 11, 2026 11:57
@greptile-apps

greptile-apps Bot commented Jun 11, 2026

Copy link
Copy Markdown

Want your agent to iterate on Greptile's feedback? Try greploops.

0xbrayo added 4 commits June 11, 2026 15:15
Query functions received their arguments by value but converted them
through TryFrom<&DataType> impls, which deep-copied the contained
events twice: once cloning the list out of the DataType, and once more
cloning each event out of an already-owned value.

Add by-value TryFrom<DataType> conversions that move the data out, and
make query functions consume their owned argument vector. The by-ref
impls remain as delegating wrappers for callers that cannot give up
ownership (e.g. tests).

The remaining clone at variable lookup in the interpreter is the one
necessary copy per reference: the env must retain the value since it
may be referenced again, while transforms need owned events. Event-list
allocation per function call drops roughly 3x.

Includes clippy autofixes applied by the pre-commit hook.
…index (db v5)

Every event query filters on bucketrow plus a starttime/endtime range,
ordered by starttime. The single-column indexes could only cover one
predicate, leaving the rest as scan + sort. The composite index
(bucketrow, starttime DESC, endtime) serves the seek, the range scan
and the ORDER BY in one pass, with endtime checked from the index
without fetching the row. Dropping the three single-column indexes also
makes inserts cheaper.

starttime is DESC in the index so a forward scan yields newest-first
order with equal-timestamp events in insertion order, preserving the
ordering callers observed before.

Adds a v4->v5 migration test that builds an old-schema database and
verifies the upgrade path, with rusqlite as a dev-dependency for it.

Includes clippy/lint cleanups across crates.
Every datastore method re-prepared its SQL statement on each call; the
heartbeat path alone prepares up to three statements several times per
second per watcher. Use rusqlite's prepare_cached so statements are
compiled once per connection and reused. The cache lives on the
Connection, so statements prepared through a Transaction persist across
transactions.
@0xbrayo 0xbrayo force-pushed the performance-improvements branch from 64a9ceb to f3705be Compare June 11, 2026 12:16
@ErikBjare

Copy link
Copy Markdown
Member

Nice! Any attempt at measuring the improvement?

@0xbrayo

0xbrayo commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

Not yet, will do and report back.

@0xbrayo

0xbrayo commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

Performance test report: performance-improvements vs master

Date: 2026-06-12
Branch under test: performance-improvements (f3705be), baseline master (a9d0fcf)
Test data: byte-for-byte snapshot of the live database (98 MB, 544,688 events, ~12.5 months of data, largest bucket aw-watcher-window with 309k events), taken with SQLite's online backup API and used as sqlite-testing.db in --testing mode (port 5666, release builds).

Methodology

  • The live DB was snapshotted to a golden copy; each server run started from a fresh copy, so master always saw schema v4 and the branch performed its v4→v5 migration on first start.
  • Three binaries were benchmarked: master, a midpoint at de5a12c (mutex removal + deep-copy elimination, before the index change) to attribute gains per commit, and the full branch.
  • Harness: Python over HTTP against 127.0.0.1:5666 — sequential latency benchmarks (warmup + N iterations, p50 reported), concurrent-throughput benchmarks (8/4 parallel clients), heartbeat write throughput, and a heartbeat-latency-under-query-load scenario.
  • Correctness check: every benchmarked endpoint returned byte-identical response sizes on master and branch (same data, same ordering).

Results

REST endpoints (sequential, p50 latency)

Benchmark master midpoint branch speedup (master→branch)
GET /buckets/ 0.24 ms 0.24 ms 0.24 ms 1.0×
events ?limit=100 137.5 ms 138.5 ms 0.33 ms 417×
events ?limit=1000 201.5 ms 202.1 ms 0.97 ms 208×
events ?limit=10000 289.3 ms 291.3 ms 8.0 ms 36×
events, 1-day range 31.1 ms 30.9 ms 9.4 ms 3.3×
events, 1-week range 37.4 ms 37.0 ms 14.1 ms 2.7×
events, 1-month range 83.9 ms 83.4 ms 36.9 ms 2.3×
events, mid-history week 36.8 ms 36.6 ms 9.8 ms 3.8×
event count (full bucket) 30.7 ms 30.7 ms 6.9 ms 4.5×

/query endpoint (realistic aw-webui desktop query: flood + AFK filter + merge by app)

Time range master midpoint branch speedup
1 day 42.7 ms 40.1 ms 12.3 ms 3.5×
1 week 56.2 ms 49.4 ms 19.8 ms 2.8×
1 month 145.6 ms 116.3 ms 62.5 ms 2.3×
1 year 1074 ms 783 ms 497 ms 2.2×
browser query, 1 week 13.0 ms 11.3 ms 3.9 ms 3.3×
7 × 1-day periods 276.7 ms 269.3 ms 71.6 ms 3.9×

Concurrency (8 parallel clients)

Benchmark master midpoint branch
day-queries, throughput 23.8 req/s 26.2 req/s 100.1 req/s (4.2×)
day-queries, p50 latency 336 ms 304 ms 79 ms (4.3×)
event-range fetches, throughput 33.4 req/s 33.5 req/s 121.4 req/s (3.6×)

Heartbeat latency while a heavy (year-range) query runs — the watcher-starvation scenario

master branch
heartbeat p50 994 ms 0.64 ms (~1500×)
heartbeat p95 1012 ms 72 ms
heartbeats served in 12 s 12 165
concurrent queries completed 13 28

On master every heartbeat waits for the entire in-flight query to release the global mutex. On the branch, heartbeats interleave with the query's datastore calls almost immediately.

Write throughput (sequential heartbeats)

master branch
distinct-data heartbeats 3988/s 4085/s (+2.4%)
merging heartbeats 4330/s 4438/s (+2.5%)

Startup / migration

  • master startup: ~0.2 s.
  • branch first startup (v4→v5 migration of 544k events): ~1.2 s, one-time; subsequent startups: 0.07 s.
  • Index footprint: three single-column indexes (29.9 MB) → one composite index (15.3 MB), a 49% reduction in index space. The file grows by ~15 MB during migration because the new index is created before the old ones are dropped; the freed 29.9 MB stays on the freelist and is reused by future writes (or reclaimed by VACUUM).

Attribution by commit

  • c58421f composite index (db v5) — the dominant win for reads. The headline 417×/208× on ?limit=N fetches comes from the query plan: v4 used events_bucketrow_index and then sorted all 309k bucket rows in a temp B-tree on every request (USE TEMP B-TREE FOR ORDER BY); v5 walks (bucketrow, starttime DESC, endtime) already in output order and stops at the LIMIT. Range fetches improve 2–4× by covering all three predicates in one index. This is what watchers' "get last event" polling and aw-webui bucket views hit constantly.
  • de5a12c deep-copy elimination — visible as master→midpoint on /query: 6% (day) growing to 27% (year), since the redundant copies scaled with the amount of event data flowing through the query.
  • 9377bda mutex removal — barely moves identical-request throughput (the DB worker thread still serializes actual DB access), but it is the entire ~1500× heartbeat-under-load result: cheap requests no longer queue behind slow ones. This directly mitigates watcher data-loss/lag while dashboards run heavy queries.
  • f3705be prepared-statement cache — marginal on the write path (+2.4%); per-request cost there is dominated by the channel round-trip and JSON handling, and statement re-prepare was a small fraction.

Caveats

  • Single machine (macOS, Apple Silicon), local loopback, one run per configuration; numbers are stable within runs (tight p50/p95) but treat small deltas (<10%) as noise.
  • Heartbeat benchmarks used a scratch bucket; real watcher payloads are slightly larger.
  • The v4→v5 migration is one-way; after running the branch against a DB, master can still open it but would recreate its old indexes only on a fresh DB (the migration does not downgrade).

Artifacts

  • Raw results and harness: /tmp/aw-perf-test/ (results-{master,midpoint,branch}.json, hb-under-load-*.json, bench.py, latency_under_load.py, server logs, golden DB snapshot)
  • Previous sqlite-testing.db preserved as sqlite-testing.db.bak-pre-perf-test next to the live DB.

0xbrayo added 2 commits June 12, 2026 13:31
…ration

Creating the composite index before dropping the old ones grew the
database file by the new index's size (~15 MB on a year of data),
since the old indexes' pages were only freed afterwards. Dropping
first lets the same transaction reuse those pages: verified on a
98 MB production snapshot, file size is now byte-identical before
and after the migration.
keep_alive = 0 was added in 2020 (7741a8a) to work around Rocket
issue #1254, a keep-alive bug in Rocket 0.4's synchronous hyper 0.10
core. Rocket 0.5's async rewrite does not have that bug, so the
workaround now only forces a TCP connection setup/teardown on every
single request and surprises clients that expect RFC-compliant
keep-alive (the connection is closed without a Connection: close
header, which python http.client reports as RemoteDisconnected).

Measured on a 98 MB production snapshot: +36% heartbeat throughput
(4085 -> 5572/s) and ~30% lower latency on small requests; graceful
shutdown remains immediate with idle keep-alive connections open
(verified, 0.05s).
@ErikBjare

Copy link
Copy Markdown
Member

Damn, nice! Ready to merge?

@greptileai review

@0xbrayo

0xbrayo commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

Not yet, revisiting a WAL change that I had dropped earlier. Still more improvements on the table

@ErikBjare

ErikBjare commented Jun 12, 2026

Copy link
Copy Markdown
Member

I still think it could be worth merging this as is, given 5/5 Greptile and significant improvements already. Further improvements can be against the new baseline. Unless a second db migration is needed or something.

@0xbrayo

0xbrayo commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

Follow-up: three more improvements since the report above

Same harness, same 98 MB / 544k-event production snapshot as the report above. Three commits added since f3705be:

c0bdc9d — v5 migration drops the old indexes before creating the composite. As reported, the migration grew the db file by ~15 MB because the new index was built before the old indexes' pages were freed. Dropping first lets the same transaction reuse those pages — re-verified on the snapshot: file size is byte-identical before and after migration, integrity_check ok, migration still ~0.7 s.

ddd68d4 — re-enabled HTTP keep-alive. The benchmark client kept seeing connections dropped without a Connection: close header, which led to keep_alive = 0 in config.rs — a 2020 workaround (7741a8a) for Rocket 0.4 issue #1254 that survived the Rocket 0.5 upgrade. Rocket 0.5's rewritten async core doesn't have that bug, so since then the workaround has only been forcing TCP setup/teardown on every single request from every watcher, and surprising HTTP clients that expect RFC-compliant keep-alive (python's http.client reports it as RemoteDisconnected). Graceful shutdown verified still instant (0.05 s) with idle keep-alive connections open.

dbe2313 — WAL journal mode with explicit synchronous=FULL. Each batch commit is now a single sequential WAL append+fsync instead of delete-mode's two fsyncs plus rollback-journal file churn, and commits no longer block readers (groundwork for read-connection parallelism later). Durability is unchanged: synchronous=FULL syncs the WAL on every commit, and the worker's existing batch-commit policy (15 s / 100 events / bucket change) is untouched. No second db migration involvedjournal_mode=WAL is set at connection open and persists in the db header; any SQLite ≥ 3.7.0 (2010) can still open the file. In-memory datastores (tests, --no-db) are unaffected. Verified on the snapshot: WAL active, clean checkpoint on shutdown, integrity ok.

Measured impact

build heartbeats/s (distinct data) GET events?limit=100 p50
master 3,988 137.5 ms
report above (f3705be) 4,085 0.33 ms
+ keep-alive (ddd68d4) 5,572 0.22 ms
+ WAL (dbe2313) 5,815 (+42% vs report, +46% vs master) 0.22 ms

Read-side numbers (range fetches, /query, concurrency) are unchanged from the report above — within run-to-run noise across all three builds.

One correction to the original report: the "branch startup 1.2 s (migration)" figure measured Rocket readiness, not the migration — the server starts accepting HTTP before the DB worker finishes migrating (/api/0/info responds immediately; the first datastore-touching request blocks until the migration commits, ~0.7 s on this snapshot).

@0xbrayo

0xbrayo commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

I'll drop the new commits and make a new PR.

@0xbrayo 0xbrayo force-pushed the performance-improvements branch from dbe2313 to ddd68d4 Compare June 12, 2026 10:51
Comment thread aw-datastore/src/worker.rs Outdated
Comment on lines +142 to +157
// WAL turns each commit into a single sequential WAL append+fsync where
// delete mode paid two fsyncs plus journal-file churn, and lets future
// reader connections proceed while a commit is in flight.
// synchronous=FULL is set explicitly (rather than relying on the
// default) so a commit remains durable on disk the moment it returns;
// with NORMAL the WAL is only synced at checkpoints, which would
// silently widen the loss window on power failure.
// In-memory databases ignore the request (journal_mode stays "memory").
let journal_mode: String = conn
.pragma_update_and_check(None, "journal_mode", "WAL", |row| row.get(0))
.expect("Failed to query journal_mode");
if !matches!(&method, DatastoreMethod::Memory()) && journal_mode != "wal" {
warn!("Failed to enable WAL (journal_mode={journal_mode}), continuing without it");
}
conn.pragma_update(None, "synchronous", "FULL")
.expect("Failed to set synchronous=FULL");

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really a good idea? I though WAL with synchronous=NORMAL was best-practice for many small writes like the ones from watchers? Doesn't it lock during flush? Also accelerates SSD wear (might just be a small effect though)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can discuss in the new PR :)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The datastore batches the writes, done every 15 seconds, WAL synchronous=FULL only makes it a single fsync.

Claude wrote: Erik's premise is right in general — WAL+NORMAL is the standard advice when every small write is its own
transaction. But that advice doesn't transfer to this codebase, because of something we confirmed while
making the change: aw-server already does group commit at the application level. The DB worker wraps all
writes in one long-lived transaction and commits every 15 s / 100 events / on bucket changes (worker.rs work
loop). The per-commit fsync that NORMAL exists to eliminate happens here about once per 15 seconds, not once
per heartbeat.

Taking his three concerns one at a time:

"Best practice for many small writes" — the many small watcher writes never individually hit the WAL or an
fsync; they accumulate in the open transaction. synchronous only matters at commit time, and commits are on a
15s cadence. The benchmark backs this up: going from delete+FULL to WAL+FULL increased throughput (5,572 →
5,815 heartbeats/s) while hammering writes thousands of times faster than real watchers do. The fsync simply
isn't in the per-write path. NORMAL would save ~4 fsyncs per minute — unmeasurable.

"Doesn't it lock during flush?" — no, that's the part WAL specifically fixes. The commit fsync is performed
by the writer while holding only the WAL write lock; readers keep reading their snapshot and never block (in
delete mode they did block — that's one of the reasons we switched). The only thing waiting on the flush is
the single writer thread itself, which is inside its own commit regardless of sync mode, and there are no
other writers to contend with.

SSD wear — compared to what shipped before (delete mode at SQLite's default FULL: two fsyncs per commit plus
rollback-journal file create/delete churn every 15s), WAL+FULL is strictly less sync and file churn, not
more. NORMAL would reduce it further, but from "a few fsyncs per minute" to "slightly fewer" — genuinely
negligible either way, as he suspected.

And the cost side of NORMAL, which the general advice tends to gloss over: with NORMAL the WAL is only synced
at checkpoints, and autocheckpoint triggers on page count (1000 pages), not time. At AW's trickle write rate
that can be a very long time — a power loss (the relevant failure for laptops; app crashes lose nothing
under either mode) could drop not just the current 15s batch but potentially hours of activity since the last
checkpoint. To bound that you'd add a time-based wal_checkpoint on a timer — at which point you're fsyncing
on a schedule again and have rebuilt FULL with extra steps, since commits here are already timer-driven.

So my recommendation stands: FULL preserves the exact durability the server has always had, with strictly
fewer syncs than before, and NORMAL's upside is ~zero because of the existing batching. NORMAL wouldn't be
wrong — losing some window-tracking data on a power cut isn't catastrophic — it just buys nothing here.

Want me to post this as a reply on the PR (trimmed to comment length)? I'd lead with "good instinct in
general, but the worker already group-commits every 15s, so the fsync isn't per-write" since that's the fact
that resolves all three concerns.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair, FULL it is then.

@0xbrayo

0xbrayo commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

Ready to merge

@ErikBjare ErikBjare merged commit 73ae710 into ActivityWatch:master Jun 12, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants