fix(daemon): WAKE/idle stability — end the consolidation CPU storm and serve recall on wake by Marsu6996 · Pull Request #20 · CodeAbra/iai-personal-memory-engine

Marsu6996 · 2026-06-20T15:52:22Z

Summary

At WAKE the daemon never actually idled. The consolidation interrupt-check
treated an open MCP socket — and the liveness watchdog's own {"type":"status"}
probe — as "activity", so the sleep cycle never completed and the wake hook
re-fired every ~30 s. On top of that, on a graph-cache miss every background
subsystem (boot preload, sigma audit, foraging, the hippea cascade) recomputed
the full community graph concurrently under the GIL until the liveness
watchdog's socket probe timed out, the daemon SIGKILLed itself, and launchd
relaunched it — a self-sustaining CPU storm. This is the root cause the
maintainer deferred from #17 to this PR. The daemon now genuinely hibernates
when idle, computes the runtime graph once per wake, and reliably serves its
socket on wake; CPU at rest drops from a continuous 90–300 % to 0 %.

This supersedes the original "hibernate when idle" change (kept as the first
commit) and is rebased onto v1.1.4.

What changed (one logical change per commit)

Hibernate when idle — the interrupt-check keys off last_activity_ts
recency only, and the socket server refreshes last_activity_ts solely for
dispatched JSON-RPC method calls (real recall/capture), never for control-plane
status probes. This is the core "always-busy" defect.
Single-flight build_runtime_graph — concurrent WAKE callers collapse into
one compute (leader computes + saves the cache, followers reload it cheaply),
with a bounded re-contend loop that degrades a leader failure / key-shift to
sequential single-flight, never an N-way re-storm. Recall is independent of
the community assignment, so a slightly-stale shared result is safe.
Widen the cache staleness window 10 → 250 — a normal day of capture
(+~150 records / +~1300 edges) crossed ~130 key buckets at WINDOW=10, missing
the cache on essentially every WAKE. The 25 h / dirty>50 fuse in
consult_overlay remains the real freshness backstop.
Drop the boot-preload double-save that overwrote the good cache with a
node_payload=None one, forcing a pandas re-read of every record on the next hit.
Signal wake before kickstart — WakeHandler.signal_wake() (symmetric to
consume_wake_signal) + a best-effort signal before the launchctl kickstart, so
a hibernating daemon enters WAKE and serves recall instead of hibernate-exiting
within a tick. Becomes essential once hibernation actually works.
Exclude tombstoned records from the runtime graph — build_runtime_graph
added every record/edge including soft-deleted / deduped ones, desyncing the
node count from active_records_count() (the cache-validity anchor) so the cache
was permanently invalid and every wake did a full rebuild on an oversized graph.
On a real store this is 9733 → 3612 nodes. The cached community assignment is
dropped only when the live set shrank (len(cached) > records_count, i.e.
tombstoning) so dead nodes are recomputed away; ordinary growth keeps the cached
assignment so a single insert is still absorbed by the staleness window without
an O(n²) recompute (regression-guarded by test_runtime_graph_cache).
A store-count failure must not read as empty — _store_is_empty() returned
True on any exception; a transient HippoIntegrityError (sqlite left in an
error state by a heavy reader) then parked the whole tick on a populated store.
The unknown case now returns False so the tick proceeds.
Reset last_tick_skipped_reason on a successful tick — a single early skip
left a healthy, draining daemon permanently reporting skip=empty_store in
.daemon-state.json. Cleared on any non-skipped tick.
Keep the idle countdown awake during active drains and RPC — the countdown
only read the Node wrapper heartbeat, so a stale heartbeat during a live
deferred-capture drain forced SLEEP (an EXCLUSIVE store lock) into the in-flight
drain. It now also folds capture.is_drain_in_progress() and recent RPC via a
pure, unit-tested _idle_countdown_decision helper. A genuinely idle daemon
still advances to DROWSY/SLEEP; explicit FORCE_SLEEP/user-sleep is unaffected.

Type of change

Affected areas

Testing

pytest passes locally — full default gate
(pytest -m "not perf and not slow and not live", 3538 tests), rebased on
v1.1.4: 3514 passed, 33 skipped, 1 xfailed, 1 failed. The single failure is
test_daemon_fdlimit_and_fsm.py::TestPlistRendersFdFloor::test_rendered_plist_contains_fd_floor,
which is pre-existing — it fails identically on a clean v1.1.4 checkout
(isolated, 0.13 s), so it is not introduced by this branch. (A set of
test_doctor_* rows is environment-flaky on the dev machine — subprocess.run
decoding a large system output as strict UTF-8 — and passes on a clean run.)
ruff check src/ tests/ — no new findings vs the v1.1.3 baseline on the
touched files (the repo ships no ruff config; pre-existing default-rule findings
are untouched).
New tests added for changed behaviour — single-flight collapse,
socket activity tracking (probe vs RPC), tombstone exclusion, store-empty
semantics, drain-aware idle countdown, and tick-flag observability.

Benchmarks

This PR changes daemon scheduling / CPU, not retrieval quality: recall is
independent of the community assignment, and no ranking or scoring path is
touched, so bench.* retrieval-quality numbers are unaffected by construction.

Live worst-case verification on a real ~4 k-record store (cold cache, all
subsystems forced at boot, default watchdog, status probe hammered for 90 s):

Before: CPU a continuous 90–300 % (9–12 cores during the storm); watchdog
socket probe times out; periodic KeepAlive=Crashed relaunch loop.
After: probe latency 0.07–0.74 s throughout; 0 new forensic dumps, 0
wedge-kills; CPU 1–3 cores during the single compute, then 0 % at rest.

Notes for reviewers

Rebased onto v1.1.4 (upstream released v1.1.4 mid-review; our files are
disjoint from its changes, so the rebase was conflict-free). This branch expands
the original PR with the graph-tombstone, store-empty, tick-reason and
drain-aware-idle fixes that the same WAKE/idle failure mode needed.
Fixes Sleep daemon stuck in crisis_mode: cycle loops on a single step (never completes) while sleep-cycle --force works; served recall degrades to flat 1.000 / schema records #17 — this is the CPU-storm / "daemon treats itself as always-busy"
root cause you flagged as the follow-up tracked here. A companion PR handles the
consolidation-correctness half of Sleep daemon stuck in crisis_mode: cycle loops on a single step (never completes) while sleep-cycle --force works; served recall degrades to flat 1.000 / schema records #17 (crisis topology computed on a live,
tombstone-filtered graph; recall score clamp; reflection embedding; crash-mid-
SLEEP recovery) and is marked Relates to Sleep daemon stuck in crisis_mode: cycle loops on a single step (never completes) while sleep-cycle --force works; served recall degrades to flat 1.000 / schema records #17 so it does not auto-close the
issue ahead of this one.
Cooperates with the v1.1.3 crisis auto-expiry, does not duplicate it: the
crisis_mode_since_ts timeout clears a coherent crisis after 72 h; this PR
removes the false WAKE busy-loop that kept re-arming it.
CACHE_VERSION is intentionally not bumped (the window change already
invalidates old keys); a hygiene bump can follow.
Follow-up (noted, not blocking): sigma.compute_topology_snapshot recomputes
communities in-memory after build_runtime_graph and could reuse the assignment
already returned.

Marsu6996 · 2026-06-20T22:01:41Z

Converting to draft to hold the merge for now — not a problem with the change itself, just being cautious.

While investigating daemon stability on my setup I ran into a separate, not-yet-characterised behaviour: when the daemon is forced into WAKE, the graph-similarity computation (gemm/rayon in the native module) drives CPU very high (~9 cores). I stopped it before I could tell whether it eventually converges or is simply being cut short by the watchdog status probe (5s timeout, 3 strikes → self-SIGKILL), so I don't yet know if it's a real runaway or just heavy-but-bounded work killed too early.

Since this PR changes when the daemon hibernates vs. stays awake, I'd rather confirm it doesn't mask or interact with that WAKE-path behaviour before it lands. I'll re-mark it ready once I've measured it properly (let it run in WAKE with the watchdog relaxed and see if CPU settles). #19 is independent of this and stays ready.

CodeAbra · 2026-06-21T07:00:01Z

Thanks for digging into this, and sorry I went quiet. I ran into the same bug on my side and have been working through a fix since yesterday, which is why I was slow to reply. Your write-up lines up with what I was seeing.

Let me get my version landed first, then I'll go through your PR properly so your find gets credited. Really appreciate the careful report.

Marsu6996 · 2026-06-21T07:30:44Z

Thanks — glad it lines up with what you were seeing, and good that you've got a fix in flight too.

No problem at all landing yours first. Please take anything useful from this PR, and don't worry about attribution on my account — I'm mainly just glad the bug's getting closed.

One thing in case it's useful for sanity-checking your version: the bit I'd been unsure about — whether the WAKE CPU is a real runaway or just heavy-but-bounded work the watchdog cuts short — is settled; it's bounded. The trigger is the graph cache missing on essentially every wake (staleness window of 10), so several subsystems recompute the full community detection concurrently and starve the event loop until the status probe times out.

Happy to rebase/trim this onto your fix or close it in favour of yours, whatever's least work for you. Thanks for the quick reply.

…og status probes The sleep/consolidation pipeline defers whenever _interrupt_check reports recent activity. Two independent signals wrongly marked the daemon "active" on nearly every tick, so it never completed a cycle, never hibernated, and the wake-hook re-ran every 30s — a sustained ~200% CPU churn on any long-lived deployment: 1. _interrupt_check returned True whenever mcp_socket.active_connections > 0. Long-lived MCP clients hold their socket open permanently -> always True. 2. Even after removing (1), last_activity_ts was refreshed for EVERY inbound socket line — including the watchdog's own {"type": "status"} liveness probe sent every 7-30s (daemon/_watchdog.py::_probe_status_roundtrip). So the 30s-activity window never elapsed. Fix: _interrupt_check keys off last_activity_ts recency only, and SocketServer refreshes last_activity_ts only for dispatched JSON-RPC method calls (real recall/capture traffic), never for control-plane messages. A busy burst still defers consolidation; a 30s lull now lets the cycle finish and the daemon hibernate. Adds tests/test_socket_activity_tracking.py locking in that a status probe does not count as activity while a real method call does. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ute storm At WAKE several background subsystems (boot preload, sigma identity audit, foraging weak-bridge detection, hippea cascade) each call build_runtime_graph concurrently in their own asyncio.to_thread workers. On a cache miss each one independently ran the full, GIL-bound community detection (mosaic). Three+ at once contended for the GIL, starved the asyncio event loop, and the liveness watchdog's socket probe timed out -> SIGKILL -> launchd relaunch -> loop. Wrap build_runtime_graph in a single-flight gate keyed on the cache key: the first caller (leader) computes and saves the on-disk cache; concurrent callers (followers) wait on an Event and then reload the freshly-saved cache via the existing cheap path. No mutable MemoryGraph is shared between callers (each rebuilds its own shell + single-slot sync hook), and recall is independent of the community assignment, so a slightly-stale shared result is harmless. Followers re-contend in a bounded loop rather than recomputing unconditionally: if the leader fails before saving, the cache key shifts mid-burst, or the wait times out, the woken followers loop back and exactly one becomes the next leader while the rest wait again — degrading those edge cases to sequential single-flight (one compute at a time) instead of an N-way concurrent re-storm. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The cache key buckets on records//WINDOW and edges//WINDOW and try_load requires an exact match. With WINDOW=10 a normal day of capture (+~150 records, +~1300 edges) crossed ~130 buckets, so the on-disk graph cache MISSED on essentially every WAKE and the full community detection was recomputed each time. Edges churn fastest, so they are the binding term. WINDOW=250 keeps the cache valid across a normal day, so the common WAKE is now a cheap cache HIT. The independent age/dirty fuse in consult_overlay (25h / dirty>50) remains the real freshness backstop, and the single-flight gate makes the rare genuine miss harmless. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

_boot_preload called build_runtime_graph (which already persists the cache, with the full node_payload, on a miss) and then called save(..., node_payload= None, ...) again, overwriting the good cache with a payload-less one. That forced a pandas re-read of every record on the next cache hit. Just warm the cache via build_runtime_graph and drop the redundant second save. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The daemon only enters WAKE at boot if ~/.iai-mcp/wake.signal exists, but nothing ever created it — WakeHandler only consumed it. The CLI start/install path (and the operator's capture hook) brought the daemon up with a plain launchctl kickstart, so it re-read its persisted HIBERNATION state and hibernate-exited within a tick, closing the socket before it ever served recall. Add WakeHandler.signal_wake() (symmetric to consume_wake_signal) and create the signal before the kickstart in daemon install/start, so the booting daemon transitions HIBERNATION -> WAKE and serves its socket. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

build_runtime_graph added every record (and every edge) to the graph, including soft-deleted / deduped / erased records (tombstoned_at IS NOT NULL). That polluted communities, centrality, rich_club and the sigma topology audit with dead nodes, and -- worse -- desynced the node count from store.active_records_count() (the payload-cache validity anchor), so after any tombstoning (e.g. migrate --dedupe-episodic) the cache was permanently invalid and every WAKE did a full rebuild on an over-large graph. Skip tombstoned rows in the node loop (matching active_records_count: tombstoned_at IS NULL), skip edges whose endpoints are not live nodes (add_edge does setdefault on both endpoints, so it would re-create dead nodes), and drop the cached assignment/rich_club when the live node set changed so they are recomputed on the fresh graph instead of referencing dead nodes. On a real store this took the graph from 9733 nodes to 3612, rich_club from 974 to 362, and restored payload-cache hits across builds.

_store_is_empty() caught (OSError, ValueError, KeyError, RuntimeError) and returned True. All Hippo store errors (HippoIntegrityError, HippoLockHeldError, ConsolidationPendingError, HippoDecryptError) subclass RuntimeError, and count_rows() raises HippoIntegrityError when the shared sqlite connection is left in an error state by a concurrent heavy reader. Returning True there parks the whole lifecycle tick (no idle-check, no drain) on a store that actually has records. Treat the unknown case as NOT empty so the tick proceeds; a truly empty store just does a little harmless no-op work.

The field was only ever set (on the empty_store/paused skip paths), never cleared, so a single early skip (e.g. a first-tick count race at boot) left a healthy, ticking, draining daemon permanently reporting skip=empty_store in .daemon-state.json — misleading observability that reads as a parked lifecycle.

The lifecycle idle countdown only refreshed `_last_active_monotonic` when the Node wrapper heartbeat file was fresh (`HeartbeatScanner.is_active`). The wrappers dir can be empty (heartbeat stale) while the daemon is still draining a continuously-fed deferred-capture backlog. In that state the idle timer grew unconditionally and the FSM forced itself to SLEEP after 30 min even though drain threads were still writing to the store. Entering the SLEEP pipeline escalates to an EXCLUSIVE store lock, so this contended with the in-flight drain; and because crisis re-arming only runs in SLEEP, an oscillating/never-settling daemon could silently stop re-arming crisis detection. Fold two more activity signals into the idle countdown, alongside the wrapper heartbeat: - in-flight drain state: `capture.is_drain_in_progress()`, a thread-safe depth counter set by `drain_deferred_captures` / `drain_active_live_captures` for their whole duration; - recent real RPC traffic: `mcp_socket.last_activity_ts` (already used by the sleep-pipeline interrupt check, now also by the countdown). The decision is centralized in a pure, unit-testable helper `_idle_countdown_decision`. A genuinely idle daemon still advances to DROWSY/SLEEP exactly as before, so crisis re-arming keeps running; only an actively-working daemon is held awake. Explicit FORCE_SLEEP/user-sleep requests are unaffected. Add tests asserting the daemon does NOT advance toward SLEEP while a drain is in progress (or RPC is recent), that a truly idle daemon still sleeps, and that the in-progress flag is set across the production drain wrappers and released on exception. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Covers 2cffb35: a successful tick clears a stale last_tick_skipped_reason, plus the paused-skip event/persistence and the no-run_rem_cycle routing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

CodeAbra

Thanks — this is the WAKE/idle stability fix that ends the consolidation CPU storm and serves recall on wake. Reviewed the diff (socket-on-wake + interrupt-check tightening), security-clean, CI green. Merging with credit to you.

Marsu6996 marked this pull request as draft June 20, 2026 22:01

Marsu6996 marked this pull request as ready for review June 21, 2026 06:25

Marsu6996 changed the title ~~fix(daemon): hibernate when idle — ignore open connections and watchdog status probes~~ fix(daemon): WAKE stability — collapse the concurrent-recompute CPU storm + serve the socket on wake Jun 21, 2026

CodeAbra mentioned this pull request Jun 21, 2026

Sleep daemon stuck in crisis_mode: cycle loops on a single step (never completes) while sleep-cycle --force works; served recall degrades to flat 1.000 / schema records #17

Closed

Marsu6996 and others added 10 commits June 21, 2026 17:02

test(daemon): guard tick-flag observability and the skip-reason reset

d03e024

Covers 2cffb35: a successful tick clears a stale last_tick_skipped_reason, plus the paused-skip event/persistence and the no-run_rem_cycle routing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Marsu6996 force-pushed the fix/daemon-hibernate-when-idle branch from 52ffba5 to 6bfe1f8 Compare June 21, 2026 15:33

Marsu6996 changed the title ~~fix(daemon): WAKE stability — collapse the concurrent-recompute CPU storm + serve the socket on wake~~ fix(daemon): WAKE/idle stability — end the consolidation CPU storm and serve recall on wake Jun 21, 2026

Marsu6996 force-pushed the fix/daemon-hibernate-when-idle branch from 6bfe1f8 to d03e024 Compare June 21, 2026 18:23

CodeAbra approved these changes Jun 22, 2026

View reviewed changes

CodeAbra merged commit fb93ace into CodeAbra:main Jun 22, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(daemon): WAKE/idle stability — end the consolidation CPU storm and serve recall on wake#20

fix(daemon): WAKE/idle stability — end the consolidation CPU storm and serve recall on wake#20
CodeAbra merged 10 commits into
CodeAbra:mainfrom
Marsu6996:fix/daemon-hibernate-when-idle

Marsu6996 commented Jun 20, 2026 •

edited

Loading

Uh oh!

Marsu6996 commented Jun 20, 2026

Uh oh!

CodeAbra commented Jun 21, 2026

Uh oh!

Marsu6996 commented Jun 21, 2026

Uh oh!

CodeAbra left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Marsu6996 commented Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed (one logical change per commit)

Type of change

Affected areas

Testing

Benchmarks

Notes for reviewers

Uh oh!

Marsu6996 commented Jun 20, 2026

Uh oh!

CodeAbra commented Jun 21, 2026

Uh oh!

Marsu6996 commented Jun 21, 2026

Uh oh!

CodeAbra left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Marsu6996 commented Jun 20, 2026 •

edited

Loading