fix(daemon): WAKE/idle stability — end the consolidation CPU storm and serve recall on wake#20
Conversation
|
Converting to draft to hold the merge for now — not a problem with the change itself, just being cautious. While investigating daemon stability on my setup I ran into a separate, not-yet-characterised behaviour: when the daemon is forced into WAKE, the graph-similarity computation (gemm/rayon in the native module) drives CPU very high (~9 cores). I stopped it before I could tell whether it eventually converges or is simply being cut short by the watchdog status probe (5s timeout, 3 strikes → self-SIGKILL), so I don't yet know if it's a real runaway or just heavy-but-bounded work killed too early. Since this PR changes when the daemon hibernates vs. stays awake, I'd rather confirm it doesn't mask or interact with that WAKE-path behaviour before it lands. I'll re-mark it ready once I've measured it properly (let it run in WAKE with the watchdog relaxed and see if CPU settles). #19 is independent of this and stays ready. |
|
Thanks for digging into this, and sorry I went quiet. I ran into the same bug on my side and have been working through a fix since yesterday, which is why I was slow to reply. Your write-up lines up with what I was seeing. Let me get my version landed first, then I'll go through your PR properly so your find gets credited. Really appreciate the careful report. |
|
Thanks — glad it lines up with what you were seeing, and good that you've got a fix in flight too. No problem at all landing yours first. Please take anything useful from this PR, and don't worry about attribution on my account — I'm mainly just glad the bug's getting closed. One thing in case it's useful for sanity-checking your version: the bit I'd been unsure about — whether the WAKE CPU is a real runaway or just heavy-but-bounded work the watchdog cuts short — is settled; it's bounded. The trigger is the graph cache missing on essentially every wake (staleness window of 10), so several subsystems recompute the full community detection concurrently and starve the event loop until the status probe times out. Happy to rebase/trim this onto your fix or close it in favour of yours, whatever's least work for you. Thanks for the quick reply. |
…og status probes
The sleep/consolidation pipeline defers whenever _interrupt_check reports recent
activity. Two independent signals wrongly marked the daemon "active" on nearly
every tick, so it never completed a cycle, never hibernated, and the wake-hook
re-ran every 30s — a sustained ~200% CPU churn on any long-lived deployment:
1. _interrupt_check returned True whenever mcp_socket.active_connections > 0.
Long-lived MCP clients hold their socket open permanently -> always True.
2. Even after removing (1), last_activity_ts was refreshed for EVERY inbound
socket line — including the watchdog's own {"type": "status"} liveness probe
sent every 7-30s (daemon/_watchdog.py::_probe_status_roundtrip). So the
30s-activity window never elapsed.
Fix: _interrupt_check keys off last_activity_ts recency only, and SocketServer
refreshes last_activity_ts only for dispatched JSON-RPC method calls (real
recall/capture traffic), never for control-plane messages. A busy burst still
defers consolidation; a 30s lull now lets the cycle finish and the daemon
hibernate.
Adds tests/test_socket_activity_tracking.py locking in that a status probe does
not count as activity while a real method call does.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ute storm At WAKE several background subsystems (boot preload, sigma identity audit, foraging weak-bridge detection, hippea cascade) each call build_runtime_graph concurrently in their own asyncio.to_thread workers. On a cache miss each one independently ran the full, GIL-bound community detection (mosaic). Three+ at once contended for the GIL, starved the asyncio event loop, and the liveness watchdog's socket probe timed out -> SIGKILL -> launchd relaunch -> loop. Wrap build_runtime_graph in a single-flight gate keyed on the cache key: the first caller (leader) computes and saves the on-disk cache; concurrent callers (followers) wait on an Event and then reload the freshly-saved cache via the existing cheap path. No mutable MemoryGraph is shared between callers (each rebuilds its own shell + single-slot sync hook), and recall is independent of the community assignment, so a slightly-stale shared result is harmless. Followers re-contend in a bounded loop rather than recomputing unconditionally: if the leader fails before saving, the cache key shifts mid-burst, or the wait times out, the woken followers loop back and exactly one becomes the next leader while the rest wait again — degrading those edge cases to sequential single-flight (one compute at a time) instead of an N-way concurrent re-storm. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The cache key buckets on records//WINDOW and edges//WINDOW and try_load requires an exact match. With WINDOW=10 a normal day of capture (+~150 records, +~1300 edges) crossed ~130 buckets, so the on-disk graph cache MISSED on essentially every WAKE and the full community detection was recomputed each time. Edges churn fastest, so they are the binding term. WINDOW=250 keeps the cache valid across a normal day, so the common WAKE is now a cheap cache HIT. The independent age/dirty fuse in consult_overlay (25h / dirty>50) remains the real freshness backstop, and the single-flight gate makes the rare genuine miss harmless. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
_boot_preload called build_runtime_graph (which already persists the cache, with the full node_payload, on a miss) and then called save(..., node_payload= None, ...) again, overwriting the good cache with a payload-less one. That forced a pandas re-read of every record on the next cache hit. Just warm the cache via build_runtime_graph and drop the redundant second save. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The daemon only enters WAKE at boot if ~/.iai-mcp/wake.signal exists, but nothing ever created it — WakeHandler only consumed it. The CLI start/install path (and the operator's capture hook) brought the daemon up with a plain launchctl kickstart, so it re-read its persisted HIBERNATION state and hibernate-exited within a tick, closing the socket before it ever served recall. Add WakeHandler.signal_wake() (symmetric to consume_wake_signal) and create the signal before the kickstart in daemon install/start, so the booting daemon transitions HIBERNATION -> WAKE and serves its socket. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
build_runtime_graph added every record (and every edge) to the graph, including soft-deleted / deduped / erased records (tombstoned_at IS NOT NULL). That polluted communities, centrality, rich_club and the sigma topology audit with dead nodes, and -- worse -- desynced the node count from store.active_records_count() (the payload-cache validity anchor), so after any tombstoning (e.g. migrate --dedupe-episodic) the cache was permanently invalid and every WAKE did a full rebuild on an over-large graph. Skip tombstoned rows in the node loop (matching active_records_count: tombstoned_at IS NULL), skip edges whose endpoints are not live nodes (add_edge does setdefault on both endpoints, so it would re-create dead nodes), and drop the cached assignment/rich_club when the live node set changed so they are recomputed on the fresh graph instead of referencing dead nodes. On a real store this took the graph from 9733 nodes to 3612, rich_club from 974 to 362, and restored payload-cache hits across builds.
_store_is_empty() caught (OSError, ValueError, KeyError, RuntimeError) and returned True. All Hippo store errors (HippoIntegrityError, HippoLockHeldError, ConsolidationPendingError, HippoDecryptError) subclass RuntimeError, and count_rows() raises HippoIntegrityError when the shared sqlite connection is left in an error state by a concurrent heavy reader. Returning True there parks the whole lifecycle tick (no idle-check, no drain) on a store that actually has records. Treat the unknown case as NOT empty so the tick proceeds; a truly empty store just does a little harmless no-op work.
The field was only ever set (on the empty_store/paused skip paths), never cleared, so a single early skip (e.g. a first-tick count race at boot) left a healthy, ticking, draining daemon permanently reporting skip=empty_store in .daemon-state.json — misleading observability that reads as a parked lifecycle.
The lifecycle idle countdown only refreshed `_last_active_monotonic` when the Node wrapper heartbeat file was fresh (`HeartbeatScanner.is_active`). The wrappers dir can be empty (heartbeat stale) while the daemon is still draining a continuously-fed deferred-capture backlog. In that state the idle timer grew unconditionally and the FSM forced itself to SLEEP after 30 min even though drain threads were still writing to the store. Entering the SLEEP pipeline escalates to an EXCLUSIVE store lock, so this contended with the in-flight drain; and because crisis re-arming only runs in SLEEP, an oscillating/never-settling daemon could silently stop re-arming crisis detection. Fold two more activity signals into the idle countdown, alongside the wrapper heartbeat: - in-flight drain state: `capture.is_drain_in_progress()`, a thread-safe depth counter set by `drain_deferred_captures` / `drain_active_live_captures` for their whole duration; - recent real RPC traffic: `mcp_socket.last_activity_ts` (already used by the sleep-pipeline interrupt check, now also by the countdown). The decision is centralized in a pure, unit-testable helper `_idle_countdown_decision`. A genuinely idle daemon still advances to DROWSY/SLEEP exactly as before, so crisis re-arming keeps running; only an actively-working daemon is held awake. Explicit FORCE_SLEEP/user-sleep requests are unaffected. Add tests asserting the daemon does NOT advance toward SLEEP while a drain is in progress (or RPC is recent), that a truly idle daemon still sleeps, and that the in-progress flag is set across the production drain wrappers and released on exception. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Covers 2cffb35: a successful tick clears a stale last_tick_skipped_reason, plus the paused-skip event/persistence and the no-run_rem_cycle routing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
52ffba5 to
6bfe1f8
Compare
6bfe1f8 to
d03e024
Compare
CodeAbra
left a comment
There was a problem hiding this comment.
Thanks — this is the WAKE/idle stability fix that ends the consolidation CPU storm and serves recall on wake. Reviewed the diff (socket-on-wake + interrupt-check tightening), security-clean, CI green. Merging with credit to you.
Summary
At WAKE the daemon never actually idled. The consolidation interrupt-check
treated an open MCP socket — and the liveness watchdog's own
{"type":"status"}probe — as "activity", so the sleep cycle never completed and the wake hook
re-fired every ~30 s. On top of that, on a graph-cache miss every background
subsystem (boot preload, sigma audit, foraging, the hippea cascade) recomputed
the full community graph concurrently under the GIL until the liveness
watchdog's socket probe timed out, the daemon SIGKILLed itself, and launchd
relaunched it — a self-sustaining CPU storm. This is the root cause the
maintainer deferred from #17 to this PR. The daemon now genuinely hibernates
when idle, computes the runtime graph once per wake, and reliably serves its
socket on wake; CPU at rest drops from a continuous 90–300 % to 0 %.
This supersedes the original "hibernate when idle" change (kept as the first
commit) and is rebased onto
v1.1.4.What changed (one logical change per commit)
last_activity_tsrecency only, and the socket server refreshes
last_activity_tssolely fordispatched JSON-RPC method calls (real recall/capture), never for control-plane
status probes. This is the core "always-busy" defect.
build_runtime_graph— concurrent WAKE callers collapse intoone compute (leader computes + saves the cache, followers reload it cheaply),
with a bounded re-contend loop that degrades a leader failure / key-shift to
sequential single-flight, never an N-way re-storm. Recall is independent of
the community assignment, so a slightly-stale shared result is safe.
(+~150 records / +~1300 edges) crossed ~130 key buckets at WINDOW=10, missing
the cache on essentially every WAKE. The 25 h / dirty>50 fuse in
consult_overlayremains the real freshness backstop.node_payload=Noneone, forcing a pandas re-read of every record on the next hit.WakeHandler.signal_wake()(symmetric toconsume_wake_signal) + a best-effort signal before the launchctl kickstart, soa hibernating daemon enters WAKE and serves recall instead of hibernate-exiting
within a tick. Becomes essential once hibernation actually works.
build_runtime_graphadded every record/edge including soft-deleted / deduped ones, desyncing the
node count from
active_records_count()(the cache-validity anchor) so the cachewas permanently invalid and every wake did a full rebuild on an oversized graph.
On a real store this is 9733 → 3612 nodes. The cached community assignment is
dropped only when the live set shrank (
len(cached) > records_count, i.e.tombstoning) so dead nodes are recomputed away; ordinary growth keeps the cached
assignment so a single insert is still absorbed by the staleness window without
an O(n²) recompute (regression-guarded by
test_runtime_graph_cache)._store_is_empty()returnedTrueon any exception; a transientHippoIntegrityError(sqlite left in anerror state by a heavy reader) then parked the whole tick on a populated store.
The unknown case now returns
Falseso the tick proceeds.last_tick_skipped_reasonon a successful tick — a single early skipleft a healthy, draining daemon permanently reporting
skip=empty_storein.daemon-state.json. Cleared on any non-skipped tick.only read the Node wrapper heartbeat, so a stale heartbeat during a live
deferred-capture drain forced SLEEP (an EXCLUSIVE store lock) into the in-flight
drain. It now also folds
capture.is_drain_in_progress()and recent RPC via apure, unit-tested
_idle_countdown_decisionhelper. A genuinely idle daemonstill advances to DROWSY/SLEEP; explicit FORCE_SLEEP/user-sleep is unaffected.
Type of change
Affected areas
Testing
pytestpasses locally — full default gate(
pytest -m "not perf and not slow and not live", 3538 tests), rebased onv1.1.4: 3514 passed, 33 skipped, 1 xfailed, 1 failed. The single failure istest_daemon_fdlimit_and_fsm.py::TestPlistRendersFdFloor::test_rendered_plist_contains_fd_floor,which is pre-existing — it fails identically on a clean
v1.1.4checkout(isolated, 0.13 s), so it is not introduced by this branch. (A set of
test_doctor_*rows is environment-flaky on the dev machine —subprocess.rundecoding a large system output as strict UTF-8 — and passes on a clean run.)
ruff check src/ tests/— no new findings vs thev1.1.3baseline on thetouched files (the repo ships no ruff config; pre-existing default-rule findings
are untouched).
socket activity tracking (probe vs RPC), tombstone exclusion, store-empty
semantics, drain-aware idle countdown, and tick-flag observability.
Benchmarks
This PR changes daemon scheduling / CPU, not retrieval quality: recall is
independent of the community assignment, and no ranking or scoring path is
touched, so
bench.*retrieval-quality numbers are unaffected by construction.Live worst-case verification on a real ~4 k-record store (cold cache, all
subsystems forced at boot, default watchdog, status probe hammered for 90 s):
socket probe times out; periodic
KeepAlive=Crashedrelaunch loop.wedge-kills; CPU 1–3 cores during the single compute, then 0 % at rest.
Notes for reviewers
v1.1.4(upstream released v1.1.4 mid-review; our files aredisjoint from its changes, so the rebase was conflict-free). This branch expands
the original PR with the graph-tombstone, store-empty, tick-reason and
drain-aware-idle fixes that the same WAKE/idle failure mode needed.
sleep-cycle --forceworks; served recall degrades to flat 1.000 / schema records #17 — this is the CPU-storm / "daemon treats itself as always-busy"root cause you flagged as the follow-up tracked here. A companion PR handles the
consolidation-correctness half of Sleep daemon stuck in crisis_mode: cycle loops on a single step (never completes) while
sleep-cycle --forceworks; served recall degrades to flat 1.000 / schema records #17 (crisis topology computed on a live,tombstone-filtered graph; recall score clamp; reflection embedding; crash-mid-
SLEEP recovery) and is marked Relates to Sleep daemon stuck in crisis_mode: cycle loops on a single step (never completes) while
sleep-cycle --forceworks; served recall degrades to flat 1.000 / schema records #17 so it does not auto-close theissue ahead of this one.
crisis_mode_since_tstimeout clears a coherent crisis after 72 h; this PRremoves the false WAKE busy-loop that kept re-arming it.
CACHE_VERSIONis intentionally not bumped (the window change alreadyinvalidates old keys); a hygiene bump can follow.
sigma.compute_topology_snapshotrecomputescommunities in-memory after
build_runtime_graphand could reuse the assignmentalready returned.