fix(daemon): WAKE/idle stability — end the consolidation CPU storm and serve recall on wake #20

Merged
CodeAbra merged 10 commits into CodeAbra:main from Marsu6996:fix/daemon-hibernate-when-idle
Jun 22, 2026

Conversation

@Marsu6996

@Marsu6996 Marsu6996 commented Jun 20, 2026

Contributor

Summary

At WAKE the daemon never actually idled. The consolidation interrupt-check
treated an open MCP socket — and the liveness watchdog's own {"type":"status"}
probe — as "activity", so the sleep cycle never completed and the wake hook
re-fired every ~30 s. On top of that, on a graph-cache miss every background
subsystem (boot preload, sigma audit, foraging, the hippea cascade) recomputed
the full community graph concurrently under the GIL until the liveness
watchdog's socket probe timed out, the daemon SIGKILLed itself, and launchd
relaunched it — a self-sustaining CPU storm. This is the root cause the
maintainer deferred from #17 to this PR. The daemon now genuinely hibernates
when idle, computes the runtime graph once per wake, and reliably serves its
socket on wake; CPU at rest drops from a continuous 90–300 % to 0 %.

This supersedes the original "hibernate when idle" change (kept as the first
commit) and is rebased onto v1.1.4.

What changed (one logical change per commit)

  • Hibernate when idle — the interrupt-check keys off last_activity_ts
    recency only, and the socket server refreshes last_activity_ts solely for
    dispatched JSON-RPC method calls (real recall/capture), never for control-plane
    status probes. This is the core "always-busy" defect.
  • Single-flight build_runtime_graph — concurrent WAKE callers collapse into
    one compute (leader computes + saves the cache, followers reload it cheaply),
    with a bounded re-contend loop that degrades a leader failure / key-shift to
    sequential single-flight, never an N-way re-storm. Recall is independent of
    the community assignment, so a slightly-stale shared result is safe.
  • Widen the cache staleness window 10 → 250 — a normal day of capture
    (+~150 records / +~1300 edges) crossed ~130 key buckets at WINDOW=10, missing
    the cache on essentially every WAKE. The 25 h / dirty>50 fuse in
    consult_overlay remains the real freshness backstop.
  • Drop the boot-preload double-save that overwrote the good cache with a
    node_payload=None one, forcing a pandas re-read of every record on the next hit.
  • Signal wake before kickstart: WakeHandler.signal_wake() (symmetric to
    consume_wake_signal) + a best-effort signal before the launchctl kickstart, so
    a hibernating daemon enters WAKE and serves recall instead of hibernate-exiting
    within a tick. Becomes essential once hibernation actually works.
  • Exclude tombstoned records from the runtime graph: build_runtime_graph
    added every record/edge including soft-deleted / deduped ones, desyncing the
    node count from active_records_count() (the cache-validity anchor) so the cache
    was permanently invalid and every wake did a full rebuild on an oversized graph.
    On a real store this is 9733 → 3612 nodes. The cached community assignment is
    dropped only when the live set shrank (len(cached) > records_count, i.e.
    tombstoning) so dead nodes are recomputed away; ordinary growth keeps the cached
    assignment so a single insert is still absorbed by the staleness window without
    an O(n²) recompute (regression-guarded by test_runtime_graph_cache).
  • A store-count failure must not read as empty: _store_is_empty() returned
    True on any exception; a transient HippoIntegrityError (sqlite left in an
    error state by a heavy reader) then parked the whole tick on a populated store.
    The unknown case now returns False so the tick proceeds.
  • Reset last_tick_skipped_reason on a successful tick — a single early skip
    left a healthy, draining daemon permanently reporting skip=empty_store in
    .daemon-state.json. Cleared on any non-skipped tick.
  • Keep the idle countdown awake during active drains and RPC — the countdown
    only read the Node wrapper heartbeat, so a stale heartbeat during a live
    deferred-capture drain forced SLEEP (an EXCLUSIVE store lock) into the in-flight
    drain. It now also folds capture.is_drain_in_progress() and recent RPC via a
    pure, unit-tested _idle_countdown_decision helper. A genuinely idle daemon
    still advances to DROWSY/SLEEP; explicit FORCE_SLEEP/user-sleep is unaffected.

Type of change

  • Bug fix
  • New feature
  • Refactor (no behaviour change)
  • Documentation
  • Build / tooling

Affected areas

  • Capture path
  • Recall / retrieval
  • Consolidation / sleep cycles
  • Daemon lifecycle / FSM
  • Storage / encryption at rest
  • MCP wrapper (TypeScript)
  • Bench harness
  • CLI / doctor
  • Other: ___

Testing

  • pytest passes locally — full default gate
    (pytest -m "not perf and not slow and not live", 3538 tests), rebased on
    v1.1.4: 3514 passed, 33 skipped, 1 xfailed, 1 failed. The single failure is
    test_daemon_fdlimit_and_fsm.py::TestPlistRendersFdFloor::test_rendered_plist_contains_fd_floor,
    which is pre-existing — it fails identically on a clean v1.1.4 checkout
    (isolated, 0.13 s), so it is not introduced by this branch. (A set of
    test_doctor_* rows is environment-flaky on the dev machine — subprocess.run
    decoding a large system output as strict UTF-8 — and passes on a clean run.)
  • ruff check src/ tests/ — no new findings vs the v1.1.3 baseline on the
    touched files (the repo ships no ruff config; pre-existing default-rule findings
    are untouched).
  • New tests added for changed behaviour — single-flight collapse,
    socket activity tracking (probe vs RPC), tombstone exclusion, store-empty
    semantics, drain-aware idle countdown, and tick-flag observability.

Benchmarks

This PR changes daemon scheduling / CPU, not retrieval quality: recall is
independent of the community assignment, and no ranking or scoring path is
touched, so bench.* retrieval-quality numbers are unaffected by construction.

Live worst-case verification on a real ~4 k-record store (cold cache, all
subsystems forced at boot, default watchdog, status probe hammered for 90 s):

  • Before: CPU a continuous 90–300 % (9–12 cores during the storm); watchdog
    socket probe times out; periodic KeepAlive=Crashed relaunch loop.
  • After: probe latency 0.07–0.74 s throughout; 0 new forensic dumps, 0
    wedge-kills; CPU 1–3 cores during the single compute, then 0 % at rest.

Notes for reviewers

@Marsu6996 Marsu6996 marked this pull request as draft June 20, 2026 22:01
@Marsu6996

Contributor Author

Converting to draft to hold the merge for now — not a problem with the change itself, just being cautious.

While investigating daemon stability on my setup I ran into a separate, not-yet-characterised behaviour: when the daemon is forced into WAKE, the graph-similarity computation (gemm/rayon in the native module) drives CPU very high (~9 cores). I stopped it before I could tell whether it eventually converges or is simply being cut short by the watchdog status probe (5s timeout, 3 strikes → self-SIGKILL), so I don't yet know if it's a real runaway or just heavy-but-bounded work killed too early.

Since this PR changes when the daemon hibernates vs. stays awake, I'd rather confirm it doesn't mask or interact with that WAKE-path behaviour before it lands. I'll re-mark it ready once I've measured it properly (let it run in WAKE with the watchdog relaxed and see if CPU settles). #19 is independent of this and stays ready.

@Marsu6996 Marsu6996 marked this pull request as ready for review June 21, 2026 06:25
@Marsu6996 Marsu6996 changed the title fix(daemon): hibernate when idle — ignore open connections and watchdog status probes fix(daemon): WAKE stability — collapse the concurrent-recompute CPU storm + serve the socket on wake Jun 21, 2026
@CodeAbra

Owner

Thanks for digging into this, and sorry I went quiet. I ran into the same bug on my side and have been working through a fix since yesterday, which is why I was slow to reply. Your write-up lines up with what I was seeing.

Let me get my version landed first, then I'll go through your PR properly so your find gets credited. Really appreciate the careful report.

@Marsu6996

Contributor Author

Thanks — glad it lines up with what you were seeing, and good that you've got a fix in flight too.

No problem at all landing yours first. Please take anything useful from this PR, and don't worry about attribution on my account — I'm mainly just glad the bug's getting closed.

One thing in case it's useful for sanity-checking your version: the bit I'd been unsure about — whether the WAKE CPU is a real runaway or just heavy-but-bounded work the watchdog cuts short — is settled; it's bounded. The trigger is the graph cache missing on essentially every wake (staleness window of 10), so several subsystems recompute the full community detection concurrently and starve the event loop until the status probe times out.

Happy to rebase/trim this onto your fix or close it in favour of yours, whatever's least work for you. Thanks for the quick reply.

Marsu6996 and others added 10 commits June 21, 2026 17:02
…og status probes

The sleep/consolidation pipeline defers whenever _interrupt_check reports recent
activity. Two independent signals wrongly marked the daemon "active" on nearly
every tick, so it never completed a cycle, never hibernated, and the wake-hook
re-ran every 30s — a sustained ~200% CPU churn on any long-lived deployment:

1. _interrupt_check returned True whenever mcp_socket.active_connections > 0.
   Long-lived MCP clients hold their socket open permanently -> always True.
2. Even after removing (1), last_activity_ts was refreshed for EVERY inbound
   socket line — including the watchdog's own {"type": "status"} liveness probe
   sent every 7-30s (daemon/_watchdog.py::_probe_status_roundtrip). So the
   30s-activity window never elapsed.

Fix: _interrupt_check keys off last_activity_ts recency only, and SocketServer
refreshes last_activity_ts only for dispatched JSON-RPC method calls (real
recall/capture traffic), never for control-plane messages. A busy burst still
defers consolidation; a 30s lull now lets the cycle finish and the daemon
hibernate.

Adds tests/test_socket_activity_tracking.py locking in that a status probe does
not count as activity while a real method call does.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
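The activity rule in the commit above can be sketched as follows. This is a minimal illustration, not the daemon's code: `ActivityTracker`, `on_message`, `interrupt_check` and `ACTIVITY_WINDOW_S` are hypothetical names, and the sketch assumes a JSON-RPC call is recognisable by its `method` member. Only `last_activity_ts`, the `{"type": "status"}` probe shape and the 30 s window come from the commit message.

```python
import json
import time

ACTIVITY_WINDOW_S = 30.0  # lull length quoted in the commit message


class ActivityTracker:
    """Hypothetical stand-in for the socket server's activity bookkeeping."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.last_activity_ts = None  # no real traffic seen yet

    def on_message(self, line: str) -> None:
        msg = json.loads(line)
        # Only dispatched JSON-RPC method calls count as activity.
        # Control-plane messages such as {"type": "status"} are ignored.
        if isinstance(msg, dict) and "method" in msg:
            self.last_activity_ts = self._clock()

    def interrupt_check(self) -> bool:
        """True while real traffic is recent enough to defer consolidation."""
        if self.last_activity_ts is None:
            return False
        return (self._clock() - self.last_activity_ts) < ACTIVITY_WINDOW_S
```

Note that an open connection does not appear anywhere in the decision: a client that holds its socket open but sends nothing for 30 s no longer blocks the sleep cycle.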
…ute storm

At WAKE several background subsystems (boot preload, sigma identity audit,
foraging weak-bridge detection, hippea cascade) each call build_runtime_graph
concurrently in their own asyncio.to_thread workers. On a cache miss each one
independently ran the full, GIL-bound community detection (mosaic). Three+ at
once contended for the GIL, starved the asyncio event loop, and the liveness
watchdog's socket probe timed out -> SIGKILL -> launchd relaunch -> loop.

Wrap build_runtime_graph in a single-flight gate keyed on the cache key: the
first caller (leader) computes and saves the on-disk cache; concurrent callers
(followers) wait on an Event and then reload the freshly-saved cache via the
existing cheap path. No mutable MemoryGraph is shared between callers (each
rebuilds its own shell + single-slot sync hook), and recall is independent of
the community assignment, so a slightly-stale shared result is harmless.

Followers re-contend in a bounded loop rather than recomputing unconditionally:
if the leader fails before saving, the cache key shifts mid-burst, or the wait
times out, the woken followers loop back and exactly one becomes the next
leader while the rest wait again — degrading those edge cases to sequential
single-flight (one compute at a time) instead of an N-way concurrent re-storm.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
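A single-flight gate with a bounded re-contend loop, as described in the commit above, might look roughly like this. It is a sketch under stated assumptions: `single_flight`, `key_fn`, `try_load`, `compute_and_save`, `wait_s` and `max_rounds` are hypothetical names, `compute_and_save` is assumed to persist its result before returning, and the final fallback after the bound is exhausted is an illustrative choice rather than the PR's documented behaviour.

```python
import threading

_lock = threading.Lock()
_inflight: dict = {}  # cache key -> Event set when the leader finishes


def single_flight(key_fn, try_load, compute_and_save, wait_s=60.0, max_rounds=8):
    """Hypothetical gate: one leader computes, followers reload the cache."""
    for _ in range(max_rounds):
        key = key_fn()  # re-read each round: the key may shift mid-burst
        cached = try_load(key)
        if cached is not None:
            return cached  # cheap path: cache hit
        with _lock:
            event = _inflight.get(key)
            leader = event is None
            if leader:
                event = _inflight[key] = threading.Event()
        if leader:
            try:
                # Double-check: a previous leader may have saved just now.
                cached = try_load(key)
                return cached if cached is not None else compute_and_save(key)
            finally:
                with _lock:
                    _inflight.pop(key, None)
                event.set()  # wake followers even if the compute failed
        else:
            event.wait(wait_s)  # follower: wake up, loop back, re-contend
    # Bound exhausted (assumed fallback): compute directly.
    return compute_and_save(key_fn())
```

Because followers loop back instead of computing on wake-up, a failed leader is replaced by exactly one new leader per round, which is the "sequential single-flight" degradation the commit message describes.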
The cache key buckets on records//WINDOW and edges//WINDOW and try_load
requires an exact match. With WINDOW=10 a normal day of capture (+~150
records, +~1300 edges) crossed ~130 buckets, so the on-disk graph cache MISSED
on essentially every WAKE and the full community detection was recomputed each
time. Edges churn fastest, so they are the binding term.

WINDOW=250 keeps the cache valid across a normal day, so the common WAKE is now
a cheap cache HIT. The independent age/dirty fuse in consult_overlay (25h /
dirty>50) remains the real freshness backstop, and the single-flight gate makes
the rare genuine miss harmless.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
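The bucketing arithmetic behind the window change can be checked with a small sketch. The function names and the starting counts (3600 records, 20000 edges) are hypothetical; only the `records//WINDOW` and `edges//WINDOW` key shape and the daily growth figures (+~150 records, +~1300 edges) come from the commit message.

```python
def cache_key(records: int, edges: int, window: int) -> tuple:
    """Bucketed key: counts that fall in the same bucket share a cache entry."""
    return (records // window, edges // window)


def buckets_crossed(before: int, after: int, window: int) -> int:
    """How many times the bucket index changes as a count grows."""
    return after // window - before // window


# Hypothetical starting point, plus one day of growth from the commit message.
edges_before, edges_after = 20000, 20000 + 1300
crossed_at_10 = buckets_crossed(edges_before, edges_after, 10)
crossed_at_250 = buckets_crossed(edges_before, edges_after, 250)
```

With these numbers the edge term changes bucket 130 times per day at WINDOW=10 but only 5 times at WINDOW=250, so two wakes a few hours apart are far more likely to share a key. The key can still change within a day, which is where the single-flight gate and the consult_overlay fuse take over.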
_boot_preload called build_runtime_graph (which already persists the cache,
with the full node_payload, on a miss) and then called save(..., node_payload=
None, ...) again, overwriting the good cache with a payload-less one. That
forced a pandas re-read of every record on the next cache hit. Just warm the
cache via build_runtime_graph and drop the redundant second save.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The daemon only enters WAKE at boot if ~/.iai-mcp/wake.signal exists, but
nothing ever created it — WakeHandler only consumed it. The CLI start/install
path (and the operator's capture hook) brought the daemon up with a plain
launchctl kickstart, so it re-read its persisted HIBERNATION state and
hibernate-exited within a tick, closing the socket before it ever served recall.

Add WakeHandler.signal_wake() (symmetric to consume_wake_signal) and create the
signal before the kickstart in daemon install/start, so the booting daemon
transitions HIBERNATION -> WAKE and serves its socket.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
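The produce/consume symmetry described above can be sketched with a plain signal file. This is an assumption-laden illustration: the real `WakeHandler` methods may differ in signature and behaviour, and the module-level functions and `DEFAULT_SIGNAL` constant here are hypothetical. Only the `~/.iai-mcp/wake.signal` path and the two method names come from the commit message.

```python
from pathlib import Path

# Path taken from the commit message; the constant name is hypothetical.
DEFAULT_SIGNAL = Path.home() / ".iai-mcp" / "wake.signal"


def signal_wake(path: Path = DEFAULT_SIGNAL) -> None:
    """Ask the next boot to enter WAKE by creating the signal file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()


def consume_wake_signal(path: Path = DEFAULT_SIGNAL) -> bool:
    """Return True once if a wake was requested, removing the signal."""
    try:
        path.unlink()
        return True
    except FileNotFoundError:
        return False
```

The caller would create the signal before running the launchctl kickstart, so the file already exists when the booting daemon checks for it.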
build_runtime_graph added every record (and every edge) to the graph,
including soft-deleted / deduped / erased records (tombstoned_at IS NOT NULL).
That polluted communities, centrality, rich_club and the sigma topology audit
with dead nodes, and -- worse -- desynced the node count from
store.active_records_count() (the payload-cache validity anchor), so after any
tombstoning (e.g. migrate --dedupe-episodic) the cache was permanently invalid
and every WAKE did a full rebuild on an over-large graph.

Skip tombstoned rows in the node loop (matching active_records_count:
tombstoned_at IS NULL), skip edges whose endpoints are not live nodes (add_edge
does setdefault on both endpoints, so it would re-create dead nodes), and drop
the cached assignment/rich_club when the live node set changed so they are
recomputed on the fresh graph instead of referencing dead nodes.
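The filtering described above can be sketched on plain tuples. The function names and the `(id, tombstoned_at)` record shape are hypothetical; the `keep_cached_assignment` rule follows the PR summary's `len(cached) > records_count` condition.

```python
def live_graph(records, edges):
    """records: iterable of (id, tombstoned_at); edges: iterable of (src, dst)."""
    # Keep only live rows, matching tombstoned_at IS NULL.
    nodes = {rid for rid, tombstoned_at in records if tombstoned_at is None}
    # Drop edges with a dead endpoint so that adding an edge can never
    # re-create a tombstoned node.
    kept = [(a, b) for a, b in edges if a in nodes and b in nodes]
    return nodes, kept


def keep_cached_assignment(cached_len: int, records_count: int) -> bool:
    """Drop the cached assignment only when the live set shrank."""
    return cached_len <= records_count
```

Ordinary growth leaves `cached_len` at or below `records_count`, so the cached assignment survives a single insert and only tombstoning forces a recompute.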

On a real store this took the graph from 9733 nodes to 3612, rich_club from 974
to 362, and restored payload-cache hits across builds.

_store_is_empty() caught (OSError, ValueError, KeyError, RuntimeError) and
returned True. All Hippo store errors (HippoIntegrityError, HippoLockHeldError,
ConsolidationPendingError, HippoDecryptError) subclass RuntimeError, and
count_rows() raises HippoIntegrityError when the shared sqlite connection is
left in an error state by a concurrent heavy reader. Returning True there parks
the whole lifecycle tick (no idle-check, no drain) on a store that actually has
records. Treat the unknown case as NOT empty so the tick proceeds; a truly empty
store just does a little harmless no-op work.

last_tick_skipped_reason was only ever set (on the empty_store/paused skip paths), never
cleared, so a single early skip (e.g. a first-tick count race at boot) left a
healthy, ticking, draining daemon permanently reporting skip=empty_store in
.daemon-state.json, which is misleading observability that reads as a parked lifecycle.
It is now cleared on any non-skipped tick.

The lifecycle idle countdown only refreshed `_last_active_monotonic` when
the Node wrapper heartbeat file was fresh (`HeartbeatScanner.is_active`).
The wrappers dir can be empty (heartbeat stale) while the daemon is still
draining a continuously-fed deferred-capture backlog. In that state the
idle timer grew unconditionally and the FSM forced itself to SLEEP after
30 min even though drain threads were still writing to the store. Entering
the SLEEP pipeline escalates to an EXCLUSIVE store lock, so this contended
with the in-flight drain; and because crisis re-arming only runs in SLEEP,
an oscillating/never-settling daemon could silently stop re-arming crisis
detection.

Fold two more activity signals into the idle countdown, alongside the
wrapper heartbeat:

- in-flight drain state: `capture.is_drain_in_progress()`, a thread-safe
  depth counter set by `drain_deferred_captures` / `drain_active_live_captures`
  for their whole duration;
- recent real RPC traffic: `mcp_socket.last_activity_ts` (already used by
  the sleep-pipeline interrupt check, now also by the countdown).

The decision is centralized in a pure, unit-testable helper
`_idle_countdown_decision`. A genuinely idle daemon still advances to
DROWSY/SLEEP exactly as before, so crisis re-arming keeps running; only an
actively-working daemon is held awake. Explicit FORCE_SLEEP/user-sleep
requests are unaffected.

Add tests asserting the daemon does NOT advance toward SLEEP while a drain
is in progress (or RPC is recent), that a truly idle daemon still sleeps,
and that the in-progress flag is set across the production drain wrappers
and released on exception.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
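A pure decision helper of the kind described above could take roughly this shape. The signature is a guess: the real `_idle_countdown_decision` may take different arguments, and the 30 s `rpc_window_s` default is an assumption borrowed from the interrupt-check window rather than a value stated for the countdown.

```python
from typing import Optional


def idle_countdown_decision(
    heartbeat_fresh: bool,
    drain_in_progress: bool,
    last_rpc_ts: Optional[float],
    now: float,
    rpc_window_s: float = 30.0,  # assumed window, not taken from the source
) -> bool:
    """Return True when the daemon counts as active, so the idle timer resets."""
    recent_rpc = last_rpc_ts is not None and (now - last_rpc_ts) < rpc_window_s
    # Any one signal is enough to hold the daemon awake.
    return heartbeat_fresh or drain_in_progress or recent_rpc
```

Keeping the helper free of I/O means each signal combination can be asserted directly, without standing up a heartbeat file or a live drain.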
Covers 2cffb35: a successful tick clears a stale last_tick_skipped_reason,
plus the paused-skip event/persistence and the no-run_rem_cycle routing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Marsu6996 Marsu6996 force-pushed the fix/daemon-hibernate-when-idle branch from 52ffba5 to 6bfe1f8 Compare June 21, 2026 15:33
@Marsu6996 Marsu6996 changed the title fix(daemon): WAKE stability — collapse the concurrent-recompute CPU storm + serve the socket on wake fix(daemon): WAKE/idle stability — end the consolidation CPU storm and serve recall on wake Jun 21, 2026
@Marsu6996 Marsu6996 force-pushed the fix/daemon-hibernate-when-idle branch from 6bfe1f8 to d03e024 Compare June 21, 2026 18:23

@CodeAbra CodeAbra left a comment

Owner

Thanks — this is the WAKE/idle stability fix that ends the consolidation CPU storm and serves recall on wake. Reviewed the diff (socket-on-wake + interrupt-check tightening), security-clean, CI green. Merging with credit to you.

@CodeAbra CodeAbra merged commit fb93ace into CodeAbra:main Jun 22, 2026
2 checks passed

2 participants