Summary
On the launchd sleep daemon, the consolidation cycle gets stuck in crisis_mode: true and re-starts a single step every ~37s without ever completing it. The looping step varies across daemon restarts — observed in sequence: KNOB_TUNE, then HIPPO_CLEANUP, then SCHEMA_MINE. CPU sits >100% for hours.
Crucially, iai-mcp maintenance sleep-cycle --force runs all 13 steps in ~2s and clears crisis_mode. So the step logic itself works; something in the daemon's scheduling/retry path never advances past chunk_idx=0.
While stuck, recall served by the daemon degrades to flat 1.000 scores dominated by Schema: tags:… records — even for cues whose content is definitely in the store.
Environment
iai 1.1.2 (latest), hnswlib 0.8.0, Python 3.11.15, macOS 26.5.1
- launchd daemon
com.iai-mcp.daemon
Symptoms
lifecycle_state.json:
"current_state": "SLEEP",
"sleep_cycle_progress": { "attempt": 0, "last_error": "deferred:step=SCHEMA_MINE:chunk_idx=0" },
"crisis_mode": true
sleep_step_started vs sleep_step_completed counts from a single day's lifecycle-events-*.jsonl:
| step |
started |
completed |
KNOB_TUNE |
407 |
2 |
HIPPO_CLEANUP |
530 |
2 |
SCHEMA_MINE |
678 |
2 |
(the 2 completions per step are manual --force runs; the daemon itself never completes them.)
There is no traceback in ~/.iai-mcp/logs/launchd-stderr.log for the looping step — the failure is "deferred" (deferred:step=…:chunk_idx=0), so the standard logs contain nothing actionable about why the step fails.
Key signal: daemon vs. --force
iai-mcp maintenance sleep-cycle --force --reset-quarantine → all 13 steps ok in ~1.7–2.4s, crisis_mode: false.
- The daemon on its normal schedule re-enters the loop within the same day, on a different step each time.
Recall degradation while stuck
- Recall via the daemon (in
crisis_mode): flat 1.000 scores, results dominated by Schema: pattern records, regardless of cue.
- Direct/offline recall (daemon stopped, fresh index from a forced cycle): discriminating scores (~0.73–0.78) with real content.
This suggests the looping SCHEMA_MINE (which mints schema records) + the persistent crisis_mode are what degrade served recall.
Possibly related
Questions / asks
- What does
deferred:step=X:chunk_idx=0 represent, and why would a step loop on the daemon yet complete under --force?
- Is there a retry/quarantine path that re-arms the step without surfacing the underlying exception? Could the deferred failure be logged with a traceback?
- Is the flat-
1.000 / schema-dominated recall an intended degraded mode under crisis_mode, or a side effect?
Happy to attach more logs or test a patch.
Summary
On the launchd sleep daemon, the consolidation cycle gets stuck in
crisis_mode: trueand re-starts a single step every ~37s without ever completing it. The looping step varies across daemon restarts — observed in sequence:KNOB_TUNE, thenHIPPO_CLEANUP, thenSCHEMA_MINE. CPU sits >100% for hours.Crucially,
iai-mcp maintenance sleep-cycle --forceruns all 13 steps in ~2s and clearscrisis_mode. So the step logic itself works; something in the daemon's scheduling/retry path never advances pastchunk_idx=0.While stuck, recall served by the daemon degrades to flat
1.000scores dominated bySchema: tags:…records — even for cues whose content is definitely in the store.Environment
iai1.1.2 (latest),hnswlib0.8.0, Python 3.11.15, macOS 26.5.1com.iai-mcp.daemonSymptoms
lifecycle_state.json:sleep_step_startedvssleep_step_completedcounts from a single day'slifecycle-events-*.jsonl:KNOB_TUNEHIPPO_CLEANUPSCHEMA_MINE(the 2 completions per step are manual
--forceruns; the daemon itself never completes them.)There is no traceback in
~/.iai-mcp/logs/launchd-stderr.logfor the looping step — the failure is "deferred" (deferred:step=…:chunk_idx=0), so the standard logs contain nothing actionable about why the step fails.Key signal: daemon vs.
--forceiai-mcp maintenance sleep-cycle --force --reset-quarantine→ all 13 stepsokin ~1.7–2.4s,crisis_mode: false.Recall degradation while stuck
crisis_mode): flat1.000scores, results dominated bySchema:pattern records, regardless of cue.This suggests the looping
SCHEMA_MINE(which mints schema records) + the persistentcrisis_modeare what degrade served recall.Possibly related
permanent-failedfiles, no bulk import.hippo_hnsw_rebuild_failed(load_indexdroppingallow_replace_deleted) was the originalKNOB_TUNEblocker. After that fixKNOB_TUNEno longer raises it, but the daemon cycle still fails to advance — now surfacing on later steps.Questions / asks
deferred:step=X:chunk_idx=0represent, and why would a step loop on the daemon yet complete under--force?1.000/ schema-dominated recall an intended degraded mode undercrisis_mode, or a side effect?Happy to attach more logs or test a patch.