Skip to content

Sleep daemon stuck in crisis_mode: cycle loops on a single step (never completes) while sleep-cycle --force works; served recall degrades to flat 1.000 / schema records #17

Description

@Marsu6996

Summary

On the launchd sleep daemon, the consolidation cycle gets stuck in crisis_mode: true and re-starts a single step every ~37s without ever completing it. The looping step varies across daemon restarts — observed in sequence: KNOB_TUNE, then HIPPO_CLEANUP, then SCHEMA_MINE. CPU sits >100% for hours.

Crucially, iai-mcp maintenance sleep-cycle --force runs all 13 steps in ~2s and clears crisis_mode. So the step logic itself works; something in the daemon's scheduling/retry path never advances past chunk_idx=0.

While stuck, recall served by the daemon degrades to flat 1.000 scores dominated by Schema: tags:… records — even for cues whose content is definitely in the store.

Environment

  • iai 1.1.2 (latest), hnswlib 0.8.0, Python 3.11.15, macOS 26.5.1
  • launchd daemon com.iai-mcp.daemon

Symptoms

lifecycle_state.json:

"current_state": "SLEEP",
"sleep_cycle_progress": { "attempt": 0, "last_error": "deferred:step=SCHEMA_MINE:chunk_idx=0" },
"crisis_mode": true

sleep_step_started vs sleep_step_completed counts from a single day's lifecycle-events-*.jsonl:

step started completed
KNOB_TUNE 407 2
HIPPO_CLEANUP 530 2
SCHEMA_MINE 678 2

(the 2 completions per step are manual --force runs; the daemon itself never completes them.)

There is no traceback in ~/.iai-mcp/logs/launchd-stderr.log for the looping step — the failure is "deferred" (deferred:step=…:chunk_idx=0), so the standard logs contain nothing actionable about why the step fails.

Key signal: daemon vs. --force

  • iai-mcp maintenance sleep-cycle --force --reset-quarantine → all 13 steps ok in ~1.7–2.4s, crisis_mode: false.
  • The daemon on its normal schedule re-enters the loop within the same day, on a different step each time.

Recall degradation while stuck

  • Recall via the daemon (in crisis_mode): flat 1.000 scores, results dominated by Schema: pattern records, regardless of cue.
  • Direct/offline recall (daemon stopped, fresh index from a forced cycle): discriminating scores (~0.73–0.78) with real content.

This suggests the looping SCHEMA_MINE (which mints schema records) + the persistent crisis_mode are what degrade served recall.

Possibly related

Questions / asks

  1. What does deferred:step=X:chunk_idx=0 represent, and why would a step loop on the daemon yet complete under --force?
  2. Is there a retry/quarantine path that re-arms the step without surfacing the underlying exception? Could the deferred failure be logged with a traceback?
  3. Is the flat-1.000 / schema-dominated recall an intended degraded mode under crisis_mode, or a side effect?

Happy to attach more logs or test a patch.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions