Sleep daemon stuck in crisis_mode: cycle loops on a single step (never completes) while `sleep-cycle --force` works; served recall degrades to flat 1.000 / schema records

## Summary

On the launchd sleep daemon, the consolidation cycle gets stuck in `crisis_mode: true` and **re-starts a single step every ~37s without ever completing it**. The looping step *varies across daemon restarts* — observed in sequence: `KNOB_TUNE`, then `HIPPO_CLEANUP`, then `SCHEMA_MINE`. CPU sits >100% for hours.

Crucially, `iai-mcp maintenance sleep-cycle --force` runs **all 13 steps in ~2s** and clears `crisis_mode`. So the step logic itself works; something in the daemon's scheduling/retry path never advances past `chunk_idx=0`.

While stuck, **recall served by the daemon degrades** to flat `1.000` scores dominated by `Schema: tags:…` records — even for cues whose content is definitely in the store.

## Environment
- `iai` 1.1.2 (latest), `hnswlib` 0.8.0, Python 3.11.15, macOS 26.5.1
- launchd daemon `com.iai-mcp.daemon`

## Symptoms

`lifecycle_state.json`:
```
"current_state": "SLEEP",
"sleep_cycle_progress": { "attempt": 0, "last_error": "deferred:step=SCHEMA_MINE:chunk_idx=0" },
"crisis_mode": true
```

`sleep_step_started` vs `sleep_step_completed` counts from a single day's `lifecycle-events-*.jsonl`:

| step | started | completed |
|---|---|---|
| `KNOB_TUNE` | 407 | 2 |
| `HIPPO_CLEANUP` | 530 | 2 |
| `SCHEMA_MINE` | 678 | 2 |

(the 2 completions per step are manual `--force` runs; the daemon itself never completes them.)

There is **no traceback** in `~/.iai-mcp/logs/launchd-stderr.log` for the looping step — the failure is "deferred" (`deferred:step=…:chunk_idx=0`), so the standard logs contain nothing actionable about *why* the step fails.

## Key signal: daemon vs. `--force`
- `iai-mcp maintenance sleep-cycle --force --reset-quarantine` → all 13 steps `ok` in ~1.7–2.4s, `crisis_mode: false`.
- The daemon on its normal schedule re-enters the loop within the same day, on a *different* step each time.

## Recall degradation while stuck
- Recall **via the daemon** (in `crisis_mode`): flat `1.000` scores, results dominated by `Schema:` pattern records, regardless of cue.
- Direct/offline recall (daemon stopped, fresh index from a forced cycle): **discriminating** scores (~0.73–0.78) with real content.

This suggests the looping `SCHEMA_MINE` (which mints schema records) + the persistent `crisis_mode` are what degrade served recall.

## Possibly related
- #16 (deferred-captures vs drain race): same "deferred" subsystem, but a different trigger (bulk parallel import). Not the cause here — no `permanent-failed` files, no bulk import.
- #15 fixed one concrete instance: `hippo_hnsw_rebuild_failed` (`load_index` dropping `allow_replace_deleted`) was the original `KNOB_TUNE` blocker. After that fix `KNOB_TUNE` no longer raises it, but the daemon cycle still fails to advance — now surfacing on later steps.

## Questions / asks
1. What does `deferred:step=X:chunk_idx=0` represent, and why would a step loop on the daemon yet complete under `--force`?
2. Is there a retry/quarantine path that re-arms the step without surfacing the underlying exception? Could the deferred failure be logged with a traceback?
3. Is the flat-`1.000` / schema-dominated recall an intended degraded mode under `crisis_mode`, or a side effect?

Happy to attach more logs or test a patch.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sleep daemon stuck in crisis_mode: cycle loops on a single step (never completes) while `sleep-cycle --force` works; served recall degrades to flat 1.000 / schema records #17

Summary

Environment

Symptoms

Key signal: daemon vs. `--force`

Recall degradation while stuck

Possibly related

Questions / asks

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Sleep daemon stuck in crisis_mode: cycle loops on a single step (never completes) while sleep-cycle --force works; served recall degrades to flat 1.000 / schema records #17

Description

Summary

Environment

Symptoms

Key signal: daemon vs. --force

Recall degradation while stuck

Possibly related

Questions / asks

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Sleep daemon stuck in crisis_mode: cycle loops on a single step (never completes) while `sleep-cycle --force` works; served recall degrades to flat 1.000 / schema records #17

Key signal: daemon vs. `--force`