Fix stale SessionStart cache during active WAKE sessions #25
…LEEP
The session-start precache (~/.iai-mcp/.session-start-payload.cached.md)
was regenerated only by the post-sleep-pipeline hook, so a daemon that
never reached SLEEP — the common case during an active Claude/Codex
session, where HID idle and RPC traffic keep the wrappers warm — left
the precache file frozen for days while the store kept ingesting
records. Field evidence: on one box the cache was 5 days stale with
+8986 active records and zero session_start_cache_write_failed events
in the ledger.
* Add `_maybe_refresh_session_start_cache`, a best-effort WAKE-time
refresh gated by a single-flight `threading.Lock`, an env-tunable
min-interval (`IAI_MCP_SESSION_CACHE_REFRESH_MIN_SEC`, default 60s),
a watermark sidecar JSON (records_count + monotone MAX(vec_label) +
MAX(created_at) + MAX(updated_at), so a record inserted now with an
*old* created_at — a transcript backfill — still triggers a refresh),
and a `_runtime_graph_cache_is_warm` probe that skips with reason
`runtime_graph_cache_cold` rather than spawn a heavyweight rebuild on
a live tick. Two new lifecycle-tick call sites: one right after
`pending_embeddings_wake_sequence` when reembed/ingest moved (the
reactive path) and one on every WAKE/DROWSY tick (the periodic safety
net). The SLEEP-pipeline call now passes `force_rebuild=True` —
inside its EXCLUSIVE lock the rebuild cost is fine.
* Rich telemetry on every refresh attempt:
- `session_start_cache_write_started` (trigger, cache_path,
force_rebuild)
- `session_start_cache_write_success` (rendered_chars, records_count,
max_vec_label, max_record_created_at, max_updated_at, duration_ms)
- `session_start_cache_write_skipped` (reason ∈ {refresh_in_progress,
runtime_graph_cache_cold, empty_render, min_interval_not_elapsed,
no_new_records, meta_absent, cache_absent, probe_failed})
- `session_start_cache_write_failed` (reason=ExcType, error, duration)
Every emit is wrapped so an event-write failure cannot crash the
refresh path, and the refresh path itself cannot crash the tick.
* `_default_session_start_cache_meta_path` centralises the sidecar
derivation so the writer and the reader cannot drift onto different
filenames; a regression test pins the round-trip.
* Hook safety net (iai-mcp-session-recall.sh): default-on 12h TTL via
`IAI_MCP_SESSION_CACHE_MAX_AGE_SEC` (set to `0` to keep the legacy
serve-regardless-of-age behaviour explicitly). If the daemon is down
or its regeneration silently fails, a multi-day-old precache no
longer leaks into every new session.
Tests: 18 new cases covering watermark probe (vec_label catches
old-created_at inserts), should-refresh decision matrix (cache_absent,
min_interval_not_elapsed, meta_absent, no_new_records,
watermark_changed), telemetry events, runtime-graph-cold skip,
force_rebuild override, single-flight under contention, end-to-end
refresh without SLEEP, hook TTL (default-on / disabled / garbage env),
and the writer/reader sidecar-path round-trip.
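For illustration, the three hook TTL cases listed above (default-on, disabled, garbage env) could be modelled like this. The real hook is a shell script; this Python sketch only mirrors the policy. `resolve_max_age` and `cache_is_servable` are hypothetical names, and falling back to the 12h default on an unparsable or negative value is an assumption, since the description does not state what the hook does with garbage input.

```python
import os
import time
from typing import Mapping, Optional

DEFAULT_MAX_AGE_SEC = 12 * 60 * 60  # the 12h default-on TTL


def resolve_max_age(env: Optional[Mapping[str, str]] = None) -> int:
    """Read the TTL from the environment; 0 means 'serve regardless of age'."""
    env = os.environ if env is None else env
    raw = env.get("IAI_MCP_SESSION_CACHE_MAX_AGE_SEC", "")
    try:
        value = int(raw)
    except ValueError:
        # Unset or garbage: assumed here to fall back to the default.
        return DEFAULT_MAX_AGE_SEC
    # Negative values are also treated as garbage in this sketch.
    return value if value >= 0 else DEFAULT_MAX_AGE_SEC


def cache_is_servable(
    cache_mtime: float,
    env: Optional[Mapping[str, str]] = None,
    now: Optional[float] = None,
) -> bool:
    """True if the precache is young enough to serve, or the TTL is disabled."""
    max_age = resolve_max_age(env)
    if max_age == 0:
        return True  # legacy behaviour, opted into explicitly
    now = time.time() if now is None else now
    return (now - cache_mtime) <= max_age
```

Under these assumptions a five-day-old precache is rejected by default and served only when the variable is explicitly set to `0`.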
Validated in prod manually (daemon kickstarted on the patched code,
trigger=`periodic_wake` emitted a `_success` event 32s after restart,
hook TTL fall-through observed). The installed hooks under
~/.codex/hooks and ~/.claude/hooks were patched in place to preserve
the local JARVIS_NO_IAI guard; those edits are not committed here.
Designed by Marsu — Refined by Claude
Co-Authored-By: Claude <noreply@anthropic.com>
Do NOT merge yet: pending local upgrade validation. The fix has been cross-reviewed by Codex (45/45 tests green, scope clean), and a smoke test on the previous v1.1.7 base showed the WAKE refresh works in prod. Before merging, the local install will be upgraded to clean v1.2.0, the bug confirmed there, then this candidate redeployed and validated end-to-end. The draft status will be lifted only after that.
Local validation summary
Deployed and exercised end-to-end against a real install on macOS.
Daemon process is stable (0% CPU, ~3.5% MEM, etime growing nominally).
Not merging from my side; leaving the merge to whoever owns the upstream protocol. Ready for review.
CI note: red check is pre-existing on