Fix stale SessionStart cache during active WAKE sessions #25

Open
Marsu6996 wants to merge 1 commit into CodeAbra:main from Marsu6996:fix/session-start-cache-wake-refresh-v120

@Marsu6996

Contributor

Summary

  • refresh the SessionStart cache from WAKE/DROWSY, not only after SLEEP
  • add single-flight, min-interval and watermark sidecar guards for cheap best-effort refreshes
  • make the SessionStart hook TTL default to 12h so stale caches fall through to live recall
  • add coverage for refresh decisions, telemetry, sidecar path consistency, TTL fallback and failure cases

Validation

  • .venv/bin/python -m pytest tests/test_session_start_cache_refresh.py tests/test_session_recall_precache.py tests/test_session_refresh_rpc.py tests/test_frozen_constant_hermeticity.py tests/test_consolidation_single_driver.py
  • 45 passed in 13.25s

Notes

  • This branch is based on origin/main at v1.2.0.
  • It supersedes the earlier branch that was accidentally based on v1.1.7.
  • Local production smoke testing showed periodic_wake writes the cache successfully, with session_start_cache_write_started -> session_start_cache_write_success and no failures.

…LEEP

The session-start precache (~/.iai-mcp/.session-start-payload.cached.md)
was regenerated only by the post-sleep-pipeline hook, so a daemon that
never reached SLEEP — the common case during an active Claude/Codex
session, where HID idle and RPC traffic keep the wrappers warm — left
the precache file frozen for days while the store kept ingesting
records. Field evidence: on one box the cache was 5 days stale with
+8986 active records and zero session_start_cache_write_failed events
in the ledger.

* Add `_maybe_refresh_session_start_cache`, a best-effort WAKE-time
  refresh gated by a single-flight `threading.Lock`, an env-tunable
  min-interval (`IAI_MCP_SESSION_CACHE_REFRESH_MIN_SEC`, default 60s),
  a watermark sidecar JSON (records_count + monotone MAX(vec_label) +
  MAX(created_at) + MAX(updated_at), so a record inserted now with an
  *old* created_at — a transcript backfill — still triggers a refresh),
  and a `_runtime_graph_cache_is_warm` probe that skips with reason
  `runtime_graph_cache_cold` rather than spawn a heavyweight rebuild on
  a live tick. Two new lifecycle-tick call sites: one right after
  `pending_embeddings_wake_sequence` when reembed/ingest moved (the
  reactive path) and one on every WAKE/DROWSY tick (the periodic safety
  net). The SLEEP-pipeline call now passes `force_rebuild=True` —
  inside its EXCLUSIVE lock the rebuild cost is fine.

* Rich telemetry on every refresh attempt:
  - `session_start_cache_write_started` (trigger, cache_path,
    force_rebuild)
  - `session_start_cache_write_success` (rendered_chars, records_count,
    max_vec_label, max_record_created_at, max_updated_at, duration_ms)
  - `session_start_cache_write_skipped` (reason ∈ {refresh_in_progress,
    runtime_graph_cache_cold, empty_render, min_interval_not_elapsed,
    no_new_records, meta_absent, cache_absent, probe_failed})
  - `session_start_cache_write_failed` (reason=ExcType, error, duration)
  Every emit is wrapped so an event-write failure cannot crash the
  refresh path, and the refresh path itself cannot crash the tick.

* `_default_session_start_cache_meta_path` centralises the sidecar
  derivation so the writer and the reader cannot drift onto different
  filenames; a regression test pins the round-trip.

* Hook safety net (iai-mcp-session-recall.sh): default-on 12h TTL via
  `IAI_MCP_SESSION_CACHE_MAX_AGE_SEC` (set to `0` to keep the legacy
  serve-regardless-of-age behaviour explicitly). If the daemon is down
  or its regeneration silently fails, a multi-day-old precache no
  longer leaks into every new session.

Tests: 18 new cases covering watermark probe (vec_label catches
old-created_at inserts), should-refresh decision matrix (cache_absent,
min_interval_not_elapsed, meta_absent, no_new_records,
watermark_changed), telemetry events, runtime-graph-cold skip,
force_rebuild override, single-flight under contention, end-to-end
refresh without SLEEP, hook TTL (default-on / disabled / garbage env),
and the writer/reader sidecar-path round-trip.
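
For readers who want a concrete picture of the decision matrix these tests exercise, here is a minimal sketch. It is not the implementation in this PR: the function names, the in-memory record list, the order of the checks, and the choice to treat `cache_absent` and `meta_absent` as reasons to refresh are all assumptions. Only the reason strings and the watermark field names come from the text above.

```python
from typing import Optional

WATERMARK_KEYS = ("records_count", "max_vec_label",
                  "max_created_at", "max_updated_at")


def watermark(records: list[dict]) -> dict:
    """Summarise the store; max_vec_label grows on insert whatever the created_at."""
    return {
        "records_count": len(records),
        "max_vec_label": max((r["vec_label"] for r in records), default=0),
        "max_created_at": max((r["created_at"] for r in records), default=""),
        "max_updated_at": max((r["updated_at"] for r in records), default=""),
    }


def should_refresh(cache_exists: bool,
                   meta: Optional[dict],
                   current: dict,
                   seconds_since_last: float,
                   min_interval_sec: float = 60.0) -> tuple[bool, str]:
    """Return (refresh?, reason). The ordering of checks is an assumption."""
    if not cache_exists:
        return True, "cache_absent"
    if seconds_since_last < min_interval_sec:
        return False, "min_interval_not_elapsed"
    if meta is None:
        return True, "meta_absent"
    if all(meta.get(k) == current.get(k) for k in WATERMARK_KEYS):
        return False, "no_new_records"
    return True, "watermark_changed"


# A transcript backfill: inserted now, but carrying an old created_at.
before = [{"vec_label": 1, "created_at": "2026-06-20", "updated_at": "2026-06-20"}]
after = before + [{"vec_label": 2, "created_at": "2026-01-01",
                   "updated_at": "2026-01-01"}]
stored = watermark(before)
live = watermark(after)
decision = should_refresh(True, stored, live, seconds_since_last=120.0)
```

In this example `max_created_at` is unchanged by the backfilled record, while `max_vec_label` moves from 1 to 2, so the comparison still reports `watermark_changed`.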

Validated in prod manually (daemon kickstarted on the patched code,
trigger=`periodic_wake` emitted a `_success` event 32s after restart,
hook TTL fall-through observed). The installed hooks under
~/.codex/hooks and ~/.claude/hooks were patched in place to preserve
the local JARVIS_NO_IAI guard; those edits are not committed here.

Designed by Marsu — Refined by Claude

Co-Authored-By: Claude <noreply@anthropic.com>

Marsu6996 marked this pull request as draft June 27, 2026 07:45
@Marsu6996

Contributor Author

Do NOT merge yet — pending local upgrade validation. The fix has been cross-reviewed by Codex (45/45 tests green, scope clean) and a smoke test on the previous v1.1.7 base showed the WAKE refresh works in prod. Before merging, the local install will be upgraded to clean v1.2.0, the bug confirmed there, then this candidate redeployed and validated end-to-end. Lifting this draft only after that.

Marsu6996 marked this pull request as ready for review June 27, 2026 07:57
@Marsu6996

Contributor Author

Local validation summary

Deployed and exercised end-to-end against a real install on macOS.

  • Install upgraded from v1.1.7 to v1.2.0 + this candidate fix via `pip install --upgrade --no-deps .` on the branch tip.
  • Daemon module hash (.venv/.../iai_mcp/daemon/__init__.py) = 9ddc7d59798764819ad5b5c7f298fb98d553dda3 — matches git show 98d606c:src/iai_mcp/daemon/__init__.py exactly.
  • All five new helpers resolve at runtime in the installed module: _maybe_refresh_session_start_cache, _session_start_cache_watermark, _runtime_graph_cache_is_warm, _default_session_start_cache_meta_path, SESSION_START_CACHE_META_PATH.
  • Smoke after kickstart on the candidate:
    • session_start_cache_write_started: 1
    • session_start_cache_write_success: 1 (52s after kickstart)
    • session_start_cache_write_failed: 0
    • trigger="periodic_wake" — the WAKE-side path did the work; no SLEEP cycle needed.
  • Canonical sidecar ~/.iai-mcp/.session-start-payload.cached.meta.json present (not the legacy .cached.md.meta.json), watermark fields populated: records_count, max_vec_label, max_created_at, max_updated_at, rendered_chars, generated_at, trigger.
  • Hooks ~/.claude/hooks/iai-mcp-session-recall.sh and ~/.codex/hooks/iai-mcp-session-recall.sh manually harmonized locally (identical bytes, exec mode 0o755) so both carry the v1.2.0 hook source from this PR + TTL safety net + the local JARVIS_NO_IAI / JARVIS_NO_IAI_RECALL guards. The Jarvis guards are local-only — they are not part of this PR.
  • Targeted tests locally: 45/45 passed (tests/test_session_start_cache_refresh.py + the four non-regression suites).
  • Hook smoke validated against the live daemon:
    • IAI_MCP_SESSION_CACHE_MAX_AGE_SEC=1 + fresh 39s cache → log cache-stale age=39s max=1s -> live-cli, 0 bytes emitted (both hook copies behave identically).
    • IAI_MCP_SESSION_CACHE_MAX_AGE_SEC=3600 + same cache → cache-hit age=39s bytes=7815.
    • JARVIS_NO_IAI=1 → guard short-circuits, 0 bytes, CLI never invoked.
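
The TTL behaviour observed in the hook smoke test can be summarised with a small sketch. The real hook is a shell script; this Python version only mirrors the decision, and the function names are hypothetical. Falling back to the 12h default on an unparsable or negative value is an assumption about how a garbage env value is handled, not something stated in this thread.

```python
from typing import Optional

DEFAULT_MAX_AGE_SEC = 12 * 60 * 60  # 12h default described in the PR


def parse_max_age(raw: Optional[str]) -> int:
    """Parse IAI_MCP_SESSION_CACHE_MAX_AGE_SEC; 0 keeps serve-regardless-of-age."""
    if raw is None:
        return DEFAULT_MAX_AGE_SEC
    try:
        value = int(raw)
    except ValueError:
        # Assumption: unparsable values fall back to the default TTL.
        return DEFAULT_MAX_AGE_SEC
    return value if value >= 0 else DEFAULT_MAX_AGE_SEC


def cache_decision(age_sec: int, max_age_sec: int) -> str:
    """Return 'cache-hit' to serve the precache, 'cache-stale' to fall through."""
    if max_age_sec == 0:
        return "cache-hit"  # legacy behaviour: age is ignored
    return "cache-hit" if age_sec <= max_age_sec else "cache-stale"


# The two smoke-test cases above, with a 39s-old cache.
strict = cache_decision(39, parse_max_age("1"))
relaxed = cache_decision(39, parse_max_age("3600"))
```

With these inputs `strict` is `cache-stale` and `relaxed` is `cache-hit`, matching the two log lines reported in the smoke test.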

Daemon process is stable (0% CPU, ~3.5% MEM, etime growing nominally, state=running, last exit code=0 from the previous PID).

Not merging from my side — leaving merge to whoever owns the upstream protocol. Ready for review.

@Marsu6996

Contributor Author

CI note: red check is pre-existing on main, not introduced by this PR

The macOS build & test job is red here, but the failures are unrelated to the change:

  • The 28 tests in this PR's scope all pass: tests/test_session_recall_precache.py 6/6 and tests/test_session_start_cache_refresh.py 22/22 (visible in the run log).
  • The 51 failures are concentrated on socket / daemon dispatcher / launchd FD tests, with three patterns:
    • OSError: AF_UNIX path too long — pytest tmp_path on the GitHub runner produces paths like /private/var/folders/mn/.../sock/d.sock that exceed the 104-byte sun_path limit.
    • AssertionError: socket never bound / daemon did not bind socket within 30s — downstream effect of the same.
    • NameError: name 'tmp_path' is not defined in a few tests (test_mcp_tools.py, test_socket_disconnect_reconnect.py, test_socket_inherit_launchd_fd.py) — fixture not declared in the test signature.

The same job has been failing on main itself since the v1.2.0 release commit:

2026-06-26  Release v1.2.0                               failure
2026-06-25  Update release badge to v1.1.7 in zh README  success
2026-06-23  Add Chinese README                           success

(see https://github.com/CodeAbra/iai-personal-memory-engine/actions/runs/28267001306)

Rebasing this PR on main will inherit the same red check until that's addressed upstream.

Happy to open a separate small PR fixing the tmp_path fixture omissions and adding a shorter-prefix override for the AF_UNIX cases if useful — those look orthogonal to this change but trivially fixable. Let me know.
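
As a rough idea of what the shorter-prefix override could look like, here is a minimal sketch, assuming the tests only need some short, unique directory for the socket. The helper names and the `/tmp` base directory are assumptions, not part of the existing test suite; the 104-byte figure is the macOS limit quoted above.

```python
import os
import shutil
import socket
import tempfile

SUN_PATH_LIMIT = 104  # macOS sun_path size quoted above; other platforms differ


def short_socket_dir(prefix: str = "iai-") -> str:
    """Create a unique directory under /tmp so the socket path stays short."""
    return tempfile.mkdtemp(prefix=prefix, dir="/tmp")


def bind_unix_socket(path: str) -> socket.socket:
    """Bind an AF_UNIX stream socket, failing early if the path is too long."""
    # The path plus its trailing NUL byte must fit inside sun_path.
    if len(os.fsencode(path)) >= SUN_PATH_LIMIT:
        raise OSError(f"AF_UNIX path too long ({len(path)} bytes): {path}")
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
    except OSError:
        sock.close()
        raise
    return sock


# Usage: bind in a short directory, then clean up.
sock_dir = short_socket_dir()
sock_path = os.path.join(sock_dir, "d.sock")
sock = bind_unix_socket(sock_path)
bound = os.path.exists(sock_path)
sock.close()
shutil.rmtree(sock_dir)
```

Checking the length before `bind` turns the opaque "socket never bound" timeout into an immediate, descriptive error.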
