Refresh SessionStart cache during WAKE, not only after SLEEP #24

Closed
Marsu6996 wants to merge 1 commit into CodeAbra:main from Marsu6996:fix/session-start-cache-wake-refresh

Conversation

@Marsu6996

Contributor

Summary

The SessionStart precache (~/.iai-mcp/.session-start-payload.cached.md) is regenerated only inside the SLEEP branch of the lifecycle tick (_write_session_start_cache called from the sleep-pipeline post-hook). For a daemon that never reaches SLEEP — the common case during an active Claude/Codex session, where HID-idle and steady RPC traffic keep the wrappers warm above the SLEEP threshold — the file freezes for days while the store keeps ingesting records. Field evidence from one box: the cache was 5 days stale with +8 986 active records and zero session_start_cache_write_failed events in the ledger (the write never even started).

This PR adds a best-effort WAKE-time refresh and a default-on TTL safety net in the SessionStart shell hook.

Root cause

_write_session_start_cache(store) is called only when current is _LifecycleState.SLEEP. To enter SLEEP the daemon needs HID idle ≥ 30 min (or recent pmset sleep) and ≥ 30 s without RPC. Long-lived sessions almost never satisfy both at once. The hook then re-injects an arbitrarily old precache into every new session.
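
The gate described above can be restated as a small predicate. This is only a sketch of the stated thresholds: `IdleSnapshot`, `may_enter_sleep`, and the constant names are hypothetical and are not taken from the daemon's code.

```python
from dataclasses import dataclass

# Thresholds as quoted in the root-cause paragraph; the constant names are made up.
HID_IDLE_MIN_SEC = 30 * 60   # HID idle >= 30 min
RPC_QUIET_MIN_SEC = 30       # >= 30 s without RPC


@dataclass(frozen=True)
class IdleSnapshot:
    """Hypothetical view of what a lifecycle tick would observe."""
    hid_idle_sec: float
    rpc_quiet_sec: float
    recent_pmset_sleep: bool = False


def may_enter_sleep(snap: IdleSnapshot) -> bool:
    """Both conditions must hold in the same tick for SLEEP to be reachable."""
    machine_idle = snap.hid_idle_sec >= HID_IDLE_MIN_SEC or snap.recent_pmset_sleep
    rpc_quiet = snap.rpc_quiet_sec >= RPC_QUIET_MIN_SEC
    return machine_idle and rpc_quiet


# Active session: keyboard touched 12 s ago, last RPC 2 s ago.
active = IdleSnapshot(hid_idle_sec=12.0, rpc_quiet_sec=2.0)
# User walked away 45 min ago, but an agent keeps issuing RPCs.
agent_busy = IdleSnapshot(hid_idle_sec=45 * 60, rpc_quiet_sec=5.0)
# Machine idle and no RPC for a minute.
quiet = IdleSnapshot(hid_idle_sec=45 * 60, rpc_quiet_sec=60.0)

print(may_enter_sleep(active), may_enter_sleep(agent_busy), may_enter_sleep(quiet))
```

Only the third snapshot passes the gate, which is why a refresh that depends on SLEEP alone can be starved for as long as a session stays busy.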

Fix

  • _maybe_refresh_session_start_cache(store, *, trigger, ...) — best-effort WAKE refresh, gated by:

    • a single-flight threading.Lock (_session_start_cache_lock) — a tick that arrives while another refresh is in flight emits _skipped reason=refresh_in_progress and returns immediately;
    • an env-tunable min-interval (IAI_MCP_SESSION_CACHE_REFRESH_MIN_SEC, default 60 s);
    • a watermark sidecar JSON — records_count + monotone MAX(vec_label) + MAX(created_at) + MAX(updated_at). The monotone vec_label is what catches records inserted now with an old created_at (e.g. a backfilled transcript), which a MAX(created_at) comparison would miss;
    • _runtime_graph_cache_is_warm — wraps retrieve._runtime_graph_rebuild_needed (a disk read + COUNT(*), no child fleet). On cold cache the refresh skips with reason=runtime_graph_cache_cold rather than spawn a heavyweight rebuild inside a live tick.
  • Two new lifecycle-tick call sites:

    • right after pending_embeddings_wake_sequence when reembed_count > 0 or ingest_count > 0 — the reactive path catches every fresh embed batch;
    • on every WAKE/DROWSY tick — the periodic safety net catches everything else.
      Both go through asyncio.to_thread so memory_recall is never blocked.
  • SLEEP call keeps its semantics, now explicit: force_rebuild=True. We hold the EX lock during the consolidation window — the rebuild cost is acceptable there.

  • Rich telemetry: every refresh attempt emits one of:

    • session_start_cache_write_started{trigger, cache_path, force_rebuild}
    • session_start_cache_write_success{rendered_chars, records_count, max_vec_label, max_record_created_at, max_updated_at, duration_ms}
    • session_start_cache_write_skipped{reason ∈ {refresh_in_progress, runtime_graph_cache_cold, empty_render, min_interval_not_elapsed, no_new_records, meta_absent, cache_absent, probe_failed}}
    • session_start_cache_write_failed{reason: ExcType, error: str[:200], duration_ms} (severity=warning)
      Every emit is wrapped so an event-write failure cannot crash the refresh path; the refresh path itself is wrapped so it cannot crash the tick.
  • _default_session_start_cache_meta_path(cache_path) centralises the sidecar derivation so writer and reader cannot drift onto different filenames; a regression test pins the round-trip.

  • Hook TTL (default-on 12 h) in iai-mcp-session-recall.sh: IAI_MCP_SESSION_CACHE_MAX_AGE_SEC defaults to 43 200 s, non-numeric falls back to 43 200 s, set to 0 to disable. If the daemon is down or its regeneration silently fails, a multi-day-old precache no longer leaks into every new session. The legacy test_hook_serves_stale_cache is preserved by passing =0.
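
The hook itself is a shell script; the Python below only restates the TTL rules from the last bullet so they can be read and tested in one place. The function names are hypothetical. The handling of negative values and the inclusive age comparison are assumptions, not behaviour confirmed by this PR.

```python
import os
from typing import Mapping

DEFAULT_MAX_AGE_SEC = 43_200  # 12 h, the default stated in the description


def resolve_max_age_sec(env: Mapping[str, str]) -> int:
    """Read IAI_MCP_SESSION_CACHE_MAX_AGE_SEC with the documented fallbacks."""
    raw = env.get("IAI_MCP_SESSION_CACHE_MAX_AGE_SEC", "")
    try:
        value = int(raw)
    except ValueError:
        # Unset, empty, or non-numeric: fall back to the 12 h default.
        return DEFAULT_MAX_AGE_SEC
    # Assumption: a negative value is treated like garbage input.
    return value if value >= 0 else DEFAULT_MAX_AGE_SEC


def cache_is_servable(cache_mtime: float, now: float, max_age_sec: int) -> bool:
    """True if the precache may be injected into a new session."""
    if max_age_sec == 0:
        # 0 disables the TTL: legacy serve-regardless-of-age behaviour.
        return True
    # Assumption: an age exactly equal to the TTL still counts as fresh.
    return (now - cache_mtime) <= max_age_sec


if __name__ == "__main__":
    print(resolve_max_age_sec(os.environ))
```

With the default TTL, a precache whose mtime is five days old would be rejected and the hook would fall through to its live path; with the variable set to `0`, the same file would still be served.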

Tests

18 new cases in tests/test_session_start_cache_refresh.py covering:

  • watermark probe (including the old-created_at insert that MAX(created_at) would miss);
  • the _should_refresh decision matrix — cache_absent, min_interval_not_elapsed, meta_absent, no_new_records, watermark_changed;
  • _started / _success / _skipped / _failed telemetry shapes;
  • runtime-graph-cold skip + force_rebuild=True override (the SLEEP-path semantic);
  • single-flight under thread contention (one writes, the other gets refresh_in_progress);
  • end-to-end refresh without ever entering SLEEP;
  • hook TTL — default-on falls through, =0 serves stale, garbage env falls back to 12 h;
  • writer/reader sidecar-path round-trip (anti-divergence).
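
The old-`created_at` case in the first bullet is easy to see with a toy watermark. This is a minimal sketch: `Record`, `Watermark`, and `probe_watermark` are hypothetical stand-ins for the real store query and sidecar format, the timestamps are invented, and it assumes a backfill keeps its original `updated_at` as well.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical row shape; only the fields the watermark needs."""
    vec_label: int      # assumed to grow monotonically with each insert
    created_at: str     # ISO-8601 UTC, so string order equals time order
    updated_at: str


@dataclass(frozen=True)
class Watermark:
    records_count: int
    max_vec_label: Optional[int]
    max_created_at: Optional[str]
    max_updated_at: Optional[str]


def probe_watermark(records: Iterable[Record]) -> Watermark:
    """Compute the four watermark fields listed in the PR description."""
    rows = list(records)
    return Watermark(
        records_count=len(rows),
        max_vec_label=max((r.vec_label for r in rows), default=None),
        max_created_at=max((r.created_at for r in rows), default=None),
        max_updated_at=max((r.updated_at for r in rows), default=None),
    )


store = [
    Record(1, "2026-06-20T10:00:00Z", "2026-06-20T10:00:00Z"),
    Record(2, "2026-06-22T09:00:00Z", "2026-06-22T09:00:00Z"),
]
before = probe_watermark(store)

# A backfilled transcript: inserted now, but stamped with an old created_at.
store.append(Record(3, "2026-06-01T08:00:00Z", "2026-06-01T08:00:00Z"))
after = probe_watermark(store)

print(before.max_created_at == after.max_created_at)  # timestamps alone see no change
print(before != after)                                # the full watermark does
```

Comparing the whole watermark rather than a single timestamp is what lets the refresh notice the backfill.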

conftest.py is extended to monkeypatch the new SESSION_START_CACHE_META_PATH constant alongside the existing SESSION_START_CACHE_PATH.

Test plan

  • tests/test_session_start_cache_refresh.py — 18 new cases, all green
  • tests/test_session_recall_precache.py — non-regression, including the updated test_hook_serves_stale_cache opting into =0
  • tests/test_session_refresh_rpc.py — non-regression on the RPC refresh path
  • tests/test_frozen_constant_hermeticity.py — non-regression on the constant set
  • tests/test_consolidation_single_driver.py — non-regression on the consolidation harness
  • Total: 45/45 ✅
  • Manual prod validation: kickstarted the LaunchAgent on the patched code, observed _started and _success events with trigger="periodic_wake" and a freshly regenerated cache + sidecar, and observed cache-stale … -> live-cli in the hook log with a short TTL. Installed hook copies (~/.codex/hooks/, ~/.claude/hooks/) were patched in place to preserve a local guard; those edits are intentionally not included in this PR.

🤖 Designed by Marsu — Refined by Claude

…LEEP

The session-start precache (~/.iai-mcp/.session-start-payload.cached.md)
was regenerated only by the post-sleep-pipeline hook, so a daemon that
never reached SLEEP — the common case during an active Claude/Codex
session, where HID idle and RPC traffic keep the wrappers warm — left
the precache file frozen for days while the store kept ingesting
records. Field evidence: on one box the cache was 5 days stale with
+8986 active records and zero session_start_cache_write_failed events
in the ledger.

* Add `_maybe_refresh_session_start_cache`, a best-effort WAKE-time
  refresh gated by a single-flight `threading.Lock`, an env-tunable
  min-interval (`IAI_MCP_SESSION_CACHE_REFRESH_MIN_SEC`, default 60s),
  a watermark sidecar JSON (records_count + monotone MAX(vec_label) +
  MAX(created_at) + MAX(updated_at), so a record inserted now with an
  *old* created_at — a transcript backfill — still triggers a refresh),
  and a `_runtime_graph_cache_is_warm` probe that skips with reason
  `runtime_graph_cache_cold` rather than spawn a heavyweight rebuild on
  a live tick. Two new lifecycle-tick call sites: one right after
  `pending_embeddings_wake_sequence` when reembed/ingest moved (the
  reactive path) and one on every WAKE/DROWSY tick (the periodic safety
  net). The SLEEP-pipeline call now passes `force_rebuild=True` —
  inside its EXCLUSIVE lock the rebuild cost is fine.

* Rich telemetry on every refresh attempt:
  - `session_start_cache_write_started` (trigger, cache_path,
    force_rebuild)
  - `session_start_cache_write_success` (rendered_chars, records_count,
    max_vec_label, max_record_created_at, max_updated_at, duration_ms)
  - `session_start_cache_write_skipped` (reason ∈ {refresh_in_progress,
    runtime_graph_cache_cold, empty_render, min_interval_not_elapsed,
    no_new_records, meta_absent, cache_absent, probe_failed})
  - `session_start_cache_write_failed` (reason=ExcType, error, duration)
  Every emit is wrapped so an event-write failure cannot crash the
  refresh path, and the refresh path itself cannot crash the tick.

* `_default_session_start_cache_meta_path` centralises the sidecar
  derivation so the writer and the reader cannot drift onto different
  filenames; a regression test pins the round-trip.

* Hook safety net (iai-mcp-session-recall.sh): default-on 12h TTL via
  `IAI_MCP_SESSION_CACHE_MAX_AGE_SEC` (set to `0` to keep the legacy
  serve-regardless-of-age behaviour explicitly). If the daemon is down
  or its regeneration silently fails, a multi-day-old precache no
  longer leaks into every new session.

Tests: 18 new cases covering watermark probe (vec_label catches
old-created_at inserts), should-refresh decision matrix (cache_absent,
min_interval_not_elapsed, meta_absent, no_new_records,
watermark_changed), telemetry events, runtime-graph-cold skip,
force_rebuild override, single-flight under contention, end-to-end
refresh without SLEEP, hook TTL (default-on / disabled / garbage env),
and the writer/reader sidecar-path round-trip.

Validated in prod manually (daemon kickstarted on the patched code,
trigger=`periodic_wake` emitted a `_success` event 32s after restart,
hook TTL fall-through observed). The installed hooks under
~/.codex/hooks and ~/.claude/hooks were patched in place to preserve
the local JARVIS_NO_IAI guard; those edits are not committed here.

Designed by Marsu — Refined by Claude

Co-Authored-By: Claude <noreply@anthropic.com>
@Marsu6996 Marsu6996 marked this pull request as draft June 27, 2026 07:16
@Marsu6996

Contributor Author

Superseded — wrong base. This PR was opened from the v1.1.7 tag (detached HEAD), which turned out to be older than the current upstream main. A new PR is being prepared against the actual current base; this one will be closed once the replacement is open. Do NOT merge.

@Marsu6996

Contributor Author

Closing as superseded. The fix has been re-applied on top of the current upstream main (v1.2.0) — see PR #25, which is the canonical successor. This PR was based on a stale v1.1.7 checkout from a detached HEAD; cherry-picking onto main produced PR #25 with identical scope and 45/45 green tests.

@Marsu6996 Marsu6996 closed this Jun 27, 2026