Refresh SessionStart cache during WAKE, not only after SLEEP#24
Closed
Marsu6996 wants to merge 1 commit into
…LEEP
The session-start precache (~/.iai-mcp/.session-start-payload.cached.md)
was regenerated only by the post-sleep-pipeline hook, so a daemon that
never reached SLEEP — the common case during an active Claude/Codex
session, where HID idle and RPC traffic keep the wrappers warm — left
the precache file frozen for days while the store kept ingesting
records. Field evidence: on one box the cache was 5 days stale with
+8986 active records and zero session_start_cache_write_failed events
in the ledger.
* Add `_maybe_refresh_session_start_cache`, a best-effort WAKE-time
refresh gated by a single-flight `threading.Lock`, an env-tunable
min-interval (`IAI_MCP_SESSION_CACHE_REFRESH_MIN_SEC`, default 60s),
a watermark sidecar JSON (records_count + monotone MAX(vec_label) +
MAX(created_at) + MAX(updated_at), so a record inserted now with an
*old* created_at — a transcript backfill — still triggers a refresh),
and a `_runtime_graph_cache_is_warm` probe that skips with reason
`runtime_graph_cache_cold` rather than spawn a heavyweight rebuild on
a live tick. Two new lifecycle-tick call sites: one right after
`pending_embeddings_wake_sequence` when reembed/ingest moved (the
reactive path) and one on every WAKE/DROWSY tick (the periodic safety
net). The SLEEP-pipeline call now passes `force_rebuild=True` —
inside its EXCLUSIVE lock the rebuild cost is fine.
* Rich telemetry on every refresh attempt:
- `session_start_cache_write_started` (trigger, cache_path,
force_rebuild)
- `session_start_cache_write_success` (rendered_chars, records_count,
max_vec_label, max_record_created_at, max_updated_at, duration_ms)
- `session_start_cache_write_skipped` (reason ∈ {refresh_in_progress,
runtime_graph_cache_cold, empty_render, min_interval_not_elapsed,
no_new_records, meta_absent, cache_absent, probe_failed})
- `session_start_cache_write_failed` (reason=ExcType, error, duration)
Every emit is wrapped so an event-write failure cannot crash the
refresh path, and the refresh path itself cannot crash the tick.
* `_default_session_start_cache_meta_path` centralises the sidecar
derivation so the writer and the reader cannot drift onto different
filenames; a regression test pins the round-trip.
* Hook safety net (iai-mcp-session-recall.sh): default-on 12h TTL via
`IAI_MCP_SESSION_CACHE_MAX_AGE_SEC` (set to `0` to keep the legacy
serve-regardless-of-age behaviour explicitly). If the daemon is down
or its regeneration silently fails, a multi-day-old precache no
longer leaks into every new session.
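As an illustration of the single-flight and min-interval gates from the first bullet, here is a minimal Python sketch. `RefreshGate` and its method names are hypothetical and not taken from the patch; only the skip reasons `refresh_in_progress` and `min_interval_not_elapsed` come from the text above. The sketch assumes the interval is measured from the last completed refresh and that the lock is checked before the interval, neither of which the message states.

```python
import threading
import time


class RefreshGate:
    """Hypothetical helper: one refresh at a time, at most one per interval."""

    def __init__(self, min_interval_sec=60.0):
        self._lock = threading.Lock()
        self._min_interval_sec = min_interval_sec
        self._last_done = None  # monotonic timestamp of the last completed refresh

    def run(self, do_refresh, now=None):
        """Call do_refresh() if allowed; return "refreshed" or a skip reason."""
        now = time.monotonic() if now is None else now
        # Non-blocking acquire: a caller that loses the race skips instead of waiting.
        if not self._lock.acquire(blocking=False):
            return "refresh_in_progress"
        try:
            too_soon = (
                self._last_done is not None
                and now - self._last_done < self._min_interval_sec
            )
            if too_soon:
                return "min_interval_not_elapsed"
            do_refresh()
            self._last_done = now
            return "refreshed"
        finally:
            # Always release, even if do_refresh() raised.
            self._lock.release()


if __name__ == "__main__":
    gate = RefreshGate(min_interval_sec=60.0)
    for t in (0.0, 30.0, 61.0):
        print(t, gate.run(lambda: None, now=t))
```

The non-blocking `acquire` is what lets a contended tick return immediately rather than queue behind the refresh that is already running.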
Tests: 18 new cases covering watermark probe (vec_label catches
old-created_at inserts), should-refresh decision matrix (cache_absent,
min_interval_not_elapsed, meta_absent, no_new_records,
watermark_changed), telemetry events, runtime-graph-cold skip,
force_rebuild override, single-flight under contention, end-to-end
refresh without SLEEP, hook TTL (default-on / disabled / garbage env),
and the writer/reader sidecar-path round-trip.
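The watermark case listed above (a new record carrying an old `created_at`) can be illustrated with a small Python sketch. The `Watermark` dataclass, the `probe` function and the in-memory record dicts are hypothetical stand-ins for the real store query. The sketch also assumes that `vec_label` is an integer that only grows and that timestamps are ISO-8601 strings in a single format, so that string comparison matches time order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Watermark:
    """Hypothetical shape of the sidecar contents."""
    records_count: int
    max_vec_label: int
    max_created_at: str
    max_updated_at: str


def probe(records):
    """Compute a watermark from a list of record dicts (stand-in for a store query)."""
    return Watermark(
        records_count=len(records),
        max_vec_label=max((r["vec_label"] for r in records), default=-1),
        max_created_at=max((r["created_at"] for r in records), default=""),
        max_updated_at=max((r["updated_at"] for r in records), default=""),
    )


if __name__ == "__main__":
    store = [
        {"vec_label": 1, "created_at": "2024-05-01T10:00:00Z", "updated_at": "2024-05-01T10:00:00Z"},
        {"vec_label": 2, "created_at": "2024-05-02T10:00:00Z", "updated_at": "2024-05-02T10:00:00Z"},
    ]
    before = probe(store)

    # A backfilled transcript: inserted now, but stamped with an old created_at.
    store.append(
        {"vec_label": 3, "created_at": "2024-01-01T00:00:00Z", "updated_at": "2024-01-01T00:00:00Z"}
    )
    after = probe(store)

    print("MAX(created_at) unchanged:", before.max_created_at == after.max_created_at)
    print("watermark changed:", before != after)
```

Comparing the whole watermark rather than `MAX(created_at)` alone is what lets the backfilled insert register as a change.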
Validated in prod manually (daemon kickstarted on the patched code,
trigger=`periodic_wake` emitted a `_success` event 32s after restart,
hook TTL fall-through observed). The installed hooks under
~/.codex/hooks and ~/.claude/hooks were patched in place to preserve
the local JARVIS_NO_IAI guard; those edits are not committed here.
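To make the hook TTL fall-through mentioned above concrete, the rule can be pictured with the following Python sketch. The real hook is a shell script; the function names here are hypothetical, and treating an empty value as garbage and using an inclusive age comparison are assumptions. Only the 12 h default, the fallback for non-numeric values and the `0` opt-out come from this message.

```python
import re

DEFAULT_MAX_AGE_SEC = 43_200  # 12 h, the default stated for the hook


def parse_max_age(raw):
    """Parse the env value; unset or non-numeric input falls back to the default."""
    if raw is None:
        return DEFAULT_MAX_AGE_SEC
    raw = raw.strip()
    # Accept plain ASCII digits only; anything else counts as garbage (assumption).
    if not re.fullmatch(r"[0-9]+", raw):
        return DEFAULT_MAX_AGE_SEC
    return int(raw)


def cache_is_servable(age_sec, max_age_sec):
    """Return True if a precache of the given age may be served."""
    if max_age_sec == 0:
        return True  # TTL disabled: legacy serve-regardless-of-age behaviour
    return age_sec <= max_age_sec


if __name__ == "__main__":
    five_days = 5 * 86_400
    for raw in (None, "garbage", "0", "3600"):
        ttl = parse_max_age(raw)
        print(repr(raw), ttl, cache_is_servable(five_days, ttl))
```

In the scenario from the field evidence, a 5-day-old cache would be rejected under the default TTL and served only when the variable is explicitly set to `0`.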
Designed by Marsu — Refined by Claude
Co-Authored-By: Claude <noreply@anthropic.com>
Superseded: wrong base. This PR was opened from the v1.1.7 tag (detached HEAD), which turned out to be older than the current upstream main. A new PR is being prepared against the actual current base; this one will be closed once the replacement is open. Do NOT merge.
Summary
The SessionStart precache (`~/.iai-mcp/.session-start-payload.cached.md`) is regenerated only inside the SLEEP branch of the lifecycle tick (`_write_session_start_cache`, called from the sleep-pipeline post-hook). For a daemon that never reaches SLEEP (the common case during an active Claude/Codex session, where HID-idle and steady RPC traffic keep the wrappers warm above the SLEEP threshold) the file freezes for days while the store keeps ingesting records. Field evidence from one box: the cache was 5 days stale with +8 986 active records and zero `session_start_cache_write_failed` events in the ledger (the write never even started).
This PR adds a best-effort WAKE-time refresh and a default-on TTL safety net in the SessionStart shell hook.
Root cause
`_write_session_start_cache(store)` is called only when `current is _LifecycleState.SLEEP`. To enter SLEEP the daemon needs HID idle ≥ 30 min (or a recent pmset sleep) and ≥ 30 s without RPC. Long-lived sessions almost never satisfy both at once. The hook then re-injects an arbitrarily old precache into every new session.
Fix
`_maybe_refresh_session_start_cache(store, *, trigger, ...)`: best-effort WAKE refresh, gated by:
* a single-flight `threading.Lock` (`_session_start_cache_lock`): a tick that arrives while another refresh is in flight emits `_skipped reason=refresh_in_progress` and returns immediately;
* an env-tunable min-interval (`IAI_MCP_SESSION_CACHE_REFRESH_MIN_SEC`, default 60 s);
* a watermark sidecar JSON: `records_count` + monotone `MAX(vec_label)` + `MAX(created_at)` + `MAX(updated_at)`. The monotone `vec_label` is what catches records inserted now with an old `created_at` (e.g. a backfilled transcript), which a `MAX(created_at)` comparison would miss;
* `_runtime_graph_cache_is_warm`: wraps `retrieve._runtime_graph_rebuild_needed` (a disk read + `COUNT(*)`, no child fleet). On cold cache the refresh skips with `reason=runtime_graph_cache_cold` rather than spawn a heavyweight rebuild inside a live tick.
Two new lifecycle-tick call sites:
* right after `pending_embeddings_wake_sequence` when `reembed_count > 0` or `ingest_count > 0`: the reactive path catches every fresh embed batch;
* on every WAKE/DROWSY tick: the periodic safety net.
Both go through `asyncio.to_thread` so `memory_recall` is never blocked.
SLEEP call keeps its semantics, now explicit: `force_rebuild=True`. We hold the EX lock during the consolidation window, so the rebuild cost is acceptable there.
Rich telemetry, every refresh attempt emits one of:
* `session_start_cache_write_started`: `{trigger, cache_path, force_rebuild}`
* `session_start_cache_write_success`: `{rendered_chars, records_count, max_vec_label, max_record_created_at, max_updated_at, duration_ms}`
* `session_start_cache_write_skipped`: `{reason ∈ {refresh_in_progress, runtime_graph_cache_cold, empty_render, min_interval_not_elapsed, no_new_records, meta_absent, cache_absent, probe_failed}}`
* `session_start_cache_write_failed`: `{reason: ExcType, error: str[:200], duration_ms}` (severity=warning)
Every emit is wrapped so an event-write failure cannot crash the refresh path; the refresh path itself is wrapped so it cannot crash the tick.
`_default_session_start_cache_meta_path(cache_path)` centralises the sidecar derivation so writer and reader cannot drift onto different filenames; a regression test pins the round-trip.
Hook TTL (default-on 12 h) in `iai-mcp-session-recall.sh`: `IAI_MCP_SESSION_CACHE_MAX_AGE_SEC` defaults to 43 200 s, non-numeric falls back to 43 200 s, set to `0` to disable. If the daemon is down or its regeneration silently fails, a multi-day-old precache no longer leaks into every new session. The legacy `test_hook_serves_stale_cache` is preserved by passing `=0`.
Tests
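18 new cases in `tests/test_session_start_cache_refresh.py` covering:
* watermark probe (`vec_label` catches an old-`created_at` insert that `MAX(created_at)` would miss);
* `_should_refresh` decision matrix: `cache_absent`, `min_interval_not_elapsed`, `meta_absent`, `no_new_records`, `watermark_changed`;
* `_started` / `_success` / `_skipped` / `_failed` telemetry shapes;
* runtime-graph-cold skip;
* `force_rebuild=True` override (the SLEEP-path semantic);
* single-flight under contention (`refresh_in_progress`);
* end-to-end refresh without SLEEP;
* hook TTL: default-on, `=0` serves stale, garbage env falls back to 12 h;
* writer/reader sidecar-path round-trip.
`conftest.py` is extended to monkeypatch the new `SESSION_START_CACHE_META_PATH` constant alongside the existing `SESSION_START_CACHE_PATH`.
Test plan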
* `tests/test_session_start_cache_refresh.py`: 18 new cases, all green
* `tests/test_session_recall_precache.py`: non-regression, including the updated `test_hook_serves_stale_cache` opting into `=0`
* `tests/test_session_refresh_rpc.py`: non-regression on the RPC refresh path
* `tests/test_frozen_constant_hermeticity.py`: non-regression on the constant set
* `tests/test_consolidation_single_driver.py`: non-regression on the consolidation harness
* Manual validation in prod: daemon kickstarted on the patched code, `_started` → `_success` events with `trigger="periodic_wake"` and a freshly regenerated cache + sidecar, observed `cache-stale … -> live-cli` in the hook log with a short TTL. Installed hook copies (`~/.codex/hooks/`, `~/.claude/hooks/`) were patched in place to preserve a local guard; those edits are intentionally not included in this PR.
🤖 Designed by Marsu - Refined by Claude